* [PATCH 0/2] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU @ 2021-03-19 23:20 Sean Christopherson 2021-03-19 23:20 ` [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson 2021-03-19 23:20 ` [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping Sean Christopherson 0 siblings, 2 replies; 10+ messages in thread From: Sean Christopherson @ 2021-03-19 23:20 UTC (permalink / raw) To: Paolo Bonzini Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, linux-kernel, Ben Gardon Two bug fixes involving the TDP MMU. Found by inspection while working on a series to consolidate MMU notifier memslot walks across architectures, which I'll hopefully post next week. Patch 1 fixes a bug where KVM yields, e.g. due to lock contention, without performing a pending TLB flush that was required from a previous root. Patch 2 fixes a much more egregious bug where it fails to handle TDP MMU flushes in NX huge page recovery, as well as a similar bug to patch 1 where KVM can yield without correctly handling a previously triggered pending TLB flush. Sean Christopherson (2): KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping arch/x86/kvm/mmu/mmu.c | 15 ++++++++++----- arch/x86/kvm/mmu/tdp_mmu.c | 29 +++++++++++++++-------------- arch/x86/kvm/mmu/tdp_mmu.h | 3 ++- 3 files changed, 27 insertions(+), 20 deletions(-) -- 2.31.0.rc2.261.g7f71774620-goog ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap 2021-03-19 23:20 [PATCH 0/2] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU Sean Christopherson @ 2021-03-19 23:20 ` Sean Christopherson 2021-03-22 21:27 ` Ben Gardon 2021-03-19 23:20 ` [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping Sean Christopherson 1 sibling, 1 reply; 10+ messages in thread From: Sean Christopherson @ 2021-03-19 23:20 UTC (permalink / raw) To: Paolo Bonzini Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, linux-kernel, Ben Gardon When flushing a range of GFNs across multiple roots, ensure any pending flush from a previous root is honored before yielding while walking the tables of the current root. Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites its local "flush" with the result to avoid redundant flushes. zap_gfn_range() preserves and returns the incoming "flush", unless of course the flush was performed prior to yielding and no new flush was triggered. 
Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/mmu/tdp_mmu.c | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index f0c99fa04ef2..6cf08c3c537f 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -86,7 +86,7 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, - gfn_t start, gfn_t end, bool can_yield); + gfn_t start, gfn_t end, bool can_yield, bool flush); void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root) { @@ -99,7 +99,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root) list_del(&root->link); - zap_gfn_range(kvm, root, 0, max_gfn, false); + zap_gfn_range(kvm, root, 0, max_gfn, false, false); free_page((unsigned long)root->spt); kmem_cache_free(mmu_page_header_cache, root); @@ -664,20 +664,21 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm, * scheduler needs the CPU or there is contention on the MMU lock. If this * function cannot yield, it will not release the MMU lock or reschedule and * the caller must ensure it does not supply too large a GFN range, or the - * operation can cause a soft lockup. + * operation can cause a soft lockup. Note, in some use cases a flush may be + * required by prior actions. Ensure the pending flush is performed prior to + * yielding. 
*/ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, - gfn_t start, gfn_t end, bool can_yield) + gfn_t start, gfn_t end, bool can_yield, bool flush) { struct tdp_iter iter; - bool flush_needed = false; rcu_read_lock(); tdp_root_for_each_pte(iter, root, start, end) { if (can_yield && - tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) { - flush_needed = false; + tdp_mmu_iter_cond_resched(kvm, &iter, flush)) { + flush = false; continue; } @@ -695,11 +696,11 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, continue; tdp_mmu_set_spte(kvm, &iter, 0); - flush_needed = true; + flush = true; } rcu_read_unlock(); - return flush_needed; + return flush; } /* @@ -714,7 +715,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end) bool flush = false; for_each_tdp_mmu_root_yield_safe(kvm, root) - flush |= zap_gfn_range(kvm, root, start, end, true); + flush = zap_gfn_range(kvm, root, start, end, true, flush); return flush; } @@ -931,7 +932,7 @@ static int zap_gfn_range_hva_wrapper(struct kvm *kvm, struct kvm_mmu_page *root, gfn_t start, gfn_t end, unsigned long unused) { - return zap_gfn_range(kvm, root, start, end, false); + return zap_gfn_range(kvm, root, start, end, false, false); } int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start, -- 2.31.0.rc2.261.g7f71774620-goog ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap 2021-03-19 23:20 ` [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson @ 2021-03-22 21:27 ` Ben Gardon 0 siblings, 0 replies; 10+ messages in thread From: Ben Gardon @ 2021-03-22 21:27 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@google.com> wrote: > > When flushing a range of GFNs across multiple roots, ensure any pending > flush from a previous root is honored before yielding while walking the > tables of the current root. > > Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites it local > "flush" with the result to avoid redundant flushes. zap_gfn_range() > preserves and return the incoming "flush", unless of course the flush was > performed prior to yielding and no new flush was triggered. > > Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed") > Cc: stable@vger.kernel.org > Cc: Ben Gardon <bgardon@google.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-By: Ben Gardon <bgardon@google.com> > --- > arch/x86/kvm/mmu/tdp_mmu.c | 23 ++++++++++++----------- > 1 file changed, 12 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index f0c99fa04ef2..6cf08c3c537f 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -86,7 +86,7 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, > list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) > > static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, > - gfn_t start, gfn_t end, bool can_yield); > + gfn_t start, gfn_t end, bool can_yield, bool flush); This function is going to acquire so many arguments. 
Don't need to do anything about it here, but this is going to need some kind of cleanup at some point. I'll have to add another "shared" type arg for running this function under the read lock in a series I'm prepping. > > void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root) > { > @@ -99,7 +99,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root) > > list_del(&root->link); > > - zap_gfn_range(kvm, root, 0, max_gfn, false); > + zap_gfn_range(kvm, root, 0, max_gfn, false, false); > > free_page((unsigned long)root->spt); > kmem_cache_free(mmu_page_header_cache, root); > @@ -664,20 +664,21 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm, > * scheduler needs the CPU or there is contention on the MMU lock. If this > * function cannot yield, it will not release the MMU lock or reschedule and > * the caller must ensure it does not supply too large a GFN range, or the > - * operation can cause a soft lockup. > + * operation can cause a soft lockup. Note, in some use cases a flush may be > + * required by prior actions. Ensure the pending flush is performed prior to > + * yielding. 
> */ > static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, > - gfn_t start, gfn_t end, bool can_yield) > + gfn_t start, gfn_t end, bool can_yield, bool flush) > { > struct tdp_iter iter; > - bool flush_needed = false; > > rcu_read_lock(); > > tdp_root_for_each_pte(iter, root, start, end) { > if (can_yield && > - tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) { > - flush_needed = false; > + tdp_mmu_iter_cond_resched(kvm, &iter, flush)) { > + flush = false; > continue; > } > > @@ -695,11 +696,11 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, > continue; > > tdp_mmu_set_spte(kvm, &iter, 0); > - flush_needed = true; > + flush = true; > } > > rcu_read_unlock(); > - return flush_needed; > + return flush; > } > > /* > @@ -714,7 +715,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end) > bool flush = false; > > for_each_tdp_mmu_root_yield_safe(kvm, root) > - flush |= zap_gfn_range(kvm, root, start, end, true); > + flush = zap_gfn_range(kvm, root, start, end, true, flush); > > return flush; > } > @@ -931,7 +932,7 @@ static int zap_gfn_range_hva_wrapper(struct kvm *kvm, > struct kvm_mmu_page *root, gfn_t start, > gfn_t end, unsigned long unused) > { > - return zap_gfn_range(kvm, root, start, end, false); > + return zap_gfn_range(kvm, root, start, end, false, false); > } > > int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start, > -- > 2.31.0.rc2.261.g7f71774620-goog > ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping 2021-03-19 23:20 [PATCH 0/2] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU Sean Christopherson 2021-03-19 23:20 ` [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson @ 2021-03-19 23:20 ` Sean Christopherson 2021-03-22 21:27 ` Ben Gardon 1 sibling, 1 reply; 10+ messages in thread From: Sean Christopherson @ 2021-03-19 23:20 UTC (permalink / raw) To: Paolo Bonzini Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, linux-kernel, Ben Gardon Fix two intertwined bugs in the NX huge page zapping that were introduced by the incorporation of the TDP MMU. Because there is a unified list of NX huge pages, zapping can encounter both TDP MMU and legacy MMU pages, and the two MMUs have different tracking for TLB flushing. If one flavor needs a flush, but the code for the other flavor yields, KVM will fail to flush before yielding. First, honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), which does the flush itself if and only if it yields, and otherwise expects the caller to do the flush. This requires feeding the result into kvm_mmu_remote_flush_or_zap(), and so also fixes the case where the TDP MMU needs a flush, the legacy MMU does not, and the main loop yields. Second, tell the TDP MMU a flush is pending if the list of zapped pages from legacy MMUs is not empty, i.e. the legacy MMU needs a flush. This fixes the case where the TDP MMU yields, but it itself does not require a flush. 
Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/mmu/mmu.c | 15 ++++++++++----- arch/x86/kvm/mmu/tdp_mmu.c | 6 +++--- arch/x86/kvm/mmu/tdp_mmu.h | 3 ++- 3 files changed, 15 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index c6ed633594a2..413d6259340e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5517,7 +5517,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) } if (is_tdp_mmu_enabled(kvm)) { - flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end); + flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end, + false); if (flush) kvm_flush_remote_tlbs(kvm); } @@ -5939,6 +5940,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) struct kvm_mmu_page *sp; unsigned int ratio; LIST_HEAD(invalid_list); + bool flush = false; + gfn_t gfn_end; ulong to_zap; rcu_idx = srcu_read_lock(&kvm->srcu); @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) lpage_disallowed_link); WARN_ON_ONCE(!sp->lpage_disallowed); if (is_tdp_mmu_page(sp)) { - kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, - sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level)); + gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level); + flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end, + flush || !list_empty(&invalid_list)); } else { kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); WARN_ON_ONCE(sp->lpage_disallowed); } if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { - kvm_mmu_commit_zap_page(kvm, &invalid_list); + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); cond_resched_rwlock_write(&kvm->mmu_lock); + flush = false; } } - kvm_mmu_commit_zap_page(kvm, &invalid_list); + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); write_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, rcu_idx); diff --git 
a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 6cf08c3c537f..367f12bf1026 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -709,10 +709,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, * SPTEs have been cleared and a TLB flush is needed before releasing the * MMU lock. */ -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end) +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end, + bool flush) { struct kvm_mmu_page *root; - bool flush = false; for_each_tdp_mmu_root_yield_safe(kvm, root) flush = zap_gfn_range(kvm, root, start, end, true, flush); @@ -725,7 +725,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm) gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT); bool flush; - flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn); + flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn, false); if (flush) kvm_flush_remote_tlbs(kvm); } diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 3b761c111bff..e39bee52d49e 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -8,7 +8,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu); void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root); -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end); +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end, + bool flush); void kvm_tdp_mmu_zap_all(struct kvm *kvm); int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, -- 2.31.0.rc2.261.g7f71774620-goog ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping 2021-03-19 23:20 ` [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping Sean Christopherson @ 2021-03-22 21:27 ` Ben Gardon 2021-03-23 0:15 ` Sean Christopherson 0 siblings, 1 reply; 10+ messages in thread From: Ben Gardon @ 2021-03-22 21:27 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@google.com> wrote: > > Fix two intertwined bugs in the NX huge page zapping that were introduced > by the incorporation of the TDP MMU. Because there is a unified list of > NX huge pages, zapping can encounter both TDP MMU and legacy MMU pages, > and the two MMUs have different tracking for TLB flushing. If one flavor > needs a flush, but the code for the other flavor yields, KVM will fail to > flush before yielding. > > First, honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), > which does the flush itself if and only if it yields, and otherwise > expects the caller to do the flush. This requires feeding the result > into kvm_mmu_remote_flush_or_zap(), and so also fixes the case where the > TDP MMU needs a flush, the legacy MMU does not, and the main loop yields. > > Second, tell the TDP MMU a flush is pending if the list of zapped pages > from legacy MMUs is not empty, i.e. the legacy MMU needs a flush. This > fixes the case where the TDP MMU yields, but it iteslf does not require a > flush. > > Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU") > Cc: stable@vger.kernel.org > Cc: Ben Gardon <bgardon@google.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-By: Ben Gardon <bgardon@google.com> This preserves an extremely unlikely degenerate case, which could cause an unexpected delay. 
The scenario is described below, but I don't think this change needs to be blocked on it. > --- > arch/x86/kvm/mmu/mmu.c | 15 ++++++++++----- > arch/x86/kvm/mmu/tdp_mmu.c | 6 +++--- > arch/x86/kvm/mmu/tdp_mmu.h | 3 ++- > 3 files changed, 15 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index c6ed633594a2..413d6259340e 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -5517,7 +5517,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) > } > > if (is_tdp_mmu_enabled(kvm)) { > - flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end); > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end, > + false); > if (flush) > kvm_flush_remote_tlbs(kvm); > } > @@ -5939,6 +5940,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > struct kvm_mmu_page *sp; > unsigned int ratio; > LIST_HEAD(invalid_list); > + bool flush = false; > + gfn_t gfn_end; > ulong to_zap; > > rcu_idx = srcu_read_lock(&kvm->srcu); > @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > lpage_disallowed_link); > WARN_ON_ONCE(!sp->lpage_disallowed); > if (is_tdp_mmu_page(sp)) { > - kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > - sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level)); > + gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level); > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end, > + flush || !list_empty(&invalid_list)); > } else { > kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); > WARN_ON_ONCE(sp->lpage_disallowed); > } > > if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { > - kvm_mmu_commit_zap_page(kvm, &invalid_list); > + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); This pattern of waiting until a yield is needed or lock contention is detected has always been a little suspect to me because kvm_mmu_commit_zap_page does work proportional to the work done before the yield was needed. 
That seems like more work than we should like to be doing at that point. The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even worse. Because we can satisfy the need to yield without clearing out the invalid list, we can potentially queue many more pages which will then all need to have their zaps committed at once. This is an admittedly contrived case which could only be hit in a high load nested scenario. It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from yielding. Since we should only need to zap one SPTE, the yield should not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure that only one SPTE is zapped we would have to specify the root though. Otherwise we could end up zapping all the entries for the same GFN range under an unrelated root. Anyway, I can address this in another patch. > cond_resched_rwlock_write(&kvm->mmu_lock); > + flush = false; > } > } > - kvm_mmu_commit_zap_page(kvm, &invalid_list); > + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); > > write_unlock(&kvm->mmu_lock); > srcu_read_unlock(&kvm->srcu, rcu_idx); > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 6cf08c3c537f..367f12bf1026 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -709,10 +709,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, > * SPTEs have been cleared and a TLB flush is needed before releasing the > * MMU lock. 
> */ > -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end) > +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end, > + bool flush) > { > struct kvm_mmu_page *root; > - bool flush = false; > > for_each_tdp_mmu_root_yield_safe(kvm, root) > flush = zap_gfn_range(kvm, root, start, end, true, flush); > @@ -725,7 +725,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm) > gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT); > bool flush; > > - flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn); > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn, false); > if (flush) > kvm_flush_remote_tlbs(kvm); > } > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h > index 3b761c111bff..e39bee52d49e 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.h > +++ b/arch/x86/kvm/mmu/tdp_mmu.h > @@ -8,7 +8,8 @@ > hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu); > void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root); > > -bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end); > +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end, > + bool flush); > void kvm_tdp_mmu_zap_all(struct kvm *kvm); > > int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, > -- > 2.31.0.rc2.261.g7f71774620-goog > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping 2021-03-22 21:27 ` Ben Gardon @ 2021-03-23 0:15 ` Sean Christopherson 2021-03-23 16:26 ` Ben Gardon 0 siblings, 1 reply; 10+ messages in thread From: Sean Christopherson @ 2021-03-23 0:15 UTC (permalink / raw) To: Ben Gardon Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML On Mon, Mar 22, 2021, Ben Gardon wrote: > On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@google.com> wrote: > > @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > > lpage_disallowed_link); > > WARN_ON_ONCE(!sp->lpage_disallowed); > > if (is_tdp_mmu_page(sp)) { > > - kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > > - sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level)); > > + gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level); > > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end, > > + flush || !list_empty(&invalid_list)); > > } else { > > kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); > > WARN_ON_ONCE(sp->lpage_disallowed); > > } > > > > if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { > > - kvm_mmu_commit_zap_page(kvm, &invalid_list); > > + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); > > This pattern of waiting until a yield is needed or lock contention is > detected has always been a little suspect to me because > kvm_mmu_commit_zap_page does work proportional to the work done before > the yield was needed. That seems like more work than we should like to > be doing at that point. > > The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even > worse. Because we can satisfy the need to yield without clearing out > the invalid list, we can potentially queue many more pages which will > then all need to have their zaps committed at once. This is an > admittedly contrived case which could only be hit in a high load > nested scenario. 
> > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from > yielding. Since we should only need to zap one SPTE, the yield should > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure > that only one SPTE is zapped we would have to specify the root though. > Otherwise we could end up zapping all the entries for the same GFN > range under an unrelated root. Hmm, I originally did exactly that, but changed my mind because this zaps far more than 1 SPTE. This is zapping a SP that could be huge, but is not, which means it's guaranteed to have a non-zero number of child SPTEs. The worst case scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs. But, I didn't consider the interplay between invalid_list and the TDP MMU yielding. Hrm. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping 2021-03-23 0:15 ` Sean Christopherson @ 2021-03-23 16:26 ` Ben Gardon 2021-03-23 18:58 ` Sean Christopherson 0 siblings, 1 reply; 10+ messages in thread From: Ben Gardon @ 2021-03-23 16:26 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Mar 22, 2021, Ben Gardon wrote: > > On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@google.com> wrote: > > > @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) > > > lpage_disallowed_link); > > > WARN_ON_ONCE(!sp->lpage_disallowed); > > > if (is_tdp_mmu_page(sp)) { > > > - kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, > > > - sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level)); > > > + gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level); > > > + flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end, > > > + flush || !list_empty(&invalid_list)); > > > } else { > > > kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); > > > WARN_ON_ONCE(sp->lpage_disallowed); > > > } > > > > > > if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { > > > - kvm_mmu_commit_zap_page(kvm, &invalid_list); > > > + kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); > > > > This pattern of waiting until a yield is needed or lock contention is > > detected has always been a little suspect to me because > > kvm_mmu_commit_zap_page does work proportional to the work done before > > the yield was needed. That seems like more work than we should like to > > be doing at that point. > > > > The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even > > worse. Because we can satisfy the need to yield without clearing out > > the invalid list, we can potentially queue many more pages which will > > then all need to have their zaps committed at once. 
This is an > > admittedly contrived case which could only be hit in a high load > > nested scenario. > > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from > > yielding. Since we should only need to zap one SPTE, the yield should > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure > > that only one SPTE is zapped we would have to specify the root though. > > Otherwise we could end up zapping all the entries for the same GFN > > range under an unrelated root. > > Hmm, I originally did exactly that, but changed my mind because this zaps far > more than 1 SPTE. This is zapping a SP that could be huge, but is not, which > means it's guaranteed to have a non-zero number of child SPTEs. The worst case > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs. It's true that there are potentially 512^2 child sptes, but the code to clear those after the single PUD spte is cleared doesn't yield anyway. If the TDP MMU is only operating with one root (as we would expect in most cases), there should only be one chance for it to yield. I've considered how we could allow the recursive changed spte handlers to yield, but it gets complicated quite fast because the caller needs to know if it yielded and reset the TDP iterator to the root, and there are some cases (mmu notifiers + vCPU path) where yielding is not desirable. > > But, I didn't consider the interplay between invalid_list and the TDP MMU > yielding. Hrm. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping 2021-03-23 16:26 ` Ben Gardon @ 2021-03-23 18:58 ` Sean Christopherson 2021-03-23 20:34 ` Ben Gardon 0 siblings, 1 reply; 10+ messages in thread From: Sean Christopherson @ 2021-03-23 18:58 UTC (permalink / raw) To: Ben Gardon Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML On Tue, Mar 23, 2021, Ben Gardon wrote: > On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Mon, Mar 22, 2021, Ben Gardon wrote: > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from > > > yielding. Since we should only need to zap one SPTE, the yield should > > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure > > > that only one SPTE is zapped we would have to specify the root though. > > > Otherwise we could end up zapping all the entries for the same GFN > > > range under an unrelated root. > > > > Hmm, I originally did exactly that, but changed my mind because this zaps far > > more than 1 SPTE. This is zapping a SP that could be huge, but is not, which > > means it's guaranteed to have a non-zero number of child SPTEs. The worst case > > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs. > > It's true that there are potentially 512^2 child sptes, but the code > to clear those after the single PUD spte is cleared doesn't yield > anyway. If the TDP MMU is only operating with one root (as we would > expect in most cases), there should only be one chance for it to > yield. Ah, right, I was thinking all the iterative flows yielded. Disallowing kvm_tdp_mmu_zap_gfn_range() from yielding in this case does seem like the best fix. Any objection to me sending v2 with that? 
> I've considered how we could allow the recursive changed spte handlers > to yield, but it gets complicated quite fast because the caller needs > to know if it yielded and reset the TDP iterator to the root, and > there are some cases (mmu notifiers + vCPU path) where yielding is not > desirable. Urgh, yeah, seems like we'd quickly end up with a mess resembling the legacy MMU iterators. > > > > But, I didn't consider the interplay between invalid_list and the TDP MMU > > yielding. Hrm. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping 2021-03-23 18:58 ` Sean Christopherson @ 2021-03-23 20:34 ` Ben Gardon 2021-03-25 19:15 ` Sean Christopherson 0 siblings, 1 reply; 10+ messages in thread From: Ben Gardon @ 2021-03-23 20:34 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML On Tue, Mar 23, 2021 at 11:58 AM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Mar 23, 2021, Ben Gardon wrote: > > On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Mon, Mar 22, 2021, Ben Gardon wrote: > > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from > > > > yielding. Since we should only need to zap one SPTE, the yield should > > > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure > > > > that only one SPTE is zapped we would have to specify the root though. > > > > Otherwise we could end up zapping all the entries for the same GFN > > > > range under an unrelated root. > > > > > > Hmm, I originally did exactly that, but changed my mind because this zaps far > > > more than 1 SPTE. This is zapping a SP that could be huge, but is not, which > > > means it's guaranteed to have a non-zero number of child SPTEs. The worst case > > > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs. > > > > It's true that there are potentially 512^2 child sptes, but the code > > to clear those after the single PUD spte is cleared doesn't yield > > anyway. If the TDP MMU is only operating with one root (as we would > > expect in most cases), there should only be one chance for it to > > yield. > > Ah, right, I was thinking all the iterative flows yielded. Disallowing > kvm_tdp_mmu_zap_gfn_range() from yielding in this case does seem like the best > fix. Any objection to me sending v2 with that? That sounds good to me. 
> > I've considered how we could allow the recursive changed spte handlers
> > to yield, but it gets complicated quite fast because the caller needs
> > to know if it yielded and reset the TDP iterator to the root, and
> > there are some cases (mmu notifiers + vCPU path) where yielding is not
> > desirable.
>
> Urgh, yeah, seems like we'd quickly end up with a mess resembling the legacy MMU
> iterators.
>
> >
> > > But, I didn't consider the interplay between invalid_list and the TDP MMU
> > > yielding.  Hrm.
* Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping
  2021-03-23 20:34             ` Ben Gardon
@ 2021-03-25 19:15               ` Sean Christopherson
  0 siblings, 0 replies; 10+ messages in thread
From: Sean Christopherson @ 2021-03-25 19:15 UTC (permalink / raw)
To: Ben Gardon
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Tue, Mar 23, 2021, Ben Gardon wrote:
> On Tue, Mar 23, 2021 at 11:58 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Mar 23, 2021, Ben Gardon wrote:
> > > On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Mon, Mar 22, 2021, Ben Gardon wrote:
> > > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> > > > > yielding. Since we should only need to zap one SPTE, the yield should
> > > > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> > > > > that only one SPTE is zapped we would have to specify the root though.
> > > > > Otherwise we could end up zapping all the entries for the same GFN
> > > > > range under an unrelated root.
> > > >
> > > > Hmm, I originally did exactly that, but changed my mind because this zaps far
> > > > more than 1 SPTE. This is zapping a SP that could be huge, but is not, which
> > > > means it's guaranteed to have a non-zero number of child SPTEs. The worst case
> > > > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs.
> > >
> > > It's true that there are potentially 512^2 child sptes, but the code
> > > to clear those after the single PUD spte is cleared doesn't yield
> > > anyway. If the TDP MMU is only operating with one root (as we would
> > > expect in most cases), there should only be one chance for it to
> > > yield.
> >
> > Ah, right, I was thinking all the iterative flows yielded. Disallowing
> > kvm_tdp_mmu_zap_gfn_range() from yielding in this case does seem like the best
> > fix. Any objection to me sending v2 with that?
>
> That sounds good to me.

Ewww.  This analysis isn't 100% accurate.  It's actually impossible for
zap_gfn_range() to yield in this case.  Even though it may walk multiple
roots and levels, "yielded_gfn == next_last_level_gfn" will hold true until
the iter attempts to step sideways.  When the iter attempts to step
sideways, it will always do so at the level that matches the zapping level,
and so will always step past "end".  Thus, tdp_root_for_each_pte() will
break without ever yielding.

That being said, I'm still going to send a patch to explicitly prevent this
path from yielding.  Relying on the above is waaaay too subtle and fragile.

> > > I've considered how we could allow the recursive changed spte handlers
> > > to yield, but it gets complicated quite fast because the caller needs
> > > to know if it yielded and reset the TDP iterator to the root, and
> > > there are some cases (mmu notifiers + vCPU path) where yielding is not
> > > desirable.
> >
> > Urgh, yeah, seems like we'd quickly end up with a mess resembling the legacy MMU
> > iterators.
> >
> > >
> > > > But, I didn't consider the interplay between invalid_list and the TDP MMU
> > > > yielding.  Hrm.
end of thread, other threads:[~2021-03-25 19:16 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-19 23:20 [PATCH 0/2] KVM: x86/mmu: Fix TLB flushing bugs in TDP MMU Sean Christopherson
2021-03-19 23:20 ` [PATCH 1/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap Sean Christopherson
2021-03-22 21:27   ` Ben Gardon
2021-03-19 23:20 ` [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping Sean Christopherson
2021-03-22 21:27   ` Ben Gardon
2021-03-23  0:15     ` Sean Christopherson
2021-03-23 16:26       ` Ben Gardon
2021-03-23 18:58         ` Sean Christopherson
2021-03-23 20:34           ` Ben Gardon
2021-03-25 19:15             ` Sean Christopherson