* [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism @ 2019-02-05 20:54 Sean Christopherson 2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Sean Christopherson @ 2019-02-05 20:54 UTC (permalink / raw) To: Paolo Bonzini, Radim Krčmář, James Hogan, Paul Mackerras, Christian Borntraeger, Janosch Frank, Christoffer Dall, Marc Zyngier Cc: linux-s390, kvm, David Hildenbrand, Cornelia Huck, linux-mips, kvm-ppc, Xiao Guangrong, kvmarm, linux-arm-kernel ...and clean up the MMIO spte generation code. === Non-x86 Folks === Patch 01/27 (which you should also be cc'd on) changes the prototype for a function that is defined by all architectures but only actually used by x86. Feel free to skip the rest of this cover letter. === x86 Folks === Though not explicitly stated, for all intents and purposes the fast invalidate mechanism was added to speed up the scenario where removing a memslot, e.g. when accessing PCI ROM, caused KVM to flush all shadow entries[1]. The other use cases of "flush everything" are VM teardown and handling MMIO generation overflow, neither of which is a performance critical path (see "Other Uses" below). For the memslot case, zapping all shadow entries is overkill, i.e. KVM only needs to zap the entries associated with the memslot, but KVM has historically used a big hammer because if all you have is a hammer, everything looks like a nail. Rather than zap all pages when removing a memslot, zap only the shadow entries associated with said memslot. I see a performance improvement of ~5% when zapping only the pages for the deleted memslot when using a slightly modified version of the original benchmark[2][3][4] (I don't have easy access to a system with hundreds of gigs of memory). $ cat shell.sh #!/bin/sh echo 3 > /proc/sys/vm/drop_caches ./mmtest -c 8 -m 2000 -e ./rom.sh I.e. 
running 8 threads and 2gb of memory per thread, time in milliseconds: Before: 89.117 After: 84.768 With the memslot use case gone, maintaining the fast invalidate path adds a moderate amount of complexity but provides little to no value. Furthermore, its existence may give the impression that it's "ok" to zap all shadow pages. Remove the fast invalidate mechanism to simplify the code and to discourage future code from zapping all pages as using such a big hammer should be a last resort. Unfortunately, simply reverting the fast invalidate code re-introduces shortcomings in the previous code, notably that KVM may hold a spinlock and not schedule for an extended period of time. Releasing the lock to voluntarily reschedule is perfectly doable, and for the VM teardown case of "zap everything" it can be added with no additional changes since KVM doesn't need to provide any ordering guarantees about sptes in that case. For the MMIO case, KVM needs to prevent vCPUs from consuming stale sptes created with an old generation number. This should be a non-issue as KVM is supposed to prevent consuming stale MMIO sptes regardless of the zapping mechanism employed, releasing mmu_lock while zapping simply makes it much more likely to hit any bug leading to stale spte usage. As luck would have it, there are a number of corner cases in the MMIO spte invalidation flow that could theoretically result in reusing stale sptes. So before reworking memslot zapping and reverting to the slow invalidate mechanism, fix a number of bugs related to MMIO generations and opportunistically clean up the memslots/MMIO generation code. === History === Flushing of shadow pages when removing a memslot was originally added by commit 34d4cb8fca1f ("KVM: MMU: nuke shadowed pgtable pages and ptes on memslot destruction"), and obviously emphasized functionality over performance. 
Commit 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow") added a path to allow flushing only the removed slot's shadow pages, but x86 just redirected to the "zap all" flow. Eventually, it became evident that zapping everything is slow, and so x86 developed a more efficient hammer in the form of the fast invalidate mechanism. === Other Uses === When a VM is being destroyed, either there are no active vcpus, i.e. there's no lock contention, or the VM has ungracefully terminated, in which case we want to reclaim its pages as quickly as possible, i.e. not release the MMU lock if there are still CPUs executing in the VM. The MMIO generation scenario is almost literally a one-in-a-million occurrence, i.e. is not a performance sensitive scenario. It's worth noting that prior to the "fast invalidate" series being applied, there was actually another use case of kvm_mmu_zap_all() in emulator_fix_hypercall() whose existence may have contributed to improving the performance of "zap all" instead of avoiding it altogether. But that usage was removed by the series itself in commit 758ccc89b83c ("KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall"). 
[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com [2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com [3] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277.html [4] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2 v1: https://patchwork.kernel.org/cover/10756579/ v2: - Fix a zap vs flush priority bug in kvm_mmu_remote_flush_or_zap() [me] - Voluntarily reschedule and/or drop mmu_lock as needed when zapping all sptes or all MMIO sptes [Paolo] - Add patches to clean up the memslots/MMIO generation code and fix a variety of theoretical corner case bugs Sean Christopherson (27): KVM: Call kvm_arch_memslots_updated() before updating memslots KVM: x86/mmu: Detect MMIO generation wrap in any address space KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux KVM: Explicitly define the "memslot update in-progress" bit KVM: x86: Use a u64 when passing the MMIO gen around KVM: x86: Refactor the MMIO SPTE generation handling KVM: Remove the hack to trigger memslot generation wraparound KVM: Move the memslot update in-progress flag to bit 63 KVM: x86/mmu: Move slot_level_*() helper functions up a few lines KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap() KVM: x86/mmu: Zap only the relevant pages when removing a memslot Revert "KVM: MMU: document fast invalidate all pages" Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes" KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes KVM: x86/mmu: Remove is_obsolete() call Revert "KVM: MMU: reclaim the zapped-obsolete page first" Revert "KVM: MMU: collapse TLB flushes when zap all pages" Revert "KVM: MMU: zap pages in batch" Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages" Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints" Revert "KVM: x86: use the fast way to invalidate all pages" KVM: x86/mmu: skip over invalid root pages when 
zapping all sptes KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes Revert "KVM: MMU: fast invalidate all pages" KVM: x86/mmu: Differentiate between nr zapped and list unstable KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes() Documentation/virtual/kvm/mmu.txt | 41 +-- arch/mips/include/asm/kvm_host.h | 2 +- arch/powerpc/include/asm/kvm_host.h | 2 +- arch/s390/include/asm/kvm_host.h | 2 +- arch/x86/include/asm/kvm_host.h | 9 +- arch/x86/kvm/mmu.c | 442 +++++++++++++--------------- arch/x86/kvm/mmu.h | 1 - arch/x86/kvm/mmutrace.h | 42 +-- arch/x86/kvm/x86.c | 7 +- arch/x86/kvm/x86.h | 7 +- include/linux/kvm_host.h | 23 +- virt/kvm/arm/mmu.c | 2 +- virt/kvm/kvm_main.c | 39 ++- 13 files changed, 286 insertions(+), 333 deletions(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 38+ messages in thread
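[Editor's note: the memslot-scoped zap the cover letter describes can be sketched as a user-space toy model. This is illustrative only, not kernel code -- `struct toy_sp`, `zap_memslot()`, and the gfn range check are invented names: each shadow entry is tagged with the gfn it maps, and deleting a memslot visits only entries whose gfn falls inside the slot, instead of nuking every entry in the VM.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of a memslot-scoped zap -- illustrative only, not kernel
 * code; all names are invented.  Each shadow entry is tagged with the
 * gfn it maps, and deleting a memslot touches only the entries whose
 * gfn lies in [base_gfn, base_gfn + npages), unlike "zap all", which
 * walks every entry in the VM regardless of the slot's size.
 */
struct toy_sp {
	unsigned long gfn;
	int zapped;
};

static size_t zap_memslot(struct toy_sp *sps, size_t nr_sps,
			  unsigned long base_gfn, unsigned long npages)
{
	size_t i, nr_zapped = 0;

	for (i = 0; i < nr_sps; i++) {
		if (sps[i].gfn >= base_gfn &&
		    sps[i].gfn < base_gfn + npages) {
			sps[i].zapped = 1;
			nr_zapped++;
		}
	}
	return nr_zapped;
}
```

The work done is proportional to the entries actually belonging to the deleted slot, which is the source of the ~5% throughput win in the ROM benchmark quoted above.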
* [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots 2019-02-05 20:54 [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Sean Christopherson @ 2019-02-05 20:54 ` Sean Christopherson 2019-02-06 9:12 ` Cornelia Huck 2019-02-12 12:36 ` [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Paolo Bonzini [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com> 2 siblings, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2019-02-05 20:54 UTC (permalink / raw) To: Paolo Bonzini, Radim Krčmář, James Hogan, Paul Mackerras, Christian Borntraeger, Janosch Frank, Christoffer Dall, Marc Zyngier Cc: linux-s390, kvm, David Hildenbrand, Cornelia Huck, linux-mips, kvm-ppc, Xiao Guangrong, kvmarm, linux-arm-kernel kvm_arch_memslots_updated() is at this point in time an x86-specific hook for handling MMIO generation wraparound. x86 stashes 19 bits of the memslots generation number in its MMIO sptes in order to avoid full page fault walks for repeat faults on emulated MMIO addresses. Because only 19 bits are used, wrapping the MMIO generation number is possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that the generation has changed so that it can invalidate all MMIO sptes in case the effective MMIO generation has wrapped so as to avoid using a stale spte, e.g. a (very) old spte that was created with generation==0. Given that the purpose of kvm_arch_memslots_updated() is to prevent consuming stale entries, it needs to be called before the new generation is propagated to memslots. Invalidating the MMIO sptes after updating memslots means that there is a window where a vCPU could dereference the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO spte that was created with (pre-wrap) generation==0. 
Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()") Cc: <stable@vger.kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> --- arch/mips/include/asm/kvm_host.h | 2 +- arch/powerpc/include/asm/kvm_host.h | 2 +- arch/s390/include/asm/kvm_host.h | 2 +- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/mmu.c | 4 ++-- arch/x86/kvm/x86.c | 4 ++-- include/linux/kvm_host.h | 2 +- virt/kvm/arm/mmu.c | 2 +- virt/kvm/kvm_main.c | 7 +++++-- 9 files changed, 15 insertions(+), 12 deletions(-) diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 71c3f21d80d5..f764a2de68e2 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -1131,7 +1131,7 @@ static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {} -static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {} +static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {} static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {} static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {} diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 0f98f00da2ea..19693b8add93 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -837,7 +837,7 @@ struct kvm_vcpu_arch { static inline void kvm_arch_hardware_disable(void) {} static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} -static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {} +static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {} static inline void 
kvm_arch_flush_shadow_all(struct kvm *kvm) {} static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} static inline void kvm_arch_exit(void) {} diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h index d5d24889c3bc..c2b8c8c6c9be 100644 --- a/arch/s390/include/asm/kvm_host.h +++ b/arch/s390/include/asm/kvm_host.h @@ -878,7 +878,7 @@ static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} static inline void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {} -static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {} +static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {} static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {} static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) {} diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4660ce90de7f..1c7f726e3ad5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1253,7 +1253,7 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn_offset, unsigned long mask); void kvm_mmu_zap_all(struct kvm *kvm); -void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots); +void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen); unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm); void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index ce770b446238..fb040ea42454 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -5890,13 +5890,13 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm) return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages)); } -void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct 
kvm_memslots *slots) +void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen) { /* * The very rare case: if the generation-number is round, * zap all shadow pages. */ - if (unlikely((slots->generation & MMIO_GEN_MASK) == 0)) { + if (unlikely((gen & MMIO_GEN_MASK) == 0)) { kvm_debug_ratelimited("kvm: zapping shadow pages for mmio generation wraparound\n"); kvm_mmu_invalidate_zap_all_pages(kvm); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9ce1b8e2d3b9..4d05fca04fe1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9337,13 +9337,13 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, return -ENOMEM; } -void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) +void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) { /* * memslots->generation has been incremented. * mmio generation may have reached its maximum value. */ - kvm_mmu_invalidate_mmio_sptes(kvm, slots); + kvm_mmu_invalidate_mmio_sptes(kvm, gen); } int kvm_arch_prepare_memory_region(struct kvm *kvm, diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c38cc5eb7e73..cf761ff58224 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -634,7 +634,7 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free, struct kvm_memory_slot *dont); int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned long npages); -void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots); +void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); int kvm_arch_prepare_memory_region(struct kvm *kvm, struct kvm_memory_slot *memslot, const struct kvm_userspace_memory_region *mem, diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index 3053bf2584f8..3ab0daee3fea 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -2350,7 +2350,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, return 0; } -void kvm_arch_memslots_updated(struct 
kvm *kvm, struct kvm_memslots *slots) +void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) { } diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index cf7cc0554094..df5dc7224f02 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -878,6 +878,7 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm, int as_id, struct kvm_memslots *slots) { struct kvm_memslots *old_memslots = __kvm_memslots(kvm, as_id); + u64 gen; /* * Set the low bit in the generation, which disables SPTE caching @@ -900,9 +901,11 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm, * space 0 will use generations 0, 4, 8, ... while * address space 1 will * use generations 2, 6, 10, 14, ... */ - slots->generation += KVM_ADDRESS_SPACE_NUM * 2 - 1; + gen = slots->generation + KVM_ADDRESS_SPACE_NUM * 2 - 1; - kvm_arch_memslots_updated(kvm, slots); + kvm_arch_memslots_updated(kvm, gen); + + slots->generation = gen; return old_memslots; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 38+ messages in thread
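[Editor's note: the ordering the patch enforces -- invalidate against the *new* generation, then publish it -- can be modeled in a few lines of user-space C. This is a sketch with invented names: the 19-bit mask only mirrors the idea behind x86's MMIO_GEN_MASK, and the address-space count is assumed to be 2 as on x86.]

```c
#include <assert.h>

/*
 * Toy model of the generation-ordering fix -- illustrative only.  The
 * arch hook must see the new generation *before* it is published in
 * slots->generation; otherwise a vCPU could observe the post-wrap
 * generation while stale MMIO sptes created with the same (pre-wrap)
 * truncated generation still exist.
 */
#define TOY_MMIO_GEN_MASK	((1ULL << 19) - 1)	/* 19 stashed bits */
#define TOY_ADDRESS_SPACE_NUM	2

static unsigned long long published_gen;	/* stands in for slots->generation */
static int mmio_sptes_valid = 1;

static void toy_memslots_updated(unsigned long long gen)
{
	/* The very rare case: the truncated generation wrapped to zero. */
	if ((gen & TOY_MMIO_GEN_MASK) == 0)
		mmio_sptes_valid = 0;	/* zap all cached MMIO sptes */
}

static void toy_install_new_memslots(void)
{
	unsigned long long gen =
		published_gen + TOY_ADDRESS_SPACE_NUM * 2 - 1;

	/* Invalidate first, then publish -- the order this patch enforces. */
	toy_memslots_updated(gen);
	published_gen = gen;
}
```

Reversing the last two statements of `toy_install_new_memslots()` reopens the window the commit message describes: the wrapped generation is visible while `mmio_sptes_valid` is still set.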
* Re: [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots 2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson @ 2019-02-06 9:12 ` Cornelia Huck 0 siblings, 0 replies; 38+ messages in thread From: Cornelia Huck @ 2019-02-06 9:12 UTC (permalink / raw) To: Sean Christopherson Cc: linux-s390, Janosch Frank, kvm, Marc Zyngier, James Hogan, David Hildenbrand, kvm-ppc, linux-mips, Paul Mackerras, Christian Borntraeger, Xiao Guangrong, Paolo Bonzini, kvmarm, linux-arm-kernel On Tue, 5 Feb 2019 12:54:17 -0800 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > kvm_arch_memslots_updated() is at this point in time an x86-specific > hook for handling MMIO generation wraparound. x86 stashes 19 bits of > the memslots generation number in its MMIO sptes in order to avoid > full page fault walks for repeat faults on emulated MMIO addresses. > Because only 19 bits are used, wrapping the MMIO generation number is > possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that > the generation has changed so that it can invalidate all MMIO sptes in > case the effective MMIO generation has wrapped so as to avoid using a > stale spte, e.g. a (very) old spte that was created with generation==0. > > Given that the purpose of kvm_arch_memslots_updated() is to prevent > consuming stale entries, it needs to be called before the new generation > is propagated to memslots. Invalidating the MMIO sptes after updating > memslots means that there is a window where a vCPU could dereference > the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO > spte that was created with (pre-wrap) generation==0. 
> > Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()") > Cc: <stable@vger.kernel.org> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > --- > arch/mips/include/asm/kvm_host.h | 2 +- > arch/powerpc/include/asm/kvm_host.h | 2 +- > arch/s390/include/asm/kvm_host.h | 2 +- > arch/x86/include/asm/kvm_host.h | 2 +- > arch/x86/kvm/mmu.c | 4 ++-- > arch/x86/kvm/x86.c | 4 ++-- > include/linux/kvm_host.h | 2 +- > virt/kvm/arm/mmu.c | 2 +- > virt/kvm/kvm_main.c | 7 +++++-- > 9 files changed, 15 insertions(+), 12 deletions(-) Not an x86 person, but I think that makes sense. Reviewed-by: Cornelia Huck <cohuck@redhat.com> ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism 2019-02-05 20:54 [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Sean Christopherson 2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson @ 2019-02-12 12:36 ` Paolo Bonzini [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com> 2 siblings, 0 replies; 38+ messages in thread From: Paolo Bonzini @ 2019-02-12 12:36 UTC (permalink / raw) To: Sean Christopherson, Radim Krčmář, James Hogan, Paul Mackerras, Christian Borntraeger, Janosch Frank, Christoffer Dall, Marc Zyngier Cc: linux-s390, kvm, David Hildenbrand, Cornelia Huck, linux-mips, kvm-ppc, Xiao Guangrong, kvmarm, linux-arm-kernel On 05/02/19 21:54, Sean Christopherson wrote: > ...and clean up the MMIO spte generation code. > > === Non-x86 Folks === > Patch 01/27 (which you should also be cc'd on) changes the prototype for > a function that is defined by all architectures but only actually used > by x86. Feel free to skip the rest of this cover letter. > > === x86 Folks === > Though not explicitly stated, for all intents and purposes the fast > invalidate mechanism was added to speed up the scenario where removing > a memslot, e.g. when accessing PCI ROM, caused KVM to flush all shadow > entries[1]. > > The other use cases of "flush everything" are VM teardown and handling > MMIO generation overflow, neither of which is a performance critial > path (see "Other Uses" below). > > For the memslot case, zapping all shadow entries is overkill, i.e. KVM > only needs to zap the entries associated with the memslot, but KVM has > historically used a big hammer because if all you have is a hammer, > everything looks like a nail. > > Rather than zap all pages when removing a memslot, zap only the shadow > entries associated with said memslot. 
I see a performance improvement > of ~5% when zapping only the pages for the deleted memslot when using a > slightly modified version of the original benchmark[2][3][4] (I don't > have easy access to a system with hundreds of gigs of memory). > > $ cat shell.sh > #!/bin/sh > > echo 3 > /proc/sys/vm/drop_caches > ./mmtest -c 8 -m 2000 -e ./rom.sh > > I.e. running 8 threads and 2gb of memory per thread, time in milliseconds: > > Before: 89.117 > After: 84.768 > > > With the memslot use case gone, maintaining the fast invalidate path > adds a moderate amount of complexity but provides little to no value. > Furhtermore, its existence may give the impression that it's "ok" to zap > all shadow pages. Remove the fast invalidate mechanism to simplify the > code and to discourage future code from zapping all pages as using such > a big hammer should be a last resort. > > Unfortunately, simply reverting the fast invalidate code re-introduces > shortcomings in the previous code, notably that KVM may hold a spinlock > and not schedule for an extended period of time. Releasing the lock to > voluntarily reschedule is perfectly doable, and for the VM teardown case > of "zap everything" it can be added with no additional changes since KVM > doesn't need to provide any ordering guarantees about sptes in that case. > > For the MMIO case, KVM needs to prevent vCPUs from consuming stale sptes > created with an old generation number. This should be a non-issue as > KVM is supposed to prevent consuming stale MMIO sptes regardless of the > zapping mechanism employed, releasing mmu_lock while zapping simply > makes it much more likely to hit any bug leading to stale spte usage. > As luck would have it, there are a number of corner cases in the MMIO > spte invalidation flow that could theoretically result in reusing stale > sptes. 
So before reworking memslot zapping and reverting to the slow > invalidate mechanism, fix a number of bugs related to MMIO generations > and opportunistically clean up the memslots/MMIO generation code. > > > === History === > Flushing of shadow pages when removing a memslot was originally added > by commit 34d4cb8fca1f ("KVM: MMU: nuke shadowed pgtable pages and ptes > on memslot destruction"), and obviously emphasized functionality over > performance. Commit 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow") > added a path to allow flushing only the removed slot's shadow pages, > but x86 just redirected to the "zap all" flow. > > Eventually, it became evident that zapping everything is slow, and so > x86 developed a more efficient hammer in the form of the fast invalidate > mechanism. > > === Other Uses === > When a VM is being destroyed, either there are no active vcpus, i.e. > there's no lock contention, or the VM has ungracefully terminated, in > which case we want to reclaim its pages as quickly as possible, i.e. > not release the MMU lock if there are still CPUs executing in the VM. > > The MMIO generation scenario is almost literally a one-in-a-million > occurrence, i.e. is not a performance sensitive scenario. > > It's worth noting that prior to the "fast invalidate" series being > applied, there was actually another use case of kvm_mmu_zap_all() in > emulator_fix_hypercall() whose existence may have contributed to > improving the performance of "zap all" instead of avoiding it altogether. > But that usage was removed by the series itself in commit 758ccc89b83c > ("KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall"). 
> > [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com > [2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com > [3] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277.html > [4] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2 > > v1: https://patchwork.kernel.org/cover/10756579/ > v2: > - Fix a zap vs flush priority bug in kvm_mmu_remote_flush_or_zap() [me] > - Voluntarily reschedule and/or drop mmu_lock as needed when zapping > all sptes or all MMIO sptes [Paolo] > - Add patches to clean up the memslots/MMIO generation code and fix > a variety of theoretically corner case bugs > > > Sean Christopherson (27): > KVM: Call kvm_arch_memslots_updated() before updating memslots > KVM: x86/mmu: Detect MMIO generation wrap in any address space > KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux > KVM: Explicitly define the "memslot update in-progress" bit > KVM: x86: Use a u64 when passing the MMIO gen around > KVM: x86: Refactor the MMIO SPTE generation handling > KVM: Remove the hack to trigger memslot generation wraparound > KVM: Move the memslot update in-progress flag to bit 63 > KVM: x86/mmu: Move slot_level_*() helper functions up a few lines > KVM: x86/mmu: Split remote_flush+zap case out of > kvm_mmu_flush_or_zap() > KVM: x86/mmu: Zap only the relevant pages when removing a memslot > Revert "KVM: MMU: document fast invalidate all pages" > Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes" > KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes > KVM: x86/mmu: Remove is_obsolete() call > Revert "KVM: MMU: reclaim the zapped-obsolete page first" > Revert "KVM: MMU: collapse TLB flushes when zap all pages" > Revert "KVM: MMU: zap pages in batch" > Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages" > Revert "KVM: MMU: show mmu_valid_gen in shadow page related > tracepoints" > Revert "KVM: x86: use the fast 
way to invalidate all pages" > KVM: x86/mmu: skip over invalid root pages when zapping all sptes > KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes > Revert "KVM: MMU: fast invalidate all pages" > KVM: x86/mmu: Differentiate between nr zapped and list unstable > KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children > KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and > kvm_mmu_zap_mmio_sptes() > > Documentation/virtual/kvm/mmu.txt | 41 +-- > arch/mips/include/asm/kvm_host.h | 2 +- > arch/powerpc/include/asm/kvm_host.h | 2 +- > arch/s390/include/asm/kvm_host.h | 2 +- > arch/x86/include/asm/kvm_host.h | 9 +- > arch/x86/kvm/mmu.c | 442 +++++++++++++--------------- > arch/x86/kvm/mmu.h | 1 - > arch/x86/kvm/mmutrace.h | 42 +-- > arch/x86/kvm/x86.c | 7 +- > arch/x86/kvm/x86.h | 7 +- > include/linux/kvm_host.h | 23 +- > virt/kvm/arm/mmu.c | 2 +- > virt/kvm/kvm_main.c | 39 ++- > 13 files changed, 286 insertions(+), 333 deletions(-) > Queued, thanks. Paolo ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <20190205210137.1377-11-sean.j.christopherson@intel.com>]
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com> @ 2019-08-13 16:04 ` Alex Williamson 2019-08-13 17:04 ` Sean Christopherson 0 siblings, 1 reply; 38+ messages in thread From: Alex Williamson @ 2019-08-13 16:04 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Tue, 5 Feb 2019 13:01:21 -0800 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs > that actually belong to the memslot being removed. This improves > performance, especially when the deleted memslot has only a few shadow > entries, or even no entries. E.g. a microbenchmark to access regular > memory while concurrently reading PCI ROM to trigger memslot deletion > showed a 5% improvement in throughput. > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > --- > arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++- > 1 file changed, 32 insertions(+), 1 deletion(-) A number of vfio users are reporting VM instability issues since v5.1; some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"), which I've confirmed via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and re-verified on current 5.3-rc4 using the below patch to toggle the broken behavior. My reproducer is a Windows 10 VM with assigned GeForce GPU running a variety of tests, including FurMark and PassMark Performance Test. With the code enabled as exists in upstream currently, PassMark will generally introduce graphics glitches or hangs. Sometimes it's necessary to reboot the VM to see these issues. Flipping the 0/1 in the below patch appears to resolve the issue. 
I'd appreciate any insights into further debugging this block of code so that we can fix this regression. Thanks, Alex diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 24843cf49579..6a77306940f7 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, struct kvm_page_track_notifier_node *node) { +#if 0 + kvm_mmu_zap_all(kvm); +#else struct kvm_mmu_page *sp; LIST_HEAD(invalid_list); unsigned long i; @@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, out_unlock: spin_unlock(&kvm->mmu_lock); +#endif } void kvm_mmu_init_vm(struct kvm *kvm) ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 16:04 ` [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot Alex Williamson @ 2019-08-13 17:04 ` Sean Christopherson 2019-08-13 17:57 ` Alex Williamson 0 siblings, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2019-08-13 17:04 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote: > On Tue, 5 Feb 2019 13:01:21 -0800 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs > > that actually belong to the memslot being removed. This improves > > performance, especially why the deleted memslot has only a few shadow > > entries, or even no entries. E.g. a microbenchmark to access regular > > memory while concurrently reading PCI ROM to trigger memslot deletion > > showed a 5% improvement in throughput. > > > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > > --- > > arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++- > > 1 file changed, 32 insertions(+), 1 deletion(-) > > A number of vfio users are reporting VM instability issues since v5.1, > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap > only the relevant pages when removing a memslot"), which I've confirmed > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and > re-verified on current 5.3-rc4 using the below patch to toggle the > broken behavior. > > My reproducer is a Windows 10 VM with assigned GeForce GPU running a > variety of tests, including FurMark and PassMark Performance Test. 
> With the code enabled as exists in upstream currently, PassMark will > generally introduce graphics glitches or hangs. Sometimes it's > necessary to reboot the VM to see these issues. As in, the issue only shows up when the VM is rebooted? Just want to double check that that's not a typo. > Flipping the 0/1 in the below patch appears to resolve the issue. > > I'd appreciate any insights into further debugging this block of code > so that we can fix this regression. Thanks, If it's not too painful to reproduce, I'd say start by determining whether it's a problem with the basic logic or if the cond_resched_lock() handling is wrong. I.e. comment/ifdef out this chunk: if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); flush = false; cond_resched_lock(&kvm->mmu_lock); } > > Alex > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 24843cf49579..6a77306940f7 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, > struct kvm_memory_slot *slot, > struct kvm_page_track_notifier_node *node) > { > +#if 0 > + kvm_mmu_zap_all(kvm); > +#else > struct kvm_mmu_page *sp; > LIST_HEAD(invalid_list); > unsigned long i; > @@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, > > out_unlock: > spin_unlock(&kvm->mmu_lock); > +#endif > } > > void kvm_mmu_init_vm(struct kvm *kvm) ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 17:04 ` Sean Christopherson @ 2019-08-13 17:57 ` Alex Williamson 2019-08-13 19:33 ` Alex Williamson 0 siblings, 1 reply; 38+ messages in thread From: Alex Williamson @ 2019-08-13 17:57 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Tue, 13 Aug 2019 10:04:41 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote: > > On Tue, 5 Feb 2019 13:01:21 -0800 > > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's > > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs > > > that actually belong to the memslot being removed. This improves > > > performance, especially why the deleted memslot has only a few shadow > > > entries, or even no entries. E.g. a microbenchmark to access regular > > > memory while concurrently reading PCI ROM to trigger memslot deletion > > > showed a 5% improvement in throughput. > > > > > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com> > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > > > --- > > > arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++- > > > 1 file changed, 32 insertions(+), 1 deletion(-) > > > > A number of vfio users are reporting VM instability issues since v5.1, > > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap > > only the relevant pages when removing a memslot"), which I've confirmed > > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and > > re-verified on current 5.3-rc4 using the below patch to toggle the > > broken behavior. > > > > My reproducer is a Windows 10 VM with assigned GeForce GPU running a > > variety of tests, including FurMark and PassMark Performance Test. 
> > With the code enabled as exists in upstream currently, PassMark will > > generally introduce graphics glitches or hangs. Sometimes it's > > necessary to reboot the VM to see these issues. > > As in, the issue only shows up when the VM is rebooted? Just want to > double check that that's not a typo. No, it can occur on the first boot as well, it's just that the recipe to induce a failure is not well understood and manifests itself in different ways. I generally run the tests, then if it still hasn't reproduced, I reboot the VM a couple times, running a couple apps in between to try to trigger/notice bad behavior. > > Flipping the 0/1 in the below patch appears to resolve the issue. > > > > I'd appreciate any insights into further debugging this block of code > > so that we can fix this regression. Thanks, > > If it's not too painful to reproduce, I'd say start by determining whether > it's a problem with the basic logic or if the cond_resched_lock() handling > is wrong. I.e. comment/ifdef out this chunk: > > if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { > kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); > flush = false; > cond_resched_lock(&kvm->mmu_lock); > } If anything, removing this chunk seems to make things worse. Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 17:57 ` Alex Williamson @ 2019-08-13 19:33 ` Alex Williamson 2019-08-13 20:19 ` Sean Christopherson 0 siblings, 1 reply; 38+ messages in thread From: Alex Williamson @ 2019-08-13 19:33 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Tue, 13 Aug 2019 11:57:37 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > On Tue, 13 Aug 2019 10:04:41 -0700 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote: > > > On Tue, 5 Feb 2019 13:01:21 -0800 > > > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > > > > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's > > > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs > > > > that actually belong to the memslot being removed. This improves > > > > performance, especially why the deleted memslot has only a few shadow > > > > entries, or even no entries. E.g. a microbenchmark to access regular > > > > memory while concurrently reading PCI ROM to trigger memslot deletion > > > > showed a 5% improvement in throughput. > > > > > > > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com> > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > > > > --- > > > > arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++- > > > > 1 file changed, 32 insertions(+), 1 deletion(-) > > > > > > A number of vfio users are reporting VM instability issues since v5.1, > > > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap > > > only the relevant pages when removing a memslot"), which I've confirmed > > > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and > > > re-verified on current 5.3-rc4 using the below patch to toggle the > > > broken behavior. 
> > > > > > My reproducer is a Windows 10 VM with assigned GeForce GPU running a > > > variety of tests, including FurMark and PassMark Performance Test. > > > With the code enabled as exists in upstream currently, PassMark will > > > generally introduce graphics glitches or hangs. Sometimes it's > > > necessary to reboot the VM to see these issues. > > > > As in, the issue only shows up when the VM is rebooted? Just want to > > double check that that's not a typo. > > No, it can occur on the first boot as well, it's just that the recipe > to induce a failure is not well understood and manifests itself in > different ways. I generally run the tests, then if it still hasn't > reproduced, I reboot the VM a couple times, running a couple apps in > between to try to trigger/notice bad behavior. > > > > Flipping the 0/1 in the below patch appears to resolve the issue. > > > > > > I'd appreciate any insights into further debugging this block of code > > > so that we can fix this regression. Thanks, > > > > If it's not too painful to reproduce, I'd say start by determining whether > > it's a problem with the basic logic or if the cond_resched_lock() handling > > is wrong. I.e. comment/ifdef out this chunk: > > > > if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { > > kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); > > flush = false; > > cond_resched_lock(&kvm->mmu_lock); > > } > > If anything, removing this chunk seems to make things worse. Could it be something with the gfn test: if (sp->gfn != gfn) continue; If I remove it, I can't trigger the misbehavior. 
If I log it, I only get hits on VM boot/reboot and some of the gfns look suspiciously like they could be the assigned GPU BARs and maybe MSI mappings: (sp->gfn) != (gfn) [ 71.829450] gfn fec00 != c02c4 [ 71.835554] gfn ffe00 != c046f [ 71.841664] gfn 0 != c0720 [ 71.847084] gfn 0 != c0720 [ 71.852489] gfn 0 != c0720 [ 71.857899] gfn 0 != c0720 [ 71.863306] gfn 0 != c0720 [ 71.868717] gfn 0 != c0720 [ 71.874122] gfn 0 != c0720 [ 71.879531] gfn 0 != c0720 [ 71.884937] gfn 0 != c0720 [ 71.890349] gfn 0 != c0720 [ 71.895757] gfn 0 != c0720 [ 71.901163] gfn 0 != c0720 [ 71.906569] gfn 0 != c0720 [ 71.911980] gfn 0 != c0720 [ 71.917387] gfn 0 != c0720 [ 71.922808] gfn fee00 != c0edc [ 71.928915] gfn fee00 != c0edc [ 71.935018] gfn fee00 != c0edc [ 71.941730] gfn c1000 != 8002d7 [ 71.948039] gfn 80000 != 8006d5 [ 71.954328] gfn 80000 != 8006d5 [ 71.960600] gfn 80000 != 8006d5 [ 71.966874] gfn 80000 != 8006d5 [ 71.992272] gfn 0 != c0720 [ 71.997683] gfn 0 != c0720 [ 72.003725] gfn 80000 != 8006d5 [ 72.044333] gfn 0 != c0720 [ 72.049743] gfn 0 != c0720 [ 72.055846] gfn 80000 != 8006d5 [ 72.177341] gfn ffe00 != c046f [ 72.183453] gfn 0 != c0720 [ 72.188864] gfn 0 != c0720 [ 72.194290] gfn 0 != c0720 [ 72.200308] gfn 80000 != 8006d5 [ 82.539023] gfn fec00 != c02c4 [ 82.545142] gfn 40000 != c0377 [ 82.551343] gfn ffe00 != c046f [ 82.557466] gfn 100000 != c066f [ 82.563839] gfn 800000 != c06ec [ 82.570133] gfn 800000 != c06ec [ 82.576408] gfn 0 != c0720 [ 82.581850] gfn 0 != c0720 [ 82.587275] gfn 0 != c0720 [ 82.592685] gfn 0 != c0720 [ 82.598131] gfn 0 != c0720 [ 82.603552] gfn 0 != c0720 [ 82.608978] gfn 0 != c0720 [ 82.614419] gfn 0 != c0720 [ 82.619844] gfn 0 != c0720 [ 82.625291] gfn 0 != c0720 [ 82.630791] gfn 0 != c0720 [ 82.636208] gfn 0 != c0720 [ 82.641635] gfn 80a000 != c085e [ 82.647929] gfn fee00 != c0edc [ 82.654062] gfn fee00 != c0edc [ 82.660504] gfn 100000 != c066f [ 82.666800] gfn 0 != c0720 [ 82.672211] gfn 0 != c0720 [ 82.677635] gfn 0 != c0720 [ 82.683060] 
gfn 0 != c0720 [ 82.689209] gfn c1000 != 8002d7 [ 82.695501] gfn 80000 != 8006d5 [ 82.701796] gfn 80000 != 8006d5 [ 82.708092] gfn 100000 != 80099b [ 82.714547] gfn 0 != 800a4c [ 82.720154] gfn 0 != 800a4c [ 82.725752] gfn 0 != 800a4c [ 82.731370] gfn 0 != 800a4c [ 82.738705] gfn 100000 != 80099b [ 82.745201] gfn 0 != 800a4c [ 82.750793] gfn 0 != 800a4c [ 82.756381] gfn 0 != 800a4c [ 82.761979] gfn 0 != 800a4c [ 82.768122] gfn 100000 != 8083a4 [ 82.774605] gfn 0 != 8094aa [ 82.780196] gfn 0 != 8094aa [ 82.785796] gfn 0 != 8094aa [ 82.791412] gfn 0 != 8094aa [ 82.797523] gfn 100000 != 8083a4 [ 82.803977] gfn 0 != 8094aa [ 82.809576] gfn 0 != 8094aa [ 82.815193] gfn 0 != 8094aa [ 82.820809] gfn 0 != 8094aa (GPU has a BAR mapped at 0x80000000) Is this gfn optimization correct? Overzealous? Doesn't account correctly for something about MMIO mappings? Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 19:33 ` Alex Williamson @ 2019-08-13 20:19 ` Sean Christopherson 2019-08-13 20:37 ` Paolo Bonzini 2019-08-15 15:23 ` Alex Williamson 0 siblings, 2 replies; 38+ messages in thread From: Sean Christopherson @ 2019-08-13 20:19 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote: > On Tue, 13 Aug 2019 11:57:37 -0600 > Alex Williamson <alex.williamson@redhat.com> wrote: > Could it be something with the gfn test: > > if (sp->gfn != gfn) > continue; > > If I remove it, I can't trigger the misbehavior. If I log it, I only > get hits on VM boot/reboot and some of the gfns look suspiciously like > they could be the assigned GPU BARs and maybe MSI mappings: > > (sp->gfn) != (gfn) Hits at boot/reboot make sense, memslots get zapped when userspace removes a memory region/slot, e.g. remaps BARs and whatnot. ... > Is this gfn optimization correct? Overzealous? Doesn't account > correctly for something about MMIO mappings? Thanks, Yes? Shadow pages are stored in a hash table, for_each_valid_sp() walks all entries for a given gfn. The sp->gfn check is there to skip entries that hashed to the same list but for a completely different gfn. Skipping the gfn check would be sort of a lightweight zap all in the sense that it would zap shadow pages that happened to collide with the target memslot/gfn but are otherwise unrelated. What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.: if (sp->gfn != gfn && sp->gfn != 0x80000) continue; If that doesn't work, it might be worth trying other gfns to see if you can pinpoint which sp is being zapped as collateral damage. 
It's possible there is a pre-existing bug somewhere else that was being hidden because KVM was effectively zapping all SPTEs during (re)boot, and the hash collision is also hiding the bug by zapping the stale entry. Of course it's also possible this code is wrong, :-) ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 20:19 ` Sean Christopherson @ 2019-08-13 20:37 ` Paolo Bonzini 2019-08-13 21:14 ` Alex Williamson 2019-08-15 15:23 ` Alex Williamson 1 sibling, 1 reply; 38+ messages in thread From: Paolo Bonzini @ 2019-08-13 20:37 UTC (permalink / raw) To: Sean Christopherson, Alex Williamson Cc: Radim Krčmář, kvm, Xiao Guangrong On 13/08/19 22:19, Sean Christopherson wrote: > Yes? Shadow pages are stored in a hash table, for_each_valid_sp() walks > all entries for a given gfn. The sp->gfn check is there to skip entries > that hashed to the same list but for a completely different gfn. > > Skipping the gfn check would be sort of a lightweight zap all in the > sense that it would zap shadow pages that happend to collide with the > target memslot/gfn but are otherwise unrelated. > > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.: > > if (sp->gfn != gfn && sp->gfn != 0x80000) > continue; > > If that doesn't work, it might be worth trying other gfns to see if you > can pinpoint which sp is being zapped as collateral damage. > > It's possible there is a pre-existing bug somewhere else that was being > hidden because KVM was effectively zapping all SPTEs during (re)boot, > and the hash collision is also hiding the bug by zapping the stale entry. > > Of course it's also possible this code is wrong, :-) Also, can you reproduce it with one vCPU? This could (though not really 100%) distinguish a missing invalidation from a race condition. Do we even need the call to slot_handle_all_level? The rmap update should be done already by kvm_mmu_prepare_zap_page (via kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte). Alex, can you replace it with just "flush = false;"? Thanks, Paolo ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 20:37 ` Paolo Bonzini @ 2019-08-13 21:14 ` Alex Williamson 2019-08-13 21:15 ` Paolo Bonzini 2019-08-15 14:46 ` Sean Christopherson 0 siblings, 2 replies; 38+ messages in thread From: Alex Williamson @ 2019-08-13 21:14 UTC (permalink / raw) To: Paolo Bonzini Cc: Sean Christopherson, Radim Krčmář, kvm, Xiao Guangrong On Tue, 13 Aug 2019 22:37:14 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote: > On 13/08/19 22:19, Sean Christopherson wrote: > > Yes? Shadow pages are stored in a hash table, for_each_valid_sp() walks > > all entries for a given gfn. The sp->gfn check is there to skip entries > > that hashed to the same list but for a completely different gfn. > > > > Skipping the gfn check would be sort of a lightweight zap all in the > > sense that it would zap shadow pages that happend to collide with the > > target memslot/gfn but are otherwise unrelated. > > > > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.: > > > > if (sp->gfn != gfn && sp->gfn != 0x80000) > > continue; Not having any luck with this yet. Tried 0x80000, 0x8xxxxx, 0. > > If that doesn't work, it might be worth trying other gfns to see if you > > can pinpoint which sp is being zapped as collateral damage. > > > > It's possible there is a pre-existing bug somewhere else that was being > > hidden because KVM was effectively zapping all SPTEs during (re)boot, > > and the hash collision is also hiding the bug by zapping the stale entry. > > > > Of course it's also possible this code is wrong, :-) > > Also, can you reproduce it with one vCPU? This could (though not really > 100%) distinguish a missing invalidation from a race condition. That's a pretty big change, I'll give it a shot, but not sure how conclusive it would be. > Do we even need the call to slot_handle_all_level? 
The rmap update > should be done already by kvm_mmu_prepare_zap_page (via > kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte). > > Alex, can you replace it with just "flush = false;"? Replace the continue w/ flush = false? I'm not clear on this suggestion. Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 21:14 ` Alex Williamson @ 2019-08-13 21:15 ` Paolo Bonzini 2019-08-13 22:10 ` Alex Williamson 2019-08-15 14:46 ` Sean Christopherson 1 sibling, 1 reply; 38+ messages in thread From: Paolo Bonzini @ 2019-08-13 21:15 UTC (permalink / raw) To: Alex Williamson Cc: Sean Christopherson, Radim Krčmář, kvm, Xiao Guangrong On 13/08/19 23:14, Alex Williamson wrote: >> Do we even need the call to slot_handle_all_level? The rmap update >> should be done already by kvm_mmu_prepare_zap_page (via >> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte). >> >> Alex, can you replace it with just "flush = false;"? > Replace the continue w/ flush = false? I'm not clear on this > suggestion. Thanks, diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 24843cf49579..382b3ee303e3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm if (list_empty(&kvm->arch.active_mmu_pages)) goto out_unlock; - flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false); + flush = false; for (i = 0; i < slot->npages; i++) { gfn = slot->base_gfn + i; ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 21:15 ` Paolo Bonzini @ 2019-08-13 22:10 ` Alex Williamson 0 siblings, 0 replies; 38+ messages in thread From: Alex Williamson @ 2019-08-13 22:10 UTC (permalink / raw) To: Paolo Bonzini Cc: Sean Christopherson, Radim Krčmář, kvm, Xiao Guangrong On Tue, 13 Aug 2019 23:15:38 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote: > On 13/08/19 23:14, Alex Williamson wrote: > >> Do we even need the call to slot_handle_all_level? The rmap update > >> should be done already by kvm_mmu_prepare_zap_page (via > >> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte). > >> > >> Alex, can you replace it with just "flush = false;"? > > Replace the continue w/ flush = false? I'm not clear on this > > suggestion. Thanks, > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 24843cf49579..382b3ee303e3 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm > if (list_empty(&kvm->arch.active_mmu_pages)) > goto out_unlock; > > - flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false); > + flush = false; > > for (i = 0; i < slot->npages; i++) { > gfn = slot->base_gfn + i; > Things are not happy with that, it managed to crash the host: [ 871.244789] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 871.258680] #PF: supervisor read access in kernel mode [ 871.268939] #PF: error_code(0x0000) - not-present page [ 871.279202] PGD 0 P4D 0 [ 871.284264] Oops: 0000 [#1] SMP PTI [ 871.291232] CPU: 3 PID: 1954 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4 [ 871.304264] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 [ 871.322523] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm] [ 871.332085] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 
d2 0f b6 42 24 83 e0 0f 83 e8 01 8d [ 871.369560] RSP: 0018:ffffc10300a0bb18 EFLAGS: 00010202 [ 871.379995] RAX: 00000000000c1040 RBX: ffffc103007cd000 RCX: 0000000000000014 [ 871.394243] RDX: ffff9b3bbff18130 RSI: 00000000000c1080 RDI: 0000000000000004 [ 871.408491] RBP: ffff9b3bced15400 R08: ffff9b3b94a90008 R09: ffff9b3b94a90000 [ 871.422739] R10: ffff9b3b94a90168 R11: 0000000000000004 R12: ffff9b3bbff18130 [ 871.436986] R13: ffffc103007cd000 R14: 0000000000000000 R15: 00000000000c1000 [ 871.451234] FS: 00007fc27b37d700(0000) GS:ffff9b3f9f4c0000(0000) knlGS:0000000000000000 [ 871.467390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 871.478865] CR2: 0000000000000000 CR3: 000000003bb2a004 CR4: 00000000001626e0 [ 871.493113] Call Trace: [ 871.498012] drop_spte+0x77/0xa0 [kvm] [ 871.505500] mmu_page_zap_pte+0xac/0xe0 [kvm] [ 871.514200] __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm] [ 871.524809] kvm_mmu_invalidate_zap_pages_in_memslot+0xb7/0x130 [kvm] [ 871.537670] kvm_page_track_flush_slot+0x55/0x80 [kvm] [ 871.547928] __kvm_set_memory_region+0x821/0xaa0 [kvm] [ 871.558191] kvm_set_memory_region+0x26/0x40 [kvm] [ 871.567759] kvm_vm_ioctl+0x59a/0x940 [kvm] [ 871.576106] ? __seccomp_filter+0x7a/0x680 [ 871.584287] do_vfs_ioctl+0xa4/0x630 [ 871.591430] ? security_file_ioctl+0x32/0x50 [ 871.599954] ksys_ioctl+0x60/0x90 [ 871.606575] __x64_sys_ioctl+0x16/0x20 [ 871.614066] do_syscall_64+0x5f/0x1a0 [ 871.621379] entry_SYSCALL_64_after_hwframe+0x44/0xa9 It also failed with the 1 vcpu test. Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 21:14 ` Alex Williamson 2019-08-13 21:15 ` Paolo Bonzini @ 2019-08-15 14:46 ` Sean Christopherson 1 sibling, 0 replies; 38+ messages in thread From: Sean Christopherson @ 2019-08-15 14:46 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong [-- Attachment #1: Type: text/plain, Size: 1208 bytes --] On Tue, Aug 13, 2019 at 03:14:17PM -0600, Alex Williamson wrote: > On Tue, 13 Aug 2019 22:37:14 +0200 > Paolo Bonzini <pbonzini@redhat.com> wrote: > > > On 13/08/19 22:19, Sean Christopherson wrote: > > > Yes? Shadow pages are stored in a hash table, for_each_valid_sp() walks > > > all entries for a given gfn. The sp->gfn check is there to skip entries > > > that hashed to the same list but for a completely different gfn. > > > > > > Skipping the gfn check would be sort of a lightweight zap all in the > > > sense that it would zap shadow pages that happend to collide with the > > > target memslot/gfn but are otherwise unrelated. > > > > > > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.: > > > > > > if (sp->gfn != gfn && sp->gfn != 0x80000) > > > continue; > > Not having any luck with this yet. Tried 0x80000, 0x8xxxxx, 0. I've no idea if it would actually be interesting, but something to try would be to zap only emulated mmio SPTEs (in addition to the memslot). If that test passes then I think it might indicate a problem with device enumeration as opposed to the mapping of the device itself ("think" and "might" being the operative words). Patch attached. 
[-- Attachment #2: emulated_mmio_only.patch --] [-- Type: text/x-diff, Size: 947 bytes --] From df7d1bf7025c20703f71e126dc673ba58230605f Mon Sep 17 00:00:00 2001 From: Sean Christopherson <sean.j.christopherson@intel.com> Date: Thu, 15 Aug 2019 07:27:05 -0700 Subject: [PATCH] tmp --- arch/x86/kvm/mmu.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 24843cf49579..d5bbb7ed716f 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -5649,6 +5649,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu) return alloc_mmu_pages(vcpu); } +static void __kvm_mmu_zap_all(struct kvm *kvm, bool mmio_only); + static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, struct kvm_page_track_notifier_node *node) @@ -5685,6 +5687,8 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, out_unlock: spin_unlock(&kvm->mmu_lock); + + __kvm_mmu_zap_all(kvm, true); } void kvm_mmu_init_vm(struct kvm *kvm) -- 2.22.0 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-13 20:19 ` Sean Christopherson 2019-08-13 20:37 ` Paolo Bonzini @ 2019-08-15 15:23 ` Alex Williamson 2019-08-15 16:00 ` Sean Christopherson 2019-08-19 16:03 ` Paolo Bonzini 1 sibling, 2 replies; 38+ messages in thread From: Alex Williamson @ 2019-08-15 15:23 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Tue, 13 Aug 2019 13:19:14 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote: > > On Tue, 13 Aug 2019 11:57:37 -0600 > > Alex Williamson <alex.williamson@redhat.com> wrote: > > > Could it be something with the gfn test: > > > > if (sp->gfn != gfn) > > continue; > > > > If I remove it, I can't trigger the misbehavior. If I log it, I only > > get hits on VM boot/reboot and some of the gfns look suspiciously like > > they could be the assigned GPU BARs and maybe MSI mappings: > > > > (sp->gfn) != (gfn) > > Hits at boot/reboot makes sense, memslots get zapped when userspace > removes a memory region/slot, e.g. remaps BARs and whatnot. > > ... > > > Is this gfn optimization correct? Overzealous? Doesn't account > > correctly for something about MMIO mappings? Thanks, > > Yes? Shadow pages are stored in a hash table, for_each_valid_sp() walks > all entries for a given gfn. The sp->gfn check is there to skip entries > that hashed to the same list but for a completely different gfn. > > Skipping the gfn check would be sort of a lightweight zap all in the > sense that it would zap shadow pages that happend to collide with the > target memslot/gfn but are otherwise unrelated. > > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.: > > if (sp->gfn != gfn && sp->gfn != 0x80000) > continue; > > If that doesn't work, it might be worth trying other gfns to see if you > can pinpoint which sp is being zapped as collateral damage. 
> It's possible there is a pre-existing bug somewhere else that was being > hidden because KVM was effectively zapping all SPTEs during (re)boot, > and the hash collision is also hiding the bug by zapping the stale entry. > > Of course it's also possible this code is wrong, :-) Ok, fun day of trying to figure out which ranges are relevant, I've narrowed it down to all of these: 0xffe00 0xfee00 0xfec00 0xc1000 0x80a000 0x800000 0x100000 i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and 0x80000 can take the continue branch without seeing bad behavior in the VM. The assigned GPU has BARs at GPAs: 0xc0000000-0xc0ffffff 0x800000000-0x808000000 0x808000000-0x809ffffff And the assigned companion audio function is at GPA: 0xc1080000-0xc1083fff Only one of those seems to align very well with a gfn base involved here. The virtio ethernet has an mmio range at GPA 0x80a000000, otherwise I don't find any other I/O devices coincident with the gfns above. I'm running the VM with 2MB hugepages, but I believe the issue still occurs with standard pages. When run with standard pages I see more hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to the set above that cannot take the continue branch. I don't know if that means anything. Any further ideas what to look for? Thanks, Alex PS - I see the posted workaround patch, I'll test that in the interim. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-15 15:23 ` Alex Williamson @ 2019-08-15 16:00 ` Sean Christopherson 2019-08-15 18:16 ` Alex Williamson 2019-08-19 16:03 ` Paolo Bonzini 1 sibling, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2019-08-15 16:00 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote: > Ok, fun day of trying to figure out which ranges are relevant, I've > narrowed it down to all of these: > > 0xffe00 > 0xfee00 > 0xfec00 APIC and I/O APIC stuff > 0xc1000 Assigned audio > 0x80a000 ? > 0x800000 GPU BAR > 0x100000 ? The APIC ranges are puzzling, I wouldn't expect their mappings to change. > ie. I can effective only say that sp->gfn values of 0x0, 0x40000, and > 0x80000 can take the continue branch without seeing bad behavior in the > VM. > > The assigned GPU has BARs at GPAs: > > 0xc0000000-0xc0ffffff > 0x800000000-0x808000000 > 0x808000000-0x809ffffff > > And the assigned companion audio function is at GPA: > > 0xc1080000-0xc1083fff > > Only one of those seems to align very well with a gfn base involved > here. The virtio ethernet has an mmio range at GPA 0x80a000000, > otherwise I don't find any other I/O devices coincident with the gfns > above. > > I'm running the VM with 2MB hugepages, but I believe the issue still > occurs with standard pages. When run with standard pages I see more > hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to > the set above that cannot take the continue branch. I don't know if > that means anything. > > Any further ideas what to look for? Thanks, Maybe try isolating which memslot removal causes problems? E.g. flush the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush only the memslot's gfns. 
Based on the log you sent a while back for gfn mismatches, I'm guessing the culprits are all GPU BARs, but it's probably worth confirming. That might also explain why gfn == 0x80000 can take the continue branch, i.e. if removing the corresponding memslot is what's causing problems, then it's being flushed and not actually taking the continue path. One other thought would be to force a call to kvm_flush_remote_tlbs(kvm), e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap(). Maybe it's a case where there are no SPTEs for the memslot, but the TLB flush is needed for some reason. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-15 16:00 ` Sean Christopherson @ 2019-08-15 18:16 ` Alex Williamson 2019-08-15 19:25 ` Sean Christopherson 0 siblings, 1 reply; 38+ messages in thread From: Alex Williamson @ 2019-08-15 18:16 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Thu, 15 Aug 2019 09:00:06 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote: > > Ok, fun day of trying to figure out which ranges are relevant, I've > > narrowed it down to all of these: > > > > 0xffe00 > > 0xfee00 > > 0xfec00 > > APIC and I/O APIC stuff > > > 0xc1000 > > Assigned audio > > > 0x80a000 > > ? > > > 0x800000 > > GPU BAR > > > 0x100000 > > ? > > The APIC ranges are puzzling, I wouldn't expect their mappings to change. > > > ie. I can effective only say that sp->gfn values of 0x0, 0x40000, and > > 0x80000 can take the continue branch without seeing bad behavior in the > > VM. > > > > The assigned GPU has BARs at GPAs: > > > > 0xc0000000-0xc0ffffff > > 0x800000000-0x808000000 > > 0x808000000-0x809ffffff > > > > And the assigned companion audio function is at GPA: > > > > 0xc1080000-0xc1083fff > > > > Only one of those seems to align very well with a gfn base involved > > here. The virtio ethernet has an mmio range at GPA 0x80a000000, > > otherwise I don't find any other I/O devices coincident with the gfns > > above. > > > > I'm running the VM with 2MB hugepages, but I believe the issue still > > occurs with standard pages. When run with standard pages I see more > > hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to > > the set above that cannot take the continue branch. I don't know if > > that means anything. > > > > Any further ideas what to look for? Thanks, > > Maybe try isolating which memslot removal causes problems? E.g. 
flush > the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush > only the memslot's gfns. Based on the log you sent a while back for gfn > mismatches, I'm guessing the culprits are all GPU BARs, but it's > probably worth confirming. That might also explain why gfn == 0x80000 > can take the continue branch, i.e. if removing the corresponding memslot > is what's causing problems, then it's being flushed and not actually > taking the continue path. If I print out the memslot base_gfn, it seems pretty evident that only the assigned device mappings are triggering this branch. The base_gfns exclusively include: 0x800000 0x808000 0xc0089 Where the first two clearly match the 64bit BARs and the last is the result of a page that we need to emulate within the BAR @0xc0000000 at offset 0x88000, so the base_gfn is the remaining direct mapping. I don't know if this implies we're doing something wrong for assigned device slots, but maybe a more targeted workaround would be if we could specifically identify these slots, though there's no special registration of them versus other slots. Did you have any non-device assignment test cases that took this branch when developing the series? > One other thought would be to force a call to kvm_flush_remote_tlbs(kvm), > e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap(). > Maybe it's a case where there are no SPTEs for the memslot, but the TLB > flush is needed for some reason. This doesn't work. Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-15 18:16 ` Alex Williamson @ 2019-08-15 19:25 ` Sean Christopherson 2019-08-15 20:11 ` Alex Williamson 0 siblings, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2019-08-15 19:25 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote: > On Thu, 15 Aug 2019 09:00:06 -0700 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > If I print out the memslot base_gfn, it seems pretty evident that only > the assigned device mappings are triggering this branch. The base_gfns > exclusively include: > > 0x800000 > 0x808000 > 0xc0089 > > Where the first two clearly match the 64bit BARs and the last is the > result of a page that we need to emulate within the BAR @0xc0000000 at > offset 0x88000, so the base_gfn is the remaining direct mapping. That's consistent with my understanding of userspace, e.g. normal memory regions aren't deleted until the VM is shut down (barring hot unplug). > I don't know if this implies we're doing something wrong for assigned > device slots, but maybe a more targeted workaround would be if we could > specifically identify these slots, though there's no special > registration of them versus other slots. What is triggering the memslot removal/update? Is it possible that whatever action is occurring is modifying multiple memslots? E.g. KVM's memslot-only zapping is allowing the guest to access stale entries for the unzapped-but-related memslots, whereas the full zap does not. FYI, my VFIO/GPU/PCI knowledge is abysmal, please speak up if any of my ideas are nonsensical.
The primary testing was performance oriented, using a slightly modified version of a synthetic benchmark[1] from a previous series[2] that touched the memslot flushing flow. From a functional perspective, I highly doubt that test would have been able to expose an improper zapping bug. We do have some amount of coverage via kvm-unit-tests, as an EPT test was triggering a slab bug due to not actually zapping the collected SPTEs[3]. [1] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2 [2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com [3] https://patchwork.kernel.org/patch/10899283/ > > One other thought would be to force a call to kvm_flush_remote_tlbs(kvm), > > e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap(). > > Maybe it's a case where there are no SPTEs for the memslot, but the TLB > > flush is needed for some reason. > > This doesn't work. Thanks, > > Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-15 19:25 ` Sean Christopherson @ 2019-08-15 20:11 ` Alex Williamson 0 siblings, 0 replies; 38+ messages in thread From: Alex Williamson @ 2019-08-15 20:11 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Thu, 15 Aug 2019 12:25:31 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote: > > On Thu, 15 Aug 2019 09:00:06 -0700 > > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > > If I print out the memslot base_gfn, it seems pretty evident that only > > the assigned device mappings are triggering this branch. The base_gfns > > exclusively include: > > > > 0x800000 > > 0x808000 > > 0xc0089 > > > > Where the first two clearly match the 64bit BARs and the last is the > > result of a page that we need to emulate within the BAR @0xc0000000 at > > offset 0x88000, so the base_gfn is the remaining direct mapping. > > That's consistent with my understanding of userspace, e.g. normal memory > regions aren't deleted until the VM is shut down (barring hot unplug). > > > I don't know if this implies we're doing something wrong for assigned > > device slots, but maybe a more targeted workaround would be if we could > > specifically identify these slots, though there's no special > > registration of them versus other slots. > > What is triggering the memslot removal/update? Is it possible that > whatever action is occuring is modifying multiple memslots? E.g. KVM's > memslot-only zapping is allowing the guest to access stale entries for > the unzapped-but-related memslots, whereas the full zap does not. > > FYI, my VFIO/GPU/PCI knowledge is abysmal, please speak up if any of my > ideas are nonsensical. 
The memory bit in the PCI command register of config space for each device controls whether the device decoders are active for the MMIO BAR ranges. These get flipped as both the guest firmware and guest OS enumerate and assign resources to the PCI subsystem. Generally these are not further manipulated while the guest OS is running except for hotplug operations. The guest OS device driver will generally perform the final enable of these ranges and they'll remain enabled until the guest is rebooted. I recall somewhere in this thread you referenced reading the ROM as part of the performance testing of this series. The ROM has its own enable bit within the ROM BAR as the PCI spec allows devices to share decoders between the standard BARs and the ROM BAR. Enabling and disabling the enable bit in the ROM BAR should be very similar in memslot behavior to the overall memory enable bit for the other BARs within the device. Note that often the faults that I'm seeing occur a long time after BAR mappings are finalized; usually (not always) the VM boots to a fully functional desktop and it's only as I run various tests that the glitches start to appear. For instance, when I allowed sp->gfn 0xfec00 to take the continue branch, I got an OpenCL error. For either 0xffee00 or 0xc1000 I got graphics glitches, for example stray geometric artifacts flashed on the screen. For 0x100000 and 0x800000 I'd get a black screen or blank 3D graphics window. For 0x80a000 the VM froze (apparently). I can't say whether each of these is a consistent failure mode; I only tested to the point of determining whether a range generates an error. > > Did you have any non-device > > assignment test cases that took this branch when developing the series? > > The primary testing was performance oriented, using a slightly modified > version of a synthetic benchmark[1] from a previous series[2] that touched > the memslot flushing flow.
From a functional perspective, I highly doubt > that test would have been able to expose an improper zapping bug. :-\ It seems like there's probably some sort of inflection point where it becomes faster to zap all pages versus the overhead of walking every page in a memory slot; was that evaluated? Not sure if that's relevant here, but curious. Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-15 15:23 ` Alex Williamson 2019-08-15 16:00 ` Sean Christopherson @ 2019-08-19 16:03 ` Paolo Bonzini 2019-08-20 20:03 ` Sean Christopherson 1 sibling, 1 reply; 38+ messages in thread From: Paolo Bonzini @ 2019-08-19 16:03 UTC (permalink / raw) To: Alex Williamson, Sean Christopherson Cc: Radim Krčmář, kvm, Xiao Guangrong On 15/08/19 17:23, Alex Williamson wrote: > 0xffe00 > 0xfee00 > 0xfec00 > 0xc1000 > 0x80a000 > 0x800000 > 0x100000 > > ie. I can effective only say that sp->gfn values of 0x0, 0x40000, and > 0x80000 can take the continue branch without seeing bad behavior in the > VM. > > The assigned GPU has BARs at GPAs: > > 0xc0000000-0xc0ffffff > 0x800000000-0x808000000 > 0x808000000-0x809ffffff > > And the assigned companion audio function is at GPA: > > 0xc1080000-0xc1083fff > > Only one of those seems to align very well with a gfn base involved > here. The virtio ethernet has an mmio range at GPA 0x80a000000, > otherwise I don't find any other I/O devices coincident with the gfns > above. The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00. The audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000 includes both userspace-MMIO and device-MMIO memory. The virtio-net BAR is also userspace-MMIO. It seems like the problem occurs when the sp->gfn you "continue over" includes a userspace-MMIO gfn. But since I have no better ideas right now, I'm going to apply the revert (we don't know for sure that it only happens with assigned devices). Paolo > I'm running the VM with 2MB hugepages, but I believe the issue still > occurs with standard pages. When run with standard pages I see more > hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to > the set above that cannot take the continue branch. I don't know if > that means anything. > > Any further ideas what to look for? 
Thanks, > > Alex > > PS - I see the posted workaround patch, I'll test that in the interim. > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-19 16:03 ` Paolo Bonzini @ 2019-08-20 20:03 ` Sean Christopherson 2019-08-20 20:42 ` Alex Williamson 2020-06-26 17:32 ` Sean Christopherson 0 siblings, 2 replies; 38+ messages in thread From: Sean Christopherson @ 2019-08-20 20:03 UTC (permalink / raw) To: Paolo Bonzini Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong [-- Attachment #1: Type: text/plain, Size: 2853 bytes --] On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote: > On 15/08/19 17:23, Alex Williamson wrote: > > 0xffe00 > > 0xfee00 > > 0xfec00 > > 0xc1000 > > 0x80a000 > > 0x800000 > > 0x100000 > > > > ie. I can effective only say that sp->gfn values of 0x0, 0x40000, and > > 0x80000 can take the continue branch without seeing bad behavior in the > > VM. > > > > The assigned GPU has BARs at GPAs: > > > > 0xc0000000-0xc0ffffff > > 0x800000000-0x808000000 > > 0x808000000-0x809ffffff > > > > And the assigned companion audio function is at GPA: > > > > 0xc1080000-0xc1083fff > > > > Only one of those seems to align very well with a gfn base involved > > here. The virtio ethernet has an mmio range at GPA 0x80a000000, > > otherwise I don't find any other I/O devices coincident with the gfns > > above. > > The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00. The > audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000 > includes both userspace-MMIO and device-MMIO memory. The virtio-net BAR > is also userspace-MMIO. > > It seems like the problem occurs when the sp->gfn you "continue over" > includes a userspace-MMIO gfn. But since I have no better ideas right > now, I'm going to apply the revert (we don't know for sure that it only > happens with assigned devices). After many hours of staring, I've come to the conclusion that kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e. a revert is definitely in order for 5.3 and stable. 
mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of the shadow ptes, e.g. for_each_valid_sp() will find page tables, page directories, etc... Passing in the raw gfns of the memslot doesn't work because the gfn isn't properly adjusted/aligned to match how KVM tracks gfns for shadow pages, e.g. removal of the companion audio memslot that occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find the shadow page table containing the relevant sptes. This is why Paolo's suggestion to remove slot_handle_all_level() on kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's accounting without actually zapping the relevant sptes. All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 were sensitive to (lack of) zapping. My theory is that zapping what were effectively random-but-interesting shadow pages cleaned things up enough to avoid noticeable badness. Alex, Can you please test the attached patch? It implements a very slimmed down version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes pointing at the memslot being removed, which was the original intent of kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it crashes the host. I'm hopeful it's correct, but given how broken the previous version was, I'm not exactly confident. 
[-- Attachment #2: 0001-KVM-x86-MMU-Rewrite-zapping-pages-in-memslot-to-fix-.patch --] [-- Type: text/x-diff, Size: 2936 bytes --] From 90e4b3dcb33a52da267f1099289ec7b71ad6a975 Mon Sep 17 00:00:00 2001 From: Sean Christopherson <sean.j.christopherson@intel.com> Date: Tue, 20 Aug 2019 12:00:13 -0700 Subject: [PATCH] KVM: x86/MMU: Rewrite zapping pages in memslot to fix major flaws TODO: write a proper changelog if this actually works Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> --- arch/x86/kvm/mmu.c | 58 +++++++++++++++++++++++++++------------------- 1 file changed, 34 insertions(+), 24 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 24843cf49579..c91c3472821b 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -5653,37 +5653,47 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, struct kvm_page_track_notifier_node *node) { - struct kvm_mmu_page *sp; + struct kvm_mmu_page *sp, *tmp; LIST_HEAD(invalid_list); - unsigned long i; - bool flush; - gfn_t gfn; + int max_level, ign; + gfn_t gfn_mask; + + /* + * Zapping upper level shadow pages isn't required, worst case scenario + * we'll have unused shadow pages with no children that aren't zapped + * until they're recycled due to age or when the VM is destroyed. Skip + * shadow pages that can't point directly at the memslot. + */ + max_level = kvm_largepages_enabled() ? kvm_x86_ops->get_lpage_level() : + PT_PAGE_TABLE_LEVEL; + while (slot->npages < KVM_PAGES_PER_HPAGE(max_level)) + --max_level; spin_lock(&kvm->mmu_lock); +restart: + list_for_each_entry_safe(sp, tmp, &kvm->arch.active_mmu_pages, link) { + if (sp->role.level > max_level) + continue; + if (sp->role.invalid && sp->root_count) + continue; - if (list_empty(&kvm->arch.active_mmu_pages)) - goto out_unlock; + /* + * Note, the mask is calculated using level+1. 
We're looking + * for shadow pages with sptes that point at the to-be-removed + * memslot, not just for shadow pages residing in the memslot. + */ + gfn_mask = ~(KVM_PAGES_PER_HPAGE(sp->role.level + 1) - 1); + if (sp->gfn < (slot->base_gfn & gfn_mask) || + sp->gfn > ((slot->base_gfn + slot->npages - 1) & gfn_mask)) + continue; - flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false); + if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign)) + goto restart; - for (i = 0; i < slot->npages; i++) { - gfn = slot->base_gfn + i; - - for_each_valid_sp(kvm, sp, gfn) { - if (sp->gfn != gfn) - continue; - - kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); - } - if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { - kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); - flush = false; - cond_resched_lock(&kvm->mmu_lock); - } + if (cond_resched_lock(&kvm->mmu_lock)) + goto restart; } - kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); - -out_unlock: + kvm_mmu_commit_zap_page(kvm, &invalid_list); spin_unlock(&kvm->mmu_lock); } -- 2.22.0 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-20 20:03 ` Sean Christopherson @ 2019-08-20 20:42 ` Alex Williamson 2019-08-20 21:02 ` Sean Christopherson 2020-06-26 17:32 ` Sean Christopherson 1 sibling, 1 reply; 38+ messages in thread From: Alex Williamson @ 2019-08-20 20:42 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Tue, 20 Aug 2019 13:03:19 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote: > > On 15/08/19 17:23, Alex Williamson wrote: > > > 0xffe00 > > > 0xfee00 > > > 0xfec00 > > > 0xc1000 > > > 0x80a000 > > > 0x800000 > > > 0x100000 > > > > > > ie. I can effective only say that sp->gfn values of 0x0, 0x40000, and > > > 0x80000 can take the continue branch without seeing bad behavior in the > > > VM. > > > > > > The assigned GPU has BARs at GPAs: > > > > > > 0xc0000000-0xc0ffffff > > > 0x800000000-0x808000000 > > > 0x808000000-0x809ffffff > > > > > > And the assigned companion audio function is at GPA: > > > > > > 0xc1080000-0xc1083fff > > > > > > Only one of those seems to align very well with a gfn base involved > > > here. The virtio ethernet has an mmio range at GPA 0x80a000000, > > > otherwise I don't find any other I/O devices coincident with the gfns > > > above. > > > > The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00. The > > audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000 > > includes both userspace-MMIO and device-MMIO memory. The virtio-net BAR > > is also userspace-MMIO. > > > > It seems like the problem occurs when the sp->gfn you "continue over" > > includes a userspace-MMIO gfn. But since I have no better ideas right > > now, I'm going to apply the revert (we don't know for sure that it only > > happens with assigned devices). 
> > After many hours of staring, I've come to the conclusion that > kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e. > a revert is definitely in order for 5.3 and stable. > > mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of > the shadow ptes, e.g. for_each_valid_sp() will find page tables, page > directories, etc... Passing in the raw gfns of the memslot doesn't work > because the gfn isn't properly adjusted/aligned to match how KVM tracks > gfns for shadow pages, e.g. removal of the companion audio memslot that > occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find > the shadow page table containing the relevant sptes. > > This is why Paolo's suggestion to remove slot_handle_all_level() on > kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's > accounting without actually zapping the relevant sptes. > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 > were sensitive to (lack of) zapping. My theory is that zapping what were > effectively random-but-interesting shadow pages cleaned things up enough > to avoid noticeable badness. > > > Alex, > > Can you please test the attached patch? It implements a very slimmed down > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes > pointing at the memslot being removed, which was the original intent of > kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it > crashes the host. I'm hopeful it's correct, but given how broken the > previous version was, I'm not exactly confident. It doesn't crash the host, but the guest is not happy, failing to boot the desktop in one case and triggering errors in the guest w/o even running test programs in another case. Seems like it might be worse than previous. Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-20 20:42 ` Alex Williamson @ 2019-08-20 21:02 ` Sean Christopherson 2019-08-21 19:08 ` Alex Williamson 0 siblings, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2019-08-20 21:02 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote: > On Tue, 20 Aug 2019 13:03:19 -0700 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 > > were sensitive to (lack of) zapping. My theory is that zapping what were > > effectively random-but-interesting shadow pages cleaned things up enough > > to avoid noticeable badness. > > > > > > Alex, > > > > Can you please test the attached patch? It implements a very slimmed down > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes > > pointing at the memslot being removed, which was the original intent of > > kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it > > crashes the host. I'm hopeful it's correct, but given how broken the > > previous version was, I'm not exactly confident. > > It doesn't crash the host, but the guest is not happy, failing to boot > the desktop in one case and triggering errors in the guest w/o even > running test programs in another case. Seems like it might be worse > than previous. Thanks, Hrm, I'm back to being completely flummoxed. Would you be able to generate a trace of all events/kvmmmu, using the latest patch? I'd like to rule out a stupid code bug if it's not too much trouble. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-20 21:02 ` Sean Christopherson @ 2019-08-21 19:08 ` Alex Williamson 2019-08-21 19:35 ` Alex Williamson 2019-08-21 20:10 ` Sean Christopherson 0 siblings, 2 replies; 38+ messages in thread From: Alex Williamson @ 2019-08-21 19:08 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Tue, 20 Aug 2019 14:02:45 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote: > > On Tue, 20 Aug 2019 13:03:19 -0700 > > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 > > > were sensitive to (lack of) zapping. My theory is that zapping what were > > > effectively random-but-interesting shadow pages cleaned things up enough > > > to avoid noticeable badness. > > > > > > > > > Alex, > > > > > > Can you please test the attached patch? It implements a very slimmed down > > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes > > > pointing at the memslot being removed, which was the original intent of > > > kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it > > > crashes the host. I'm hopeful it's correct, but given how broken the > > > previous version was, I'm not exactly confident. > > > > It doesn't crash the host, but the guest is not happy, failing to boot > > the desktop in one case and triggering errors in the guest w/o even > > running test programs in another case. Seems like it might be worse > > than previous. Thanks, > > Hrm, I'm back to being completely flummoxed. > > Would you be able to generate a trace of all events/kvmmmu, using the > latest patch? I'd like to rule out a stupid code bug if it's not too > much trouble. 
I tried to simplify the patch, making it closer to zap_all, so I removed the max_level calculation and exclusion based on s->role.level, as well as the gfn range filtering. For good measure I even removed the sp->root_count test, so any sp not marked invalid is zapped. This works, and I can also add back the sp->root_count test and things remain working. From there I added back the gfn range test, but I left out the gfn_mask because I'm not doing the level filtering and I think(?) this is just another optimization. So essentially I only add: if (sp->gfn < slot->base_gfn || sp->gfn > (slot->base_gfn + slot->npages - 1)) continue; Not only does this not work, the host will sometimes oops: [ 808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 808.555065] #PF: supervisor read access in kernel mode [ 808.565326] #PF: error_code(0x0000) - not-present page [ 808.575588] PGD 0 P4D 0 [ 808.580649] Oops: 0000 [#1] SMP PTI [ 808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4 [ 808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 [ 808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm] [ 808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d [ 808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202 [ 808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014 [ 808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004 [ 808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000 [ 808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260 [ 808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004 [ 808.747620] FS: 00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000 [ 808.763776] CS: 0010 DS: 0000 ES: 0000 CR0: 
0000000080050033 [ 808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0 [ 808.789499] Call Trace: [ 808.794399] drop_spte+0x77/0xa0 [kvm] [ 808.801885] mmu_page_zap_pte+0xac/0xe0 [kvm] [ 808.810587] __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm] [ 808.821196] kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm] [ 808.833881] kvm_page_track_flush_slot+0x55/0x80 [kvm] [ 808.844140] __kvm_set_memory_region+0x821/0xaa0 [kvm] [ 808.854402] kvm_set_memory_region+0x26/0x40 [kvm] [ 808.863971] kvm_vm_ioctl+0x59a/0x940 [kvm] [ 808.872318] ? pagevec_lru_move_fn+0xb8/0xd0 [ 808.880846] ? __seccomp_filter+0x7a/0x680 [ 808.889028] do_vfs_ioctl+0xa4/0x630 [ 808.896168] ? security_file_ioctl+0x32/0x50 [ 808.904695] ksys_ioctl+0x60/0x90 [ 808.911316] __x64_sys_ioctl+0x16/0x20 [ 808.918807] do_syscall_64+0x5f/0x1a0 [ 808.926121] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 808.936209] RIP: 0033:0x7f28ebf2b0fb Does this suggest something is still fundamentally wrong with the premise of this change or have I done something stupid? Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-21 19:08 ` Alex Williamson @ 2019-08-21 19:35 ` Alex Williamson 2019-08-21 20:30 ` Sean Christopherson 2019-08-21 20:10 ` Sean Christopherson 1 sibling, 1 reply; 38+ messages in thread From: Alex Williamson @ 2019-08-21 19:35 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Wed, 21 Aug 2019 13:08:59 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > On Tue, 20 Aug 2019 14:02:45 -0700 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote: > > > On Tue, 20 Aug 2019 13:03:19 -0700 > > > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00 > > > > were sensitive to (lack of) zapping. My theory is that zapping what were > > > > effectively random-but-interesting shadow pages cleaned things up enough > > > > to avoid noticeable badness. > > > > > > > > > > > > Alex, > > > > > > > > Can you please test the attached patch? It implements a very slimmed down > > > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes > > > > pointing at the memslot being removed, which was the original intent of > > > > kvm_mmu_invalidate_zap_pages_in_memslot(). I apologize in advance if it > > > > crashes the host. I'm hopeful it's correct, but given how broken the > > > > previous version was, I'm not exactly confident. > > > > > > It doesn't crash the host, but the guest is not happy, failing to boot > > > the desktop in one case and triggering errors in the guest w/o even > > > running test programs in another case. Seems like it might be worse > > > than previous. Thanks, > > > > Hrm, I'm back to being completely flummoxed. > > > > Would you be able to generate a trace of all events/kvmmmu, using the > > latest patch? 
I'd like to rule out a stupid code bug if it's not too > > much trouble. > > I tried to simplify the patch, making it closer to zap_all, so I > removed the max_level calculation and exclusion based on s->role.level, > as well as the gfn range filtering. For good measure I even removed > the sp->root_count test, so any sp not marked invalid is zapped. This > works, and I can also add back the sp->root_count test and things > remain working. > > From there I added back the gfn range test, but I left out the gfn_mask > because I'm not doing the level filtering and I think(?) this is just > another optimization. So essentially I only add: > > if (sp->gfn < slot->base_gfn || > sp->gfn > (slot->base_gfn + slot->npages - 1)) > continue; > > Not only does this not work, the host will sometimes oops: > > [ 808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [ 808.555065] #PF: supervisor read access in kernel mode > [ 808.565326] #PF: error_code(0x0000) - not-present page > [ 808.575588] PGD 0 P4D 0 > [ 808.580649] Oops: 0000 [#1] SMP PTI > [ 808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4 > [ 808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 > [ 808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm] > [ 808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d > [ 808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202 > [ 808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014 > [ 808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004 > [ 808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000 > [ 808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260 > [ 808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004 > [ 
808.747620] FS: 00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000 > [ 808.763776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0 > [ 808.789499] Call Trace: > [ 808.794399] drop_spte+0x77/0xa0 [kvm] > [ 808.801885] mmu_page_zap_pte+0xac/0xe0 [kvm] > [ 808.810587] __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm] > [ 808.821196] kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm] > [ 808.833881] kvm_page_track_flush_slot+0x55/0x80 [kvm] > [ 808.844140] __kvm_set_memory_region+0x821/0xaa0 [kvm] > [ 808.854402] kvm_set_memory_region+0x26/0x40 [kvm] > [ 808.863971] kvm_vm_ioctl+0x59a/0x940 [kvm] > [ 808.872318] ? pagevec_lru_move_fn+0xb8/0xd0 > [ 808.880846] ? __seccomp_filter+0x7a/0x680 > [ 808.889028] do_vfs_ioctl+0xa4/0x630 > [ 808.896168] ? security_file_ioctl+0x32/0x50 > [ 808.904695] ksys_ioctl+0x60/0x90 > [ 808.911316] __x64_sys_ioctl+0x16/0x20 > [ 808.918807] do_syscall_64+0x5f/0x1a0 > [ 808.926121] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 808.936209] RIP: 0033:0x7f28ebf2b0fb > > Does this suggest something is still fundamentally wrong with the > premise of this change or have I done something stupid? Seems the latter, particularly your comment that we're looking for pages pointing to the gfn range to be removed, not just those in the range. Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or c0000-c0000, zapping zero or c0000, and I think one of the ones you were looking for c1080-c1083 is reduced to c1000-c1000 and therefore zaps sp->gfn c1000. I'll keep looking. Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-21 19:35 ` Alex Williamson @ 2019-08-21 20:30 ` Sean Christopherson 2019-08-23 2:25 ` Sean Christopherson 0 siblings, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2019-08-21 20:30 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote: > On Wed, 21 Aug 2019 13:08:59 -0600 > Alex Williamson <alex.williamson@redhat.com> wrote: > > Does this suggests something is still fundamentally wrong with the > > premise of this change or have I done something stupid? > > Seems the latter, particularly your comment that we're looking for > pages pointing to the gfn range to be removed, not just those in the > range. Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or > c0000-c0000, zapping zero or c0000, and I think one of the ones you > were looking for c1080-c1083 is reduce to c1000-c1000 and therefore > zaps sp->gfn c1000. I'll keep looking. Thanks, Ya. As far as where to look, at this point I don't think it's an issue of incorrect zapping. Not because I'm 100% confident the zapping logic is correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and not being able to exclude APIC/IOAPIC ranges, suggest that the badness is 'fixed' by zapping seemingly unrelated sps. In other words, it may be fundamentally wrong to zap only the memslot being removed, but I really want to know why. History isn't helpful as KVM has always zapped all pages when removing a memslot (on x86), and the introduction of the per-memslot flush hook in commit 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow") was all about refactoring generic code, and doesn't have any information on whether per-memslot flushing was actually tried for x86. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-21 20:30 ` Sean Christopherson @ 2019-08-23 2:25 ` Sean Christopherson 2019-08-23 22:05 ` Alex Williamson 0 siblings, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2019-08-23 2:25 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Wed, Aug 21, 2019 at 01:30:41PM -0700, Sean Christopherson wrote: > On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote: > > On Wed, 21 Aug 2019 13:08:59 -0600 > > Alex Williamson <alex.williamson@redhat.com> wrote: > > > Does this suggests something is still fundamentally wrong with the > > > premise of this change or have I done something stupid? > > > > Seems the latter, particularly your comment that we're looking for > > pages pointing to the gfn range to be removed, not just those in the > > range. Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or > > c0000-c0000, zapping zero or c0000, and I think one of the ones you > > were looking for c1080-c1083 is reduce to c1000-c1000 and therefore > > zaps sp->gfn c1000. I'll keep looking. Thanks, > > Ya. As far as where to look, at this point I don't think it's an issue of > incorrect zapping. Not because I'm 100% confident the zapping logic is > correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and > not being able to exclude APIC/IOAPIC ranges, suggest that the badness is > 'fixed' by zapping seemingly unrelated sps. > > In other words, it may be fundamentally wrong to zap only the memslot > being removed, but I really want to know why. History isn't helpful as > KVM has always zapped all pages when removing a memslot (on x86), and the > introduction of the per-memslot flush hook in commit > > 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow") > > was all about refactoring generic code, and doesn't have any information > on whether per-memslot flushing was actually tried for x86. 
One semi-random idea would be to zap mmio pages, i.e. don't skip pages for which sp->mmio_cached is true, regardless of their gfn or level. I don't expect it to make a difference, but it would shrink the haystack on the off chance it does "fix" the issues. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-23 2:25 ` Sean Christopherson @ 2019-08-23 22:05 ` Alex Williamson 0 siblings, 0 replies; 38+ messages in thread From: Alex Williamson @ 2019-08-23 22:05 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Thu, 22 Aug 2019 19:25:02 -0700 Sean Christopherson <sean.j.christopherson@intel.com> wrote: > On Wed, Aug 21, 2019 at 01:30:41PM -0700, Sean Christopherson wrote: > > On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote: > > > On Wed, 21 Aug 2019 13:08:59 -0600 > > > Alex Williamson <alex.williamson@redhat.com> wrote: > > > > Does this suggests something is still fundamentally wrong with the > > > > premise of this change or have I done something stupid? > > > > > > Seems the latter, particularly your comment that we're looking for > > > pages pointing to the gfn range to be removed, not just those in the > > > range. Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or > > > c0000-c0000, zapping zero or c0000, and I think one of the ones you > > > were looking for c1080-c1083 is reduce to c1000-c1000 and therefore > > > zaps sp->gfn c1000. I'll keep looking. Thanks, > > > > Ya. As far as where to look, at this point I don't think it's an issue of > > incorrect zapping. Not because I'm 100% confident the zapping logic is > > correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and > > not being able to exclude APIC/IOAPIC ranges, suggest that the badness is > > 'fixed' by zapping seemingly unrelated sps. > > > > In other words, it may be fundamentally wrong to zap only the memslot > > being removed, but I really want to know why. 
History isn't helpful as > > KVM has always zapped all pages when removing a memslot (on x86), and the > > introduction of the per-memslot flush hook in commit > > > > 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow") > > > > was all about refactoring generic code, and doesn't have any information > > on whether per-memslot flushing was actually tried for x86. > > One semi-random idea would be to zap mmio pages, i.e. don't skip pages > for which sp->mmio_cached is true, regardless of their gfn or level. I > don't expect it to make a difference, but it would shrink the haystack on > the off change it does "fix" the issues. You're right, it doesn't fix it. All of the logging I've been staring at suggests your patch does exactly what it's intended to do, but it still breaks GPU assignment in weird ways. I have no idea why. Thanks, Alex ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-21 19:08 ` Alex Williamson 2019-08-21 19:35 ` Alex Williamson @ 2019-08-21 20:10 ` Sean Christopherson 2019-08-26 7:36 ` Tian, Kevin 2019-08-26 14:56 ` Sean Christopherson 1 sibling, 2 replies; 38+ messages in thread From: Sean Christopherson @ 2019-08-21 20:10 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Wed, Aug 21, 2019 at 01:08:59PM -0600, Alex Williamson wrote: > On Tue, 20 Aug 2019 14:02:45 -0700 > Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > > Hrm, I'm back to being completely flummoxed. > > > > Would you be able to generate a trace of all events/kvmmmu, using the > > latest patch? I'd like to rule out a stupid code bug if it's not too > > much trouble. > > I tried to simplify the patch, making it closer to zap_all, so I > removed the max_level calculation and exclusion based on s->role.level, > as well as the gfn range filtering. For good measure I even removed > the sp->root_count test, so any sp not marked invalid is zapped. This > works, and I can also add back the sp->root_count test and things > remain working. The sp->root_count test is necessary to avoid an infinite loop when not using tdp (e.g. EPT). The root sp can't be moved off active_mmu_pages and thus freed until all references to the root are gone. When not using tdp, the loop can restart in response to __kvm_mmu_prepare_zap_page(), even if the root sp page was already invalid. Functionally, skipping an invalid sp *shouldn't* change anything, as the actual ptes have been zapped; it's only the kernel resources that aren't yet freed. TL;DR: completely expected that removing the sp->root_count test didn't affect the result. > From there I added back the gfn range test, but I left out the gfn_mask > because I'm not doing the level filtering and I think(?) this is just > another optimization. 
So essentially I only add: > > if (sp->gfn < slot->base_gfn || > sp->gfn > (slot->base_gfn + slot->npages - 1)) > continue; This doesn't work because of what sp->gfn contains. The shadow page is the page/table holding the spt entries, e.g. a level 1 sp refers to a page table for entries defining 4k pages. sp->gfn holds the base gfn of the range of gfns covered by the sp, which allows finding all parent sps for a given gfn. E.g. for a level 1 sp, sp->gfn is found by masking the gfn of a given page by 0xFFFFFFFFFFFFFE00, a level 2 sp by 0xFFFFFFFFFFFC0000, and so on up the chain. This is what the original code botched, as it would only find all sps for a memslot if the base of the memslot were 2mb aligned for 4k pages, 1gb aligned for 2mb pages, etc... This just happened to mostly work since memslots are usually aligned to such requirements. In the original code, removing the "sp->gfn != gfn" check caused KVM to zap random sps that just happened to hash to the same entry. Likewise, omitting the gfn filter in this code means everything gets zapped. What I don't understand is why zapping everything, or at least userspace MMIO addresses, is necessary when removing the GPU BAR memslot. The IOAPIC sp in particular makes no sense. AIUI, the IOAPIC is always emulated, i.e. not passed through, and its base/size is static (or at least static after init). Zapping an IOAPIC sp will send KVM down different paths but AFAIK the final outcome should be unaffected. 
> Not only does this not work, the host will sometimes oops: > > [ 808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [ 808.555065] #PF: supervisor read access in kernel mode > [ 808.565326] #PF: error_code(0x0000) - not-present page > [ 808.575588] PGD 0 P4D 0 > [ 808.580649] Oops: 0000 [#1] SMP PTI > [ 808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4 > [ 808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 > [ 808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm] > [ 808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d > [ 808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202 > [ 808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014 > [ 808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004 > [ 808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000 > [ 808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260 > [ 808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004 > [ 808.747620] FS: 00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000 > [ 808.763776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0 > [ 808.789499] Call Trace: > [ 808.794399] drop_spte+0x77/0xa0 [kvm] > [ 808.801885] mmu_page_zap_pte+0xac/0xe0 [kvm] > [ 808.810587] __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm] > [ 808.821196] kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm] > [ 808.833881] kvm_page_track_flush_slot+0x55/0x80 [kvm] > [ 808.844140] __kvm_set_memory_region+0x821/0xaa0 [kvm] > [ 808.854402] kvm_set_memory_region+0x26/0x40 [kvm] > [ 808.863971] kvm_vm_ioctl+0x59a/0x940 [kvm] > [ 808.872318] ? 
pagevec_lru_move_fn+0xb8/0xd0 > [ 808.880846] ? __seccomp_filter+0x7a/0x680 > [ 808.889028] do_vfs_ioctl+0xa4/0x630 > [ 808.896168] ? security_file_ioctl+0x32/0x50 > [ 808.904695] ksys_ioctl+0x60/0x90 > [ 808.911316] __x64_sys_ioctl+0x16/0x20 > [ 808.918807] do_syscall_64+0x5f/0x1a0 > [ 808.926121] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 808.936209] RIP: 0033:0x7f28ebf2b0fb > > Does this suggests something is still fundamentally wrong with the > premise of this change or have I done something stupid? Thanks, The NULL pointer thing is unexpected, it means we have a spte, i.e. the actual entry seen/used by hardware, that KVM thinks is present but doesn't have the expected KVM tracking. I'll take a look, my understanding is that zapping shadow pages at random shouldn't cause problems. ^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-21 20:10 ` Sean Christopherson @ 2019-08-26 7:36 ` Tian, Kevin 2019-08-26 14:56 ` Sean Christopherson 1 sibling, 0 replies; 38+ messages in thread From: Tian, Kevin @ 2019-08-26 7:36 UTC (permalink / raw) To: Christopherson, Sean J, Alex Williamson Cc: Paolo Bonzini, Radim Krcmár, kvm, Xiao Guangrong > From: Sean Christopherson > Sent: Thursday, August 22, 2019 4:11 AM > [...] > > > From there I added back the gfn range test, but I left out the gfn_mask > > because I'm not doing the level filtering and I think(?) this is just > > another optimization. So essentially I only add: > > > > if (sp->gfn < slot->base_gfn || > > sp->gfn > (slot->base_gfn + slot->npages - 1)) > > continue; > > This doesn't work because of what sp->gfn contains. The shadow page is > the page/table holding the spt entries, e.g. a level 1 sp refers to a page > table for entries defining 4k pages. sp->gfn holds the base gfn of the > range of gfns covered by the sp, which allows finding all parent sps for a > given gfn. E.g. for level 1 sp, sp->gfn is found by masking the gfn of a > given page by 0xFFFFFFFFFFFFFE00, level 2 sp by 0xFFFFFFFFFFFC0000, and so > on up the chain. > > This is what the original code botched, as it would only find all sps for > a memslot if the base of the memslot were 2mb aligned for 4k pages, 1gb > aligned for 2mb pages, etc... This just happened to mostly work since > memslots are usually aligned to such requirements. In the original code, > removing the "sp->gfn != gfn" check caused KVM to zap random sps that just > happened to hash to the same entry. Likewise, omitting the gfn filter in > this code means everything gets zapped. > > What I don't understand is why zapping everything, or at least userspace > MMIO addresses, is necessary when removing the GPU BAR memslot. The > IOAPIC sp in particular makes no sense. AIUI, the IOAPIC is always > emulated, i.e. 
not passed through, and its base/size is static (or at > least static after init). Zapping an IOAPIC sp will send KVM down > different paths but AFAIK the final outcome should be unaffected. > I have the same feeling. The base address of the IOAPIC is described by the ACPI MADT and is thus static. It is possible to reprogram the LAPIC base but no commodity OS does it today. btw, I found Alex mentioned gfn 0x100000 in an earlier post. This one is usually the start of high memory. If true, it becomes even more weird, not just an MMIO thing. Thanks Kevin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-21 20:10 ` Sean Christopherson 2019-08-26 7:36 ` Tian, Kevin @ 2019-08-26 14:56 ` Sean Christopherson 1 sibling, 0 replies; 38+ messages in thread From: Sean Christopherson @ 2019-08-26 14:56 UTC (permalink / raw) To: Alex Williamson Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong On Wed, Aug 21, 2019 at 01:10:43PM -0700, Sean Christopherson wrote: > On Wed, Aug 21, 2019 at 01:08:59PM -0600, Alex Williamson wrote: > > Not only does this not work, the host will sometimes oops: > > > > [ 808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000 > > [ 808.555065] #PF: supervisor read access in kernel mode > > [ 808.565326] #PF: error_code(0x0000) - not-present page > > [ 808.575588] PGD 0 P4D 0 > > [ 808.580649] Oops: 0000 [#1] SMP PTI > > [ 808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4 > > [ 808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 > > [ 808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm] > > [ 808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d > > [ 808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202 > > [ 808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014 > > [ 808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004 > > [ 808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000 > > [ 808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260 > > [ 808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004 > > [ 808.747620] FS: 00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000 > > [ 808.763776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 808.775249] CR2: 
0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0 > > [ 808.789499] Call Trace: > > [ 808.794399] drop_spte+0x77/0xa0 [kvm] > > [ 808.801885] mmu_page_zap_pte+0xac/0xe0 [kvm] > > [ 808.810587] __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm] > > [ 808.821196] kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm] > > [ 808.833881] kvm_page_track_flush_slot+0x55/0x80 [kvm] > > [ 808.844140] __kvm_set_memory_region+0x821/0xaa0 [kvm] > > [ 808.854402] kvm_set_memory_region+0x26/0x40 [kvm] > > [ 808.863971] kvm_vm_ioctl+0x59a/0x940 [kvm] > > [ 808.872318] ? pagevec_lru_move_fn+0xb8/0xd0 > > [ 808.880846] ? __seccomp_filter+0x7a/0x680 > > [ 808.889028] do_vfs_ioctl+0xa4/0x630 > > [ 808.896168] ? security_file_ioctl+0x32/0x50 > > [ 808.904695] ksys_ioctl+0x60/0x90 > > [ 808.911316] __x64_sys_ioctl+0x16/0x20 > > [ 808.918807] do_syscall_64+0x5f/0x1a0 > > [ 808.926121] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [ 808.936209] RIP: 0033:0x7f28ebf2b0fb > > > > Does this suggests something is still fundamentally wrong with the > > premise of this change or have I done something stupid? Thanks, > > The NULL pointer thing is unexpected, it means we have a spte, i.e. the > actual entry seen/used by hardware, that KVM thinks is present but doesn't > have the expected KVM tracking. I'll take a look, my understanding is > that zapping shadow pages at random shouldn't cause problems. The NULL pointer dereference is expected given the flawed implementation, i.e. there isn't another bug lurking for that particular problem. The issue isn't zapping random sptes, but rather that the flawed logic leaves dangling sptes. When a different action, e.g. zapping all memslots, triggers zapping of the dangling spte(s), gfn_to_rmap() attempts to find the corresponding memslot and hits the above BUG because the memslot no longer exists. On the flip side, not hitting that condition provides additional confidence in the reworked flow, i.e. 
proves to some degree that it's zapping all sptes in the to-be-removed memslot. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2019-08-20 20:03 ` Sean Christopherson 2019-08-20 20:42 ` Alex Williamson @ 2020-06-26 17:32 ` Sean Christopherson 2022-10-20 18:31 ` Alexander Graf 1 sibling, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2020-06-26 17:32 UTC (permalink / raw) To: Paolo Bonzini Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong /cast <thread necromancy> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: > On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote: > > It seems like the problem occurs when the sp->gfn you "continue over" > > includes a userspace-MMIO gfn. But since I have no better ideas right > > now, I'm going to apply the revert (we don't know for sure that it only > > happens with assigned devices). > > After many hours of staring, I've come to the conclusion that > kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e. > a revert is definitely in order for 5.3 and stable. Whelp, past me was wrong. The patch wasn't completely broken, as the rmap zapping piece of kvm_mmu_invalidate_zap_pages_in_memslot() was sufficient to satisfy removal of the memslot. I.e. zapping leaf PTEs (including large pages) should prevent referencing the old memslot, the fact that zapping upper level shadow pages was broken was irrelevant because there should be no need to zap non-leaf PTEs. The one thing that sticks out as potentially concerning is passing %false for @lock_flush_tlb. Dropping mmu_lock during slot_handle_level_range() without flushing would allow a vCPU to create and use a new entry while a different vCPU has the old entry cached in its TLB. I think that could even happen with a single vCPU if the memslot is deleted by a helper task, and the zapped page was a large page that was fractured into small pages when inserted into the TLB. 
TL;DR: Assuming no other bugs in the kernel, this should be sufficient if the goal is simply to prevent usage of a memslot:

static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
			struct kvm_memory_slot *slot,
			struct kvm_page_track_notifier_node *node)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, true);
	if (flush)
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

> mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of > the shadow ptes, e.g. for_each_valid_sp() will find page tables, page > directories, etc... Passing in the raw gfns of the memslot doesn't work > because the gfn isn't properly adjusted/aligned to match how KVM tracks > gfns for shadow pages, e.g. removal of the companion audio memslot that > occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find > the shadow page table containing the relevant sptes. > > This is why Paolo's suggestion to remove slot_handle_all_level() on > kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's > accounting without actually zapping the relevant sptes. This is just straight up wrong; zapping the rmaps does zap the leaf sptes. The BUG() occurs because gfn_to_rmap() works on the _new_ memslots instance, and if a memslot is being deleted there is no memslot for the gfn, hence the NULL pointer bug when mmu_page_zap_pte() attempts to zap a PTE. Zapping the rmaps (leaf/last PTEs) first "fixes" the issue by making it so that mmu_page_zap_pte() will never see a present PTE for a non-existent memslot. I don't think any of this explains the pass-through GPU issue. But, we have a few use cases where zapping the entire MMU is undesirable, so I'm going to retry upstreaming this patch with a per-VM opt-in. I wanted to set the record straight for posterity before doing so. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2020-06-26 17:32 ` Sean Christopherson @ 2022-10-20 18:31 ` Alexander Graf 2022-10-20 20:37 ` Sean Christopherson 0 siblings, 1 reply; 38+ messages in thread From: Alexander Graf @ 2022-10-20 18:31 UTC (permalink / raw) To: Sean Christopherson Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini On 26.06.20 19:32, Sean Christopherson wrote: > /cast <thread necromancy> > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: [...] > I don't think any of this explains the pass-through GPU issue. But, we > have a few use cases where zapping the entire MMU is undesirable, so I'm > going to retry upstreaming this patch as with per-VM opt-in. I wanted to > set the record straight for posterity before doing so. Hey Sean, Did you ever get around to upstream or rework the zap optimization? The way I read current upstream, a memslot change still always wipes all SPTEs, not only the ones that were changed. Thanks, Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2022-10-20 18:31 ` Alexander Graf @ 2022-10-20 20:37 ` Sean Christopherson 2022-10-20 21:06 ` Alexander Graf 0 siblings, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2022-10-20 20:37 UTC (permalink / raw) To: Alexander Graf Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini On Thu, Oct 20, 2022, Alexander Graf wrote: > On 26.06.20 19:32, Sean Christopherson wrote: > > /cast <thread necromancy> > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: > > [...] > > > I don't think any of this explains the pass-through GPU issue. But, we > > have a few use cases where zapping the entire MMU is undesirable, so I'm > > going to retry upstreaming this patch as with per-VM opt-in. I wanted to > > set the record straight for posterity before doing so. > > Hey Sean, > > Did you ever get around to upstream or rework the zap optimization? The way > I read current upstream, a memslot change still always wipes all SPTEs, not > only the ones that were changed. Nope, I've more or less given up hope on zapping only the deleted/moved memslot. TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very much a special case. Do you have a use case and/or issue that doesn't play nice with the "zap all" behavior? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2022-10-20 20:37 ` Sean Christopherson @ 2022-10-20 21:06 ` Alexander Graf 2022-10-21 19:40 ` Sean Christopherson 0 siblings, 1 reply; 38+ messages in thread From: Alexander Graf @ 2022-10-20 21:06 UTC (permalink / raw) To: Sean Christopherson Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini On 20.10.22 22:37, Sean Christopherson wrote: > On Thu, Oct 20, 2022, Alexander Graf wrote: >> On 26.06.20 19:32, Sean Christopherson wrote: >>> /cast <thread necromancy> >>> >>> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: >> [...] >> >>> I don't think any of this explains the pass-through GPU issue. But, we >>> have a few use cases where zapping the entire MMU is undesirable, so I'm >>> going to retry upstreaming this patch as with per-VM opt-in. I wanted to >>> set the record straight for posterity before doing so. >> Hey Sean, >> >> Did you ever get around to upstream or rework the zap optimization? The way >> I read current upstream, a memslot change still always wipes all SPTEs, not >> only the ones that were changed. > Nope, I've more or less given up hope on zapping only the deleted/moved memslot. > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very > much a special case. > > Do you have use case and/or issue that doesn't play nice with the "zap all" behavior? Yeah, we're looking at adding support for the Hyper-V VSM extensions which Windows uses to implement Credential Guard. With that, the guest gets access to hypercalls that allow it to set reduced permissions for arbitrary gfns. To ensure that user space has full visibility into those for live migration, memory slots to model access would be a great fit. But it means we'd do ~100k memslot modifications on boot. Alex Amazon Development Center Germany GmbH Krausenstr. 
38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot 2022-10-20 21:06 ` Alexander Graf @ 2022-10-21 19:40 ` Sean Christopherson 2022-10-24 6:12 ` Alexander Graf 0 siblings, 1 reply; 38+ messages in thread From: Sean Christopherson @ 2022-10-21 19:40 UTC (permalink / raw) To: Alexander Graf Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini On Thu, Oct 20, 2022, Alexander Graf wrote: > > On 20.10.22 22:37, Sean Christopherson wrote: > > On Thu, Oct 20, 2022, Alexander Graf wrote: > > > On 26.06.20 19:32, Sean Christopherson wrote: > > > > /cast <thread necromancy> > > > > > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote: > > > [...] > > > > > > > I don't think any of this explains the pass-through GPU issue. But, we > > > > have a few use cases where zapping the entire MMU is undesirable, so I'm > > > > going to retry upstreaming this patch as with per-VM opt-in. I wanted to > > > > set the record straight for posterity before doing so. > > > Hey Sean, > > > > > > Did you ever get around to upstream or rework the zap optimization? The way > > > I read current upstream, a memslot change still always wipes all SPTEs, not > > > only the ones that were changed. > > Nope, I've more or less given up hope on zapping only the deleted/moved memslot. > > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very > > much a special case. > > > > Do you have use case and/or issue that doesn't play nice with the "zap all" behavior? > > > Yeah, we're looking at adding support for the Hyper-V VSM extensions which > Windows uses to implement Credential Guard. With that, the guest gets access > to hypercalls that allow it to set reduced permissions for arbitrary gfns. > To ensure that user space has full visibility into those for live migration, > memory slots to model access would be a great fit. But it means we'd do > ~100k memslot modifications on boot. 
Oof. 100k memslot updates is going to be painful irrespective of flushing. And memslots (in their current form) won't work if the guest can drop executable permissions. Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve the "KVM flushes everything on memslot deletion" problem, I think we should instead properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing userspace to delete the memslot. Commit 75d61fbcf563 ("KVM: set_memory_region: Disallow changing read-only attribute later") was just a quick-and-dirty fix; there's no fundamental problem that makes it impossible (or even all that difficult) to support toggling permissions. The ABI would be that KVM only guarantees the new permissions take effect when the ioctl() returns, i.e. KVM doesn't need to ensure there are no writable SPTEs when the memslot is installed, just that there are no writable SPTEs before userspace regains control. E.g. sans sanity checking and whatnot, I think x86 support would be something like:

@@ -12669,9 +12667,16 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
	 * MOVE/DELETE: The old mappings will already have been cleaned up by
	 * kvm_arch_flush_shadow_memslot().
	 */
-	if ((change != KVM_MR_FLAGS_ONLY) || (new_flags & KVM_MEM_READONLY))
+	if (change != KVM_MR_FLAGS_ONLY)
 		return;
 
+	if ((old_flags ^ new_flags) & KVM_MEM_READONLY) {
+		if ((new_flags & KVM_MEM_READONLY) &&
+		    kvm_mmu_slot_write_protect(kvm, new))
+			kvm_arch_flush_remote_tlbs_memslot(kvm, new);
+		return;
+	}
+
 	/*
	 * READONLY and non-flags changes were filtered out above, and the only
	 * other flag is LOG_DIRTY_PAGES, i.e. something is wrong if dirty

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2022-10-21 19:40 ` Sean Christopherson
@ 2022-10-24  6:12 ` Alexander Graf
  2022-10-24 15:55 ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Graf @ 2022-10-24  6:12 UTC (permalink / raw)
To: Sean Christopherson
Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong,
	Chandrasekaran, Siddharth, Paolo Bonzini

Hey Sean,

On 21.10.22 21:40, Sean Christopherson wrote:
>
> On Thu, Oct 20, 2022, Alexander Graf wrote:
>> On 20.10.22 22:37, Sean Christopherson wrote:
>>> On Thu, Oct 20, 2022, Alexander Graf wrote:
>>>> On 26.06.20 19:32, Sean Christopherson wrote:
>>>>> /cast <thread necromancy>
>>>>>
>>>>> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
>>>> [...]
>>>>
>>>>> I don't think any of this explains the pass-through GPU issue.  But, we
>>>>> have a few use cases where zapping the entire MMU is undesirable, so I'm
>>>>> going to retry upstreaming this patch with per-VM opt-in.  I wanted to
>>>>> set the record straight for posterity before doing so.
>>>> Hey Sean,
>>>>
>>>> Did you ever get around to upstreaming or reworking the zap optimization?
>>>> The way I read current upstream, a memslot change still always wipes all
>>>> SPTEs, not only the ones that were changed.
>>> Nope, I've more or less given up hope on zapping only the deleted/moved
>>> memslot.  TDX (and SNP?) will preserve SPTEs for guest private memory, but
>>> they're very much a special case.
>>>
>>> Do you have a use case and/or issue that doesn't play nice with the
>>> "zap all" behavior?
>>
>> Yeah, we're looking at adding support for the Hyper-V VSM extensions, which
>> Windows uses to implement Credential Guard.  With that, the guest gets access
>> to hypercalls that allow it to set reduced permissions for arbitrary gfns.
>> To ensure that user space has full visibility into those for live migration,
>> memory slots to model access would be a great fit.
>> But it means we'd do
>> ~100k memslot modifications on boot.
> Oof.  100k memslot updates is going to be painful irrespective of flushing.
> And memslots (in their current form) won't work if the guest can drop
> executable permissions.
>
> Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to
> solve the "KVM flushes everything on memslot deletion" problem, I think we
> should instead properly support toggling KVM_MEM_READONLY (and
> KVM_MEM_NO_EXEC) without forcing userspace to delete the memslot.  Commit
> 75d61fbcf563 ("KVM: set_memory_region:

That would be a cute acceleration for the case where we have to change
permissions for a full slot.  Unfortunately, the bulk of the changes are slot
splits.  Let me explain with numbers from a 1 vcpu, 8GB Windows Server 2019
boot:

  GFN permission modification requests: 46294
  Unique GFNs: 21200

That means on boot, we start off with a few huge memslots for guest RAM.  Then
down the road, we need to change permissions for individual pages inside these
larger regions.  The obvious option for that is a memslot split - delete,
create, create, create.  Now we have 2 large memslots and 1 that only spans a
single page.

Later in the boot process, Windows then sometimes also toggles permissions for
pages that it already split off earlier.  That's the case we can optimize with
the modify optimization you described in the previous email.  But that's only
about half the requests.  The other half are memslot split requests.

We already built a prototype implementation of an atomic memslot update ioctl
that allows us to keep other vCPUs running while we do the
delete/create/create/create operation.  But even with that, we see up to 30 min
boot times for larger guests that spend most of the time stuck zapping pages.
I guess we have 2 options to make this viable:

  1) Optimize memslot splits + modifications to a point where they're fast
     enough
  2) Add a different, faster mechanism on top of memslots for page-granular
     permission bits

Also sorry for not posting the underlying credguard and atomic memslot patches
yet.  I wanted to kick off this conversation before sending them out - they're
still too raw for upstream review atm :).

Thanks,

Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2022-10-24  6:12 ` Alexander Graf
@ 2022-10-24 15:55 ` Sean Christopherson
  0 siblings, 0 replies; 38+ messages in thread
From: Sean Christopherson @ 2022-10-24 15:55 UTC (permalink / raw)
To: Alexander Graf
Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong,
	Chandrasekaran, Siddharth, Paolo Bonzini

On Mon, Oct 24, 2022, Alexander Graf wrote:
> Hey Sean,
>
> On 21.10.22 21:40, Sean Christopherson wrote:
> >
> > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > On 20.10.22 22:37, Sean Christopherson wrote:
> > > > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > > > On 26.06.20 19:32, Sean Christopherson wrote:
> > > > > > /cast <thread necromancy>
> > > > > >
> > > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> > > > > [...]
> > > > >
> > > > > > I don't think any of this explains the pass-through GPU issue.  But, we
> > > > > > have a few use cases where zapping the entire MMU is undesirable, so I'm
> > > > > > going to retry upstreaming this patch with per-VM opt-in.  I wanted to
> > > > > > set the record straight for posterity before doing so.
> > > > > Hey Sean,
> > > > >
> > > > > Did you ever get around to upstreaming or reworking the zap optimization?
> > > > > The way I read current upstream, a memslot change still always wipes all
> > > > > SPTEs, not only the ones that were changed.
> > > > Nope, I've more or less given up hope on zapping only the deleted/moved
> > > > memslot.  TDX (and SNP?) will preserve SPTEs for guest private memory, but
> > > > they're very much a special case.
> > > >
> > > > Do you have a use case and/or issue that doesn't play nice with the
> > > > "zap all" behavior?
> > >
> > > Yeah, we're looking at adding support for the Hyper-V VSM extensions, which
> > > Windows uses to implement Credential Guard.  With that, the guest gets access
> > > to hypercalls that allow it to set reduced permissions for arbitrary gfns.
> > > To ensure that user space has full visibility into those for live migration,
> > > memory slots to model access would be a great fit.  But it means we'd do
> > > ~100k memslot modifications on boot.
> > Oof.  100k memslot updates is going to be painful irrespective of flushing.
> > And memslots (in their current form) won't work if the guest can drop
> > executable permissions.
> >
> > Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to
> > solve the "KVM flushes everything on memslot deletion" problem, I think we
> > should instead properly support toggling KVM_MEM_READONLY (and
> > KVM_MEM_NO_EXEC) without forcing userspace to delete the memslot.  Commit
> > 75d61fbcf563 ("KVM: set_memory_region:
>
> That would be a cute acceleration for the case where we have to change
> permissions for a full slot.  Unfortunately, the bulk of the changes are slot
> splits.

Ah, right, the guest will be operating on per-page granularity.

> We already built a prototype implementation of an atomic memslot update
> ioctl that allows us to keep other vCPUs running while we do the
> delete/create/create/create operation.

Please weigh in with your use case on a relevant upstream discussion regarding
"atomic" memslot updates[*].  I suspect we'll end up with a different solution
for this use case (see below), but we should at least capture all potential use
cases and ideas for modifying memslots without pausing vCPUs.

[*] https://lore.kernel.org/all/20220909104506.738478-1-eesposit@redhat.com

> But even with that, we see up to 30 min boot times for larger guests that
> spend most of the time stuck zapping pages.

Out of curiosity, did you measure runtime performance?  I would expect some
amount of runtime overhead as well due to fragmenting memslots to that degree.
> I guess we have 2 options to make this viable:
>
>   1) Optimize memslot splits + modifications to a point where they're fast
>      enough
>   2) Add a different, faster mechanism on top of memslots for page-granular
>      permission bits

#2 crossed my mind as well.  This is actually nearly identical to the
confidential VM use case, where KVM needs to handle guest-initiated conversions
of memory between "private" and "shared" on a per-page granularity.

The proposed solution for that is indeed a layer on top of memslots[*], which
we arrived at in no small part because splitting memslots was going to be a
bottleneck.  Extending the proposed mem_attr_array to support additional state
should be quite easy.  The framework is all there, KVM just needs a few extra
flags values, e.g.

  KVM_MEM_ATTR_SHARED	BIT(0)
  KVM_MEM_ATTR_READONLY	BIT(1)
  KVM_MEM_ATTR_NOEXEC	BIT(2)

and then new ioctls to expose the functionality to userspace.  Actually, if we
want to go this route, it might even make sense to define a new generic
MEM_ATTR ioctl() right away instead of repurposing
KVM_MEMORY_ENCRYPT_(UN)REG_REGION for the private vs. shared use case.

[*] https://lore.kernel.org/all/20220915142913.2213336-6-chao.p.peng@linux.intel.com
end of thread, other threads:[~2022-10-24 17:24 UTC | newest]

Thread overview: 38+ messages
2019-02-05 20:54 [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Sean Christopherson
2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson
2019-02-06  9:12 ` Cornelia Huck
2019-02-12 12:36 ` [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Paolo Bonzini
[not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com>
2019-08-13 16:04 ` [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot Alex Williamson
2019-08-13 17:04 ` Sean Christopherson
2019-08-13 17:57 ` Alex Williamson
2019-08-13 19:33 ` Alex Williamson
2019-08-13 20:19 ` Sean Christopherson
2019-08-13 20:37 ` Paolo Bonzini
2019-08-13 21:14 ` Alex Williamson
2019-08-13 21:15 ` Paolo Bonzini
2019-08-13 22:10 ` Alex Williamson
2019-08-15 14:46 ` Sean Christopherson
2019-08-15 15:23 ` Alex Williamson
2019-08-15 16:00 ` Sean Christopherson
2019-08-15 18:16 ` Alex Williamson
2019-08-15 19:25 ` Sean Christopherson
2019-08-15 20:11 ` Alex Williamson
2019-08-19 16:03 ` Paolo Bonzini
2019-08-20 20:03 ` Sean Christopherson
2019-08-20 20:42 ` Alex Williamson
2019-08-20 21:02 ` Sean Christopherson
2019-08-21 19:08 ` Alex Williamson
2019-08-21 19:35 ` Alex Williamson
2019-08-21 20:30 ` Sean Christopherson
2019-08-23  2:25 ` Sean Christopherson
2019-08-23 22:05 ` Alex Williamson
2019-08-21 20:10 ` Sean Christopherson
2019-08-26  7:36 ` Tian, Kevin
2019-08-26 14:56 ` Sean Christopherson
2020-06-26 17:32 ` Sean Christopherson
2022-10-20 18:31 ` Alexander Graf
2022-10-20 20:37 ` Sean Christopherson
2022-10-20 21:06 ` Alexander Graf
2022-10-21 19:40 ` Sean Christopherson
2022-10-24  6:12 ` Alexander Graf
2022-10-24 15:55 ` Sean Christopherson