KVM Archive on lore.kernel.org
* [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism
@ 2019-02-05 20:54 Sean Christopherson
  2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Sean Christopherson @ 2019-02-05 20:54 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář,
	James Hogan, Paul Mackerras, Christian Borntraeger,
	Janosch Frank, Christoffer Dall, Marc Zyngier
  Cc: linux-s390, kvm, David Hildenbrand, Cornelia Huck, linux-mips,
	kvm-ppc, Xiao Guangrong, kvmarm, linux-arm-kernel

...and clean up the MMIO spte generation code.

=== Non-x86 Folks ===
Patch 01/27 (which you should also be cc'd on) changes the prototype for
a function that is defined by all architectures but only actually used
by x86.  Feel free to skip the rest of this cover letter.

=== x86 Folks ===
Though not explicitly stated, for all intents and purposes the fast
invalidate mechanism was added to speed up the scenario where removing
a memslot, e.g. when accessing PCI ROM, caused KVM to flush all shadow
entries[1].

The other use cases of "flush everything" are VM teardown and handling
MMIO generation overflow, neither of which is a performance-critical
path (see "Other Uses" below).

For the memslot case, zapping all shadow entries is overkill, i.e. KVM
only needs to zap the entries associated with the memslot, but KVM has
historically used a big hammer because if all you have is a hammer,
everything looks like a nail.

Rather than zap all pages when removing a memslot, zap only the shadow
entries associated with said memslot.  I see a performance improvement
of ~5% when zapping only the pages for the deleted memslot when using a
slightly modified version of the original benchmark[2][3][4] (I don't
have easy access to a system with hundreds of gigs of memory).

$ cat shell.sh
        #!/bin/sh

        echo 3 > /proc/sys/vm/drop_caches
        ./mmtest -c 8 -m 2000 -e ./rom.sh

I.e. running 8 threads with 2 GB of memory per thread; times in milliseconds:

Before: 89.117
After:  84.768
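
For reference, a condensed sketch of the memslot-scoped zap; the
function name is illustrative (the real handler is
kvm_mmu_invalidate_zap_pages_in_memslot(), discussed under patch 11
below), and the body is assembled from fragments quoted later in this
thread rather than copied verbatim from the patch:

	static void zap_memslot_pages_sketch(struct kvm *kvm,
					     struct kvm_memory_slot *slot)
	{
		struct kvm_mmu_page *sp;
		LIST_HEAD(invalid_list);
		unsigned long i;
		gfn_t gfn;

		spin_lock(&kvm->mmu_lock);
		for (i = 0; i < slot->npages; i++) {
			gfn = slot->base_gfn + i;

			/* Walk only the shadow pages hashed to this gfn. */
			for_each_valid_sp(kvm, sp, gfn) {
				if (sp->gfn != gfn)
					continue;

				kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
			}
		}
		kvm_mmu_commit_zap_page(kvm, &invalid_list);
		spin_unlock(&kvm->mmu_lock);
	}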


With the memslot use case gone, maintaining the fast invalidate path
adds a moderate amount of complexity but provides little to no value.
Furthermore, its existence may give the impression that it's "ok" to zap
all shadow pages.  Remove the fast invalidate mechanism to simplify the
code and to discourage future code from zapping all pages as using such
a big hammer should be a last resort.

Unfortunately, simply reverting the fast invalidate code re-introduces
shortcomings in the previous code, notably that KVM may hold a spinlock
and not schedule for an extended period of time.  Releasing the lock to
voluntarily reschedule is perfectly doable, and for the VM teardown case
of "zap everything" it can be added with no additional changes since KVM
doesn't need to provide any ordering guarantees about sptes in that case.
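
The reschedule pattern in question, quoted verbatim from the patch 11
discussion later in this thread (it sits inside the per-memslot zap
loop):

		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
			flush = false;
			cond_resched_lock(&kvm->mmu_lock);
		}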

For the MMIO case, KVM needs to prevent vCPUs from consuming stale sptes
created with an old generation number.  This should be a non-issue as
KVM is supposed to prevent consuming stale MMIO sptes regardless of the
zapping mechanism employed; releasing mmu_lock while zapping simply
makes it much more likely to hit any bug leading to stale spte usage.
As luck would have it, there are a number of corner cases in the MMIO
spte invalidation flow that could theoretically result in reusing stale
sptes.  So before reworking memslot zapping and reverting to the slow
invalidate mechanism, fix a number of bugs related to MMIO generations
and opportunistically clean up the memslots/MMIO generation code.
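
For context, the fault-path staleness check amounts to a generation
comparison; roughly, minus tracing (paraphrased from mmu.c of this era,
so treat the details as approximate):

	static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
	{
		u64 kvm_gen, spte_gen;

		kvm_gen = kvm_current_mmio_generation(vcpu);
		spte_gen = get_mmio_spte_generation(spte);

		/*
		 * A mismatch means the spte predates the last memslots
		 * update and must not be consumed.
		 */
		return likely(kvm_gen == spte_gen);
	}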


=== History ===
Flushing of shadow pages when removing a memslot was originally added
by commit 34d4cb8fca1f ("KVM: MMU: nuke shadowed pgtable pages and ptes
on memslot destruction"), and obviously emphasized functionality over
performance.  Commit 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")
added a path to allow flushing only the removed slot's shadow pages,
but x86 just redirected to the "zap all" flow.

Eventually, it became evident that zapping everything is slow, and so
x86 developed a more efficient hammer in the form of the fast invalidate
mechanism.

=== Other Uses ===
When a VM is being destroyed, either there are no active vcpus, i.e.
there's no lock contention, or the VM has ungracefully terminated, in
which case we want to reclaim its pages as quickly as possible, i.e.
not release the MMU lock if there are still CPUs executing in the VM.

The MMIO generation scenario is almost literally a one-in-a-million
occurrence, i.e. is not a performance sensitive scenario.

It's worth noting that prior to the "fast invalidate" series being
applied, there was actually another use case of kvm_mmu_zap_all() in
emulator_fix_hypercall() whose existence may have contributed to
improving the performance of "zap all" instead of avoiding it altogether.
But that usage was removed by the series itself in commit 758ccc89b83c
("KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall").

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
[2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
[3] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277.html
[4] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2

v1: https://patchwork.kernel.org/cover/10756579/
v2:
  - Fix a zap vs flush priority bug in kvm_mmu_remote_flush_or_zap() [me]
    (see the sketch after this list)
  - Voluntarily reschedule and/or drop mmu_lock as needed when zapping
    all sptes or all MMIO sptes [Paolo]
  - Add patches to clean up the memslots/MMIO generation code and fix
    a variety of theoretical corner case bugs
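
A rough sketch of the corrected zap vs flush priority, assuming (as the
fix implies) that committing zapped pages also flushes remote TLBs, so
a pending zap must win over a bare flush request.  This is an
approximation, not the exact upstream function:

	static void kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
						struct list_head *invalid_list,
						bool remote_flush)
	{
		/* Zapping wins: the commit path also flushes remote TLBs. */
		if (!list_empty(invalid_list))
			kvm_mmu_commit_zap_page(kvm, invalid_list);
		else if (remote_flush)
			kvm_flush_remote_tlbs(kvm);
	}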


Sean Christopherson (27):
  KVM: Call kvm_arch_memslots_updated() before updating memslots
  KVM: x86/mmu: Detect MMIO generation wrap in any address space
  KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
  KVM: Explicitly define the "memslot update in-progress" bit
  KVM: x86: Use a u64 when passing the MMIO gen around
  KVM: x86: Refactor the MMIO SPTE generation handling
  KVM: Remove the hack to trigger memslot generation wraparound
  KVM: Move the memslot update in-progress flag to bit 63
  KVM: x86/mmu: Move slot_level_*() helper functions up a few lines
  KVM: x86/mmu: Split remote_flush+zap case out of
    kvm_mmu_flush_or_zap()
  KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  Revert "KVM: MMU: document fast invalidate all pages"
  Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes"
  KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes
  KVM: x86/mmu: Remove is_obsolete() call
  Revert "KVM: MMU: reclaim the zapped-obsolete page first"
  Revert "KVM: MMU: collapse TLB flushes when zap all pages"
  Revert "KVM: MMU: zap pages in batch"
  Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages"
  Revert "KVM: MMU: show mmu_valid_gen in shadow page related
    tracepoints"
  Revert "KVM: x86: use the fast way to invalidate all pages"
  KVM: x86/mmu: skip over invalid root pages when zapping all sptes
  KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes
  Revert "KVM: MMU: fast invalidate all pages"
  KVM: x86/mmu: Differentiate between nr zapped and list unstable
  KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
  KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and
    kvm_mmu_zap_mmio_sptes()

 Documentation/virtual/kvm/mmu.txt   |  41 +--
 arch/mips/include/asm/kvm_host.h    |   2 +-
 arch/powerpc/include/asm/kvm_host.h |   2 +-
 arch/s390/include/asm/kvm_host.h    |   2 +-
 arch/x86/include/asm/kvm_host.h     |   9 +-
 arch/x86/kvm/mmu.c                  | 442 +++++++++++++---------------
 arch/x86/kvm/mmu.h                  |   1 -
 arch/x86/kvm/mmutrace.h             |  42 +--
 arch/x86/kvm/x86.c                  |   7 +-
 arch/x86/kvm/x86.h                  |   7 +-
 include/linux/kvm_host.h            |  23 +-
 virt/kvm/arm/mmu.c                  |   2 +-
 virt/kvm/kvm_main.c                 |  39 ++-
 13 files changed, 286 insertions(+), 333 deletions(-)

-- 
2.20.1

* [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots
  2019-02-05 20:54 [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Sean Christopherson
@ 2019-02-05 20:54 ` Sean Christopherson
  2019-02-06  9:12   ` Cornelia Huck
  2019-02-12 12:36 ` [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Paolo Bonzini
       [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com>
  2 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-02-05 20:54 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář,
	James Hogan, Paul Mackerras, Christian Borntraeger,
	Janosch Frank, Christoffer Dall, Marc Zyngier
  Cc: linux-s390, kvm, David Hildenbrand, Cornelia Huck, linux-mips,
	kvm-ppc, Xiao Guangrong, kvmarm, linux-arm-kernel

kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.
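
To illustrate the wrap hazard (the constants and helper below are
hypothetical stand-ins, not mmu.c's actual definitions):

	#define MMIO_GEN_BITS	19
	#define MMIO_GEN_MASK	((1ull << MMIO_GEN_BITS) - 1)

	static bool mmio_spte_is_stale(u64 spte_gen, u64 memslots_gen)
	{
		/*
		 * Only the low bits of the memslots generation fit in the
		 * spte, so after 2^19 updates the truncated value wraps
		 * and a stale spte can look current again.
		 */
		return spte_gen != (memslots_gen & MMIO_GEN_MASK);
	}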

Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots.  Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.

Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/mips/include/asm/kvm_host.h    | 2 +-
 arch/powerpc/include/asm/kvm_host.h | 2 +-
 arch/s390/include/asm/kvm_host.h    | 2 +-
 arch/x86/include/asm/kvm_host.h     | 2 +-
 arch/x86/kvm/mmu.c                  | 4 ++--
 arch/x86/kvm/x86.c                  | 4 ++--
 include/linux/kvm_host.h            | 2 +-
 virt/kvm/arm/mmu.c                  | 2 +-
 virt/kvm/kvm_main.c                 | 7 +++++--
 9 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 71c3f21d80d5..f764a2de68e2 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -1131,7 +1131,7 @@ static inline void kvm_arch_hardware_unsetup(void) {}
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 		struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {}
-static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {}
+static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 0f98f00da2ea..19693b8add93 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -837,7 +837,7 @@ struct kvm_vcpu_arch {
 static inline void kvm_arch_hardware_disable(void) {}
 static inline void kvm_arch_hardware_unsetup(void) {}
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
-static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {}
+static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_exit(void) {}
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index d5d24889c3bc..c2b8c8c6c9be 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -878,7 +878,7 @@ static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 		struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {}
-static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {}
+static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
 static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 		struct kvm_memory_slot *slot) {}
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4660ce90de7f..1c7f726e3ad5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1253,7 +1253,7 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				   struct kvm_memory_slot *slot,
 				   gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
-void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots);
+void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce770b446238..fb040ea42454 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5890,13 +5890,13 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
 	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
 }
 
-void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots)
+void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 {
 	/*
 	 * The very rare case: if the generation-number is round,
 	 * zap all shadow pages.
 	 */
-	if (unlikely((slots->generation & MMIO_GEN_MASK) == 0)) {
+	if (unlikely((gen & MMIO_GEN_MASK) == 0)) {
 		kvm_debug_ratelimited("kvm: zapping shadow pages for mmio generation wraparound\n");
 		kvm_mmu_invalidate_zap_all_pages(kvm);
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9ce1b8e2d3b9..4d05fca04fe1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9337,13 +9337,13 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return -ENOMEM;
 }
 
-void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
+void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 {
 	/*
 	 * memslots->generation has been incremented.
 	 * mmio generation may have reached its maximum value.
 	 */
-	kvm_mmu_invalidate_mmio_sptes(kvm, slots);
+	kvm_mmu_invalidate_mmio_sptes(kvm, gen);
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c38cc5eb7e73..cf761ff58224 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -634,7 +634,7 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
 			   struct kvm_memory_slot *dont);
 int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 			    unsigned long npages);
-void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots);
+void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				struct kvm_memory_slot *memslot,
 				const struct kvm_userspace_memory_region *mem,
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 3053bf2584f8..3ab0daee3fea 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -2350,7 +2350,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return 0;
 }
 
-void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
+void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 {
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf7cc0554094..df5dc7224f02 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -878,6 +878,7 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
 		int as_id, struct kvm_memslots *slots)
 {
 	struct kvm_memslots *old_memslots = __kvm_memslots(kvm, as_id);
+	u64 gen;
 
 	/*
 	 * Set the low bit in the generation, which disables SPTE caching
@@ -900,9 +901,11 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
	 * space 0 will use generations 0, 4, 8, ... while address space 1 will
 	 * use generations 2, 6, 10, 14, ...
 	 */
-	slots->generation += KVM_ADDRESS_SPACE_NUM * 2 - 1;
+	gen = slots->generation + KVM_ADDRESS_SPACE_NUM * 2 - 1;
 
-	kvm_arch_memslots_updated(kvm, slots);
+	kvm_arch_memslots_updated(kvm, gen);
+
+	slots->generation = gen;
 
 	return old_memslots;
 }
-- 
2.20.1

* Re: [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots
  2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson
@ 2019-02-06  9:12   ` Cornelia Huck
  0 siblings, 0 replies; 19+ messages in thread
From: Cornelia Huck @ 2019-02-06  9:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-s390, Janosch Frank, kvm, Marc Zyngier, James Hogan,
	David Hildenbrand, kvm-ppc, linux-mips, Paul Mackerras,
	Christian Borntraeger, Xiao Guangrong, Paolo Bonzini, kvmarm,
	linux-arm-kernel

On Tue,  5 Feb 2019 12:54:17 -0800
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> kvm_arch_memslots_updated() is at this point in time an x86-specific
> hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
> the memslots generation number in its MMIO sptes in order to avoid
> full page fault walks for repeat faults on emulated MMIO addresses.
> Because only 19 bits are used, wrapping the MMIO generation number is
> possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
> the generation has changed so that it can invalidate all MMIO sptes in
> case the effective MMIO generation has wrapped so as to avoid using a
> stale spte, e.g. a (very) old spte that was created with generation==0.
> 
> Given that the purpose of kvm_arch_memslots_updated() is to prevent
> consuming stale entries, it needs to be called before the new generation
> is propagated to memslots.  Invalidating the MMIO sptes after updating
> memslots means that there is a window where a vCPU could dereference
> the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
> spte that was created with (pre-wrap) generation==0.
> 
> Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/mips/include/asm/kvm_host.h    | 2 +-
>  arch/powerpc/include/asm/kvm_host.h | 2 +-
>  arch/s390/include/asm/kvm_host.h    | 2 +-
>  arch/x86/include/asm/kvm_host.h     | 2 +-
>  arch/x86/kvm/mmu.c                  | 4 ++--
>  arch/x86/kvm/x86.c                  | 4 ++--
>  include/linux/kvm_host.h            | 2 +-
>  virt/kvm/arm/mmu.c                  | 2 +-
>  virt/kvm/kvm_main.c                 | 7 +++++--
>  9 files changed, 15 insertions(+), 12 deletions(-)

Not an x86 person, but I think that makes sense.

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

* Re: [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism
  2019-02-05 20:54 [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Sean Christopherson
  2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson
@ 2019-02-12 12:36 ` Paolo Bonzini
       [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com>
  2 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2019-02-12 12:36 UTC (permalink / raw)
  To: Sean Christopherson, Radim Krčmář,
	James Hogan, Paul Mackerras, Christian Borntraeger,
	Janosch Frank, Christoffer Dall, Marc Zyngier
  Cc: linux-s390, kvm, David Hildenbrand, Cornelia Huck, linux-mips,
	kvm-ppc, Xiao Guangrong, kvmarm, linux-arm-kernel

On 05/02/19 21:54, Sean Christopherson wrote:
> ...and clean up the MMIO spte generation code.
> 
> === Non-x86 Folks ===
> Patch 01/27 (which you should also be cc'd on) changes the prototype for
> a function that is defined by all architectures but only actually used
> by x86.  Feel free to skip the rest of this cover letter.
> 
> === x86 Folks ===
> Though not explicitly stated, for all intents and purposes the fast
> invalidate mechanism was added to speed up the scenario where removing
> a memslot, e.g. when accessing PCI ROM, caused KVM to flush all shadow
> entries[1].
> 
> The other use cases of "flush everything" are VM teardown and handling
> MMIO generation overflow, neither of which is a performance-critical
> path (see "Other Uses" below).
> 
> For the memslot case, zapping all shadow entries is overkill, i.e. KVM
> only needs to zap the entries associated with the memslot, but KVM has
> historically used a big hammer because if all you have is a hammer,
> everything looks like a nail.
> 
> Rather than zap all pages when removing a memslot, zap only the shadow
> entries associated with said memslot.  I see a performance improvement
> of ~5% when zapping only the pages for the deleted memslot when using a
> slightly modified version of the original benchmark[2][3][4] (I don't
> have easy access to a system with hundreds of gigs of memory).
> 
> $ cat shell.sh
>         #!/bin/sh
> 
>         echo 3 > /proc/sys/vm/drop_caches
>         ./mmtest -c 8 -m 2000 -e ./rom.sh
> 
> I.e. running 8 threads with 2 GB of memory per thread; times in milliseconds:
> 
> Before: 89.117
> After:  84.768
> 
> 
> With the memslot use case gone, maintaining the fast invalidate path
> adds a moderate amount of complexity but provides little to no value.
> Furthermore, its existence may give the impression that it's "ok" to zap
> all shadow pages.  Remove the fast invalidate mechanism to simplify the
> code and to discourage future code from zapping all pages as using such
> a big hammer should be a last resort.
> 
> Unfortunately, simply reverting the fast invalidate code re-introduces
> shortcomings in the previous code, notably that KVM may hold a spinlock
> and not schedule for an extended period of time.  Releasing the lock to
> voluntarily reschedule is perfectly doable, and for the VM teardown case
> of "zap everything" it can be added with no additional changes since KVM
> doesn't need to provide any ordering guarantees about sptes in that case.
> 
> For the MMIO case, KVM needs to prevent vCPUs from consuming stale sptes
> created with an old generation number.  This should be a non-issue as
> KVM is supposed to prevent consuming stale MMIO sptes regardless of the
> zapping mechanism employed; releasing mmu_lock while zapping simply
> makes it much more likely to hit any bug leading to stale spte usage.
> As luck would have it, there are a number of corner cases in the MMIO
> spte invalidation flow that could theoretically result in reusing stale
> sptes.  So before reworking memslot zapping and reverting to the slow
> invalidate mechanism, fix a number of bugs related to MMIO generations
> and opportunistically clean up the memslots/MMIO generation code.
> 
> 
> === History ===
> Flushing of shadow pages when removing a memslot was originally added
> by commit 34d4cb8fca1f ("KVM: MMU: nuke shadowed pgtable pages and ptes
> on memslot destruction"), and obviously emphasized functionality over
> performance.  Commit 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")
> added a path to allow flushing only the removed slot's shadow pages,
> but x86 just redirected to the "zap all" flow.
> 
> Eventually, it became evident that zapping everything is slow, and so
> x86 developed a more efficient hammer in the form of the fast invalidate
> mechanism.
> 
> === Other Uses ===
> When a VM is being destroyed, either there are no active vcpus, i.e.
> there's no lock contention, or the VM has ungracefully terminated, in
> which case we want to reclaim its pages as quickly as possible, i.e.
> not release the MMU lock if there are still CPUs executing in the VM.
> 
> The MMIO generation scenario is almost literally a one-in-a-million
> occurrence, i.e. is not a performance sensitive scenario.
> 
> It's worth noting that prior to the "fast invalidate" series being
> applied, there was actually another use case of kvm_mmu_zap_all() in
> emulator_fix_hypercall() whose existence may have contributed to
> improving the performance of "zap all" instead of avoiding it altogether.
> But that usage was removed by the series itself in commit 758ccc89b83c
> ("KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall").
> 
> [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
> [2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
> [3] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277.html
> [4] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2
> 
> v1: https://patchwork.kernel.org/cover/10756579/
> v2:
>   - Fix a zap vs flush priority bug in kvm_mmu_remote_flush_or_zap() [me]
>   - Voluntarily reschedule and/or drop mmu_lock as needed when zapping
>     all sptes or all MMIO sptes [Paolo]
>   - Add patches to clean up the memslots/MMIO generation code and fix
>     a variety of theoretical corner case bugs
> 
> 
> Sean Christopherson (27):
>   KVM: Call kvm_arch_memslots_updated() before updating memslots
>   KVM: x86/mmu: Detect MMIO generation wrap in any address space
>   KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
>   KVM: Explicitly define the "memslot update in-progress" bit
>   KVM: x86: Use a u64 when passing the MMIO gen around
>   KVM: x86: Refactor the MMIO SPTE generation handling
>   KVM: Remove the hack to trigger memslot generation wraparound
>   KVM: Move the memslot update in-progress flag to bit 63
>   KVM: x86/mmu: Move slot_level_*() helper functions up a few lines
>   KVM: x86/mmu: Split remote_flush+zap case out of
>     kvm_mmu_flush_or_zap()
>   KVM: x86/mmu: Zap only the relevant pages when removing a memslot
>   Revert "KVM: MMU: document fast invalidate all pages"
>   Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes"
>   KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes
>   KVM: x86/mmu: Remove is_obsolete() call
>   Revert "KVM: MMU: reclaim the zapped-obsolete page first"
>   Revert "KVM: MMU: collapse TLB flushes when zap all pages"
>   Revert "KVM: MMU: zap pages in batch"
>   Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages"
>   Revert "KVM: MMU: show mmu_valid_gen in shadow page related
>     tracepoints"
>   Revert "KVM: x86: use the fast way to invalidate all pages"
>   KVM: x86/mmu: skip over invalid root pages when zapping all sptes
>   KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes
>   Revert "KVM: MMU: fast invalidate all pages"
>   KVM: x86/mmu: Differentiate between nr zapped and list unstable
>   KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
>   KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and
>     kvm_mmu_zap_mmio_sptes()
> 
>  Documentation/virtual/kvm/mmu.txt   |  41 +--
>  arch/mips/include/asm/kvm_host.h    |   2 +-
>  arch/powerpc/include/asm/kvm_host.h |   2 +-
>  arch/s390/include/asm/kvm_host.h    |   2 +-
>  arch/x86/include/asm/kvm_host.h     |   9 +-
>  arch/x86/kvm/mmu.c                  | 442 +++++++++++++---------------
>  arch/x86/kvm/mmu.h                  |   1 -
>  arch/x86/kvm/mmutrace.h             |  42 +--
>  arch/x86/kvm/x86.c                  |   7 +-
>  arch/x86/kvm/x86.h                  |   7 +-
>  include/linux/kvm_host.h            |  23 +-
>  virt/kvm/arm/mmu.c                  |   2 +-
>  virt/kvm/kvm_main.c                 |  39 ++-
>  13 files changed, 286 insertions(+), 333 deletions(-)
> 

Queued, thanks.

Paolo

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
       [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com>
@ 2019-08-13 16:04   ` Alex Williamson
  2019-08-13 17:04     ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2019-08-13 16:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue,  5 Feb 2019 13:01:21 -0800
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> that actually belong to the memslot being removed.  This improves
> performance, especially why the deleted memslot has only a few shadow
> entries, or even no entries.  E.g. a microbenchmark to access regular
> memory while concurrently reading PCI ROM to trigger memslot deletion
> showed a 5% improvement in throughput.
> 
> Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)

A number of vfio users are reporting VM instability issues since v5.1,
some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
only the relevant pages when removing a memslot"), which I've confirmed
via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
re-verified on current 5.3-rc4 using the below patch to toggle the
broken behavior.

My reproducer is a Windows 10 VM with assigned GeForce GPU running a
variety of tests, including FurMark and PassMark Performance Test.
With the code enabled as exists in upstream currently, PassMark will
generally introduce graphics glitches or hangs.  Sometimes it's
necessary to reboot the VM to see these issues.  Flipping the 0/1 in
the below patch appears to resolve the issue.

I'd appreciate any insights into further debugging this block of code
so that we can fix this regression.  Thanks,

Alex

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..6a77306940f7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
+#if 0
+	kvm_mmu_zap_all(kvm);
+#else
 	struct kvm_mmu_page *sp;
 	LIST_HEAD(invalid_list);
 	unsigned long i;
@@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
+#endif
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 16:04   ` [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot Alex Williamson
@ 2019-08-13 17:04     ` Sean Christopherson
  2019-08-13 17:57       ` Alex Williamson
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-08-13 17:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:
> On Tue,  5 Feb 2019 13:01:21 -0800
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > that actually belong to the memslot being removed.  This improves
> > performance, especially when the deleted memslot has only a few shadow
> > entries, or even no entries.  E.g. a microbenchmark to access regular
> > memory while concurrently reading PCI ROM to trigger memslot deletion
> > showed a 5% improvement in throughput.
> > 
> > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> A number of vfio users are reporting VM instability issues since v5.1,
> some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> only the relevant pages when removing a memslot"), which I've confirmed
> via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> re-verified on current 5.3-rc4 using the below patch to toggle the
> broken behavior.
> 
> My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> variety of tests, including FurMark and PassMark Performance Test.
> With the code enabled as exists in upstream currently, PassMark will
> generally introduce graphics glitches or hangs.  Sometimes it's
> necessary to reboot the VM to see these issues.

As in, the issue only shows up when the VM is rebooted?  Just want to
double check that that's not a typo.

> Flipping the 0/1 in the below patch appears to resolve the issue.
> 
> I'd appreciate any insights into further debugging this block of code
> so that we can fix this regression.  Thanks,

If it's not too painful to reproduce, I'd say start by determining whether
it's a problem with the basic logic or if the cond_resched_lock() handling
is wrong.  I.e. comment/ifdef out this chunk:

		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
			flush = false;
			cond_resched_lock(&kvm->mmu_lock);
		}


> 
> Alex
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24843cf49579..6a77306940f7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  			struct kvm_memory_slot *slot,
>  			struct kvm_page_track_notifier_node *node)
>  {
> +#if 0
> +	kvm_mmu_zap_all(kvm);
> +#else
>  	struct kvm_mmu_page *sp;
>  	LIST_HEAD(invalid_list);
>  	unsigned long i;
> @@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  
>  out_unlock:
>  	spin_unlock(&kvm->mmu_lock);
> +#endif
>  }
>  
>  void kvm_mmu_init_vm(struct kvm *kvm)

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 17:04     ` Sean Christopherson
@ 2019-08-13 17:57       ` Alex Williamson
  2019-08-13 19:33         ` Alex Williamson
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2019-08-13 17:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 10:04:41 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:
> > On Tue,  5 Feb 2019 13:01:21 -0800
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >   
> > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > > that actually belong to the memslot being removed.  This improves
> > > performance, especially when the deleted memslot has only a few shadow
> > > entries, or even no entries.  E.g. a microbenchmark to access regular
> > > memory while concurrently reading PCI ROM to trigger memslot deletion
> > > showed a 5% improvement in throughput.
> > > 
> > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > ---
> > >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> > >  1 file changed, 32 insertions(+), 1 deletion(-)  
> > 
> > A number of vfio users are reporting VM instability issues since v5.1,
> > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> > only the relevant pages when removing a memslot"), which I've confirmed
> > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> > re-verified on current 5.3-rc4 using the below patch to toggle the
> > broken behavior.
> > 
> > My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> > variety of tests, including FurMark and PassMark Performance Test.
> > With the code enabled as exists in upstream currently, PassMark will
> > generally introduce graphics glitches or hangs.  Sometimes it's
> > necessary to reboot the VM to see these issues.  
> 
> As in, the issue only shows up when the VM is rebooted?  Just want to
> double check that that's not a typo.

No, it can occur on the first boot as well, it's just that the recipe
to induce a failure is not well understood and manifests itself in
different ways.  I generally run the tests, then if it still hasn't
reproduced, I reboot the VM a couple times, running a couple apps in
between to try to trigger/notice bad behavior.

> > Flipping the 0/1 in the below patch appears to resolve the issue.
> > 
> > I'd appreciate any insights into further debugging this block of code
> > so that we can fix this regression.  Thanks,  
> 
> If it's not too painful to reproduce, I'd say start by determining whether
> it's a problem with the basic logic or if the cond_resched_lock() handling
> is wrong.  I.e. comment/ifdef out this chunk:
> 
> 		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> 			flush = false;
> 			cond_resched_lock(&kvm->mmu_lock);
> 		}

If anything, removing this chunk seems to make things worse.  Thanks,

Alex

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 17:57       ` Alex Williamson
@ 2019-08-13 19:33         ` Alex Williamson
  2019-08-13 20:19           ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2019-08-13 19:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 11:57:37 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 13 Aug 2019 10:04:41 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:  
> > > On Tue,  5 Feb 2019 13:01:21 -0800
> > > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > >     
> > > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > > > that actually belong to the memslot being removed.  This improves
> > > > performance, especially when the deleted memslot has only a few shadow
> > > > entries, or even no entries.  E.g. a microbenchmark to access regular
> > > > memory while concurrently reading PCI ROM to trigger memslot deletion
> > > > showed a 5% improvement in throughput.
> > > > 
> > > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > ---
> > > >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 32 insertions(+), 1 deletion(-)    
> > > 
> > > A number of vfio users are reporting VM instability issues since v5.1,
> > > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> > > only the relevant pages when removing a memslot"), which I've confirmed
> > > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> > > re-verified on current 5.3-rc4 using the below patch to toggle the
> > > broken behavior.
> > > 
> > > My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> > > variety of tests, including FurMark and PassMark Performance Test.
> > > With the code enabled as exists in upstream currently, PassMark will
> > > generally introduce graphics glitches or hangs.  Sometimes it's
> > > necessary to reboot the VM to see these issues.    
> > 
> > As in, the issue only shows up when the VM is rebooted?  Just want to
> > double check that that's not a typo.  
> 
> No, it can occur on the first boot as well, it's just that the recipe
> to induce a failure is not well understood and manifests itself in
> different ways.  I generally run the tests, then if it still hasn't
> reproduced, I reboot the VM a couple times, running a couple apps in
> between to try to trigger/notice bad behavior.
> 
> > > Flipping the 0/1 in the below patch appears to resolve the issue.
> > > 
> > > I'd appreciate any insights into further debugging this block of code
> > > so that we can fix this regression.  Thanks,    
> > 
> > If it's not too painful to reproduce, I'd say start by determining whether
> > it's a problem with the basic logic or if the cond_resched_lock() handling
> > is wrong.  I.e. comment/ifdef out this chunk:
> > 
> > 		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> > 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> > 			flush = false;
> > 			cond_resched_lock(&kvm->mmu_lock);
> > 		}  
> 
> If anything, removing this chunk seems to make things worse.

Could it be something with the gfn test:

                        if (sp->gfn != gfn)
                                continue;

If I remove it, I can't trigger the misbehavior.  If I log it, I only
get hits on VM boot/reboot and some of the gfns look suspiciously like
they could be the assigned GPU BARs and maybe MSI mappings:

               (sp->gfn) != (gfn)
[   71.829450] gfn fec00 != c02c4
[   71.835554] gfn ffe00 != c046f
[   71.841664] gfn 0 != c0720
[   71.847084] gfn 0 != c0720
[   71.852489] gfn 0 != c0720
[   71.857899] gfn 0 != c0720
[   71.863306] gfn 0 != c0720
[   71.868717] gfn 0 != c0720
[   71.874122] gfn 0 != c0720
[   71.879531] gfn 0 != c0720
[   71.884937] gfn 0 != c0720
[   71.890349] gfn 0 != c0720
[   71.895757] gfn 0 != c0720
[   71.901163] gfn 0 != c0720
[   71.906569] gfn 0 != c0720
[   71.911980] gfn 0 != c0720
[   71.917387] gfn 0 != c0720
[   71.922808] gfn fee00 != c0edc
[   71.928915] gfn fee00 != c0edc
[   71.935018] gfn fee00 != c0edc
[   71.941730] gfn c1000 != 8002d7
[   71.948039] gfn 80000 != 8006d5
[   71.954328] gfn 80000 != 8006d5
[   71.960600] gfn 80000 != 8006d5
[   71.966874] gfn 80000 != 8006d5
[   71.992272] gfn 0 != c0720
[   71.997683] gfn 0 != c0720
[   72.003725] gfn 80000 != 8006d5
[   72.044333] gfn 0 != c0720
[   72.049743] gfn 0 != c0720
[   72.055846] gfn 80000 != 8006d5
[   72.177341] gfn ffe00 != c046f
[   72.183453] gfn 0 != c0720
[   72.188864] gfn 0 != c0720
[   72.194290] gfn 0 != c0720
[   72.200308] gfn 80000 != 8006d5
[   82.539023] gfn fec00 != c02c4
[   82.545142] gfn 40000 != c0377
[   82.551343] gfn ffe00 != c046f
[   82.557466] gfn 100000 != c066f
[   82.563839] gfn 800000 != c06ec
[   82.570133] gfn 800000 != c06ec
[   82.576408] gfn 0 != c0720
[   82.581850] gfn 0 != c0720
[   82.587275] gfn 0 != c0720
[   82.592685] gfn 0 != c0720
[   82.598131] gfn 0 != c0720
[   82.603552] gfn 0 != c0720
[   82.608978] gfn 0 != c0720
[   82.614419] gfn 0 != c0720
[   82.619844] gfn 0 != c0720
[   82.625291] gfn 0 != c0720
[   82.630791] gfn 0 != c0720
[   82.636208] gfn 0 != c0720
[   82.641635] gfn 80a000 != c085e
[   82.647929] gfn fee00 != c0edc
[   82.654062] gfn fee00 != c0edc
[   82.660504] gfn 100000 != c066f
[   82.666800] gfn 0 != c0720
[   82.672211] gfn 0 != c0720
[   82.677635] gfn 0 != c0720
[   82.683060] gfn 0 != c0720
[   82.689209] gfn c1000 != 8002d7
[   82.695501] gfn 80000 != 8006d5
[   82.701796] gfn 80000 != 8006d5
[   82.708092] gfn 100000 != 80099b
[   82.714547] gfn 0 != 800a4c
[   82.720154] gfn 0 != 800a4c
[   82.725752] gfn 0 != 800a4c
[   82.731370] gfn 0 != 800a4c
[   82.738705] gfn 100000 != 80099b
[   82.745201] gfn 0 != 800a4c
[   82.750793] gfn 0 != 800a4c
[   82.756381] gfn 0 != 800a4c
[   82.761979] gfn 0 != 800a4c
[   82.768122] gfn 100000 != 8083a4
[   82.774605] gfn 0 != 8094aa
[   82.780196] gfn 0 != 8094aa
[   82.785796] gfn 0 != 8094aa
[   82.791412] gfn 0 != 8094aa
[   82.797523] gfn 100000 != 8083a4
[   82.803977] gfn 0 != 8094aa
[   82.809576] gfn 0 != 8094aa
[   82.815193] gfn 0 != 8094aa
[   82.820809] gfn 0 != 8094aa

(GPU has a BAR mapped at 0x80000000)

Is this gfn optimization correct?  Overzealous?  Doesn't account
correctly for something about MMIO mappings?  Thanks,

Alex

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 19:33         ` Alex Williamson
@ 2019-08-13 20:19           ` Sean Christopherson
  2019-08-13 20:37             ` Paolo Bonzini
  2019-08-15 15:23             ` Alex Williamson
  0 siblings, 2 replies; 19+ messages in thread
From: Sean Christopherson @ 2019-08-13 20:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote:
> On Tue, 13 Aug 2019 11:57:37 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:

> Could it be something with the gfn test:
> 
>                         if (sp->gfn != gfn)
>                                 continue;
> 
> If I remove it, I can't trigger the misbehavior.  If I log it, I only
> get hits on VM boot/reboot and some of the gfns look suspiciously like
> they could be the assigned GPU BARs and maybe MSI mappings:
> 
>                (sp->gfn) != (gfn)

Hits at boot/reboot make sense; memslots get zapped when userspace
removes a memory region/slot, e.g. remaps BARs and whatnot.

...
 
> Is this gfn optimization correct?  Overzealous?  Doesn't account
> correctly for something about MMIO mappings?  Thanks,

Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
all entries for a given gfn.  The sp->gfn check is there to skip entries
that hashed to the same list but for a completely different gfn.

Skipping the gfn check would be sort of a lightweight zap all in the
sense that it would zap shadow pages that happened to collide with the
target memslot/gfn but are otherwise unrelated.
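
Paraphrasing the shape of that walk (for_each_valid_sp() is a macro
over the per-gfn hash bucket in the real code; this is a sketch, not
the exact source):

	struct hlist_head *bucket;

	bucket = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
	hlist_for_each_entry(sp, bucket, hash_link) {
		if (sp->gfn != gfn)	/* hash collision, unrelated gfn */
			continue;
		/* sp really maps 'gfn', so it's eligible for zapping. */
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
	}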

What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:

	if (sp->gfn != gfn && sp->gfn != 0x80000)
		continue;

If that doesn't work, it might be worth trying other gfns to see if you
can pinpoint which sp is being zapped as collateral damage.

It's possible there is a pre-existing bug somewhere else that was being
hidden because KVM was effectively zapping all SPTEs during (re)boot,
and the hash collision is also hiding the bug by zapping the stale entry.

Of course it's also possible this code is wrong, :-)

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 20:19           ` Sean Christopherson
@ 2019-08-13 20:37             ` Paolo Bonzini
  2019-08-13 21:14               ` Alex Williamson
  2019-08-15 15:23             ` Alex Williamson
  1 sibling, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2019-08-13 20:37 UTC (permalink / raw)
  To: Sean Christopherson, Alex Williamson
  Cc: Radim Krčmář, kvm, Xiao Guangrong

On 13/08/19 22:19, Sean Christopherson wrote:
> Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> all entries for a given gfn.  The sp->gfn check is there to skip entries
> that hashed to the same list but for a completely different gfn.
> 
> Skipping the gfn check would be sort of a lightweight zap all in the
> sense that it would zap shadow pages that happend to collide with the
> target memslot/gfn but are otherwise unrelated.
> 
> What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> 
> 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> 		continue;
> 
> If that doesn't work, it might be worth trying other gfns to see if you
> can pinpoint which sp is being zapped as collateral damage.
> 
> It's possible there is a pre-existing bug somewhere else that was being
> hidden because KVM was effectively zapping all SPTEs during (re)boot,
> and the hash collision is also hiding the bug by zapping the stale entry.
> 
> Of course it's also possible this code is wrong, :-)

Also, can you reproduce it with one vCPU?  This could (though not really
100%) distinguish a missing invalidation from a race condition.

Do we even need the call to slot_handle_all_level?  The rmap update
should be done already by kvm_mmu_prepare_zap_page (via
kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).

Alex, can you replace it with just "flush = false;"?

Thanks,

Paolo

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 20:37             ` Paolo Bonzini
@ 2019-08-13 21:14               ` Alex Williamson
  2019-08-13 21:15                 ` Paolo Bonzini
  2019-08-15 14:46                 ` Sean Christopherson
  0 siblings, 2 replies; 19+ messages in thread
From: Alex Williamson @ 2019-08-13 21:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 22:37:14 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/08/19 22:19, Sean Christopherson wrote:
> > Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> > all entries for a given gfn.  The sp->gfn check is there to skip entries
> > that hashed to the same list but for a completely different gfn.
> > 
> > Skipping the gfn check would be sort of a lightweight zap all in the
> > sense that it would zap shadow pages that happened to collide with the
> > target memslot/gfn but are otherwise unrelated.
> > 
> > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> > 
> > 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> > 		continue;

Not having any luck with this yet.  Tried 0x80000, 0x8xxxxx, 0.
 
> > If that doesn't work, it might be worth trying other gfns to see if you
> > can pinpoint which sp is being zapped as collateral damage.
> > 
> > It's possible there is a pre-existing bug somewhere else that was being
> > hidden because KVM was effectively zapping all SPTEs during (re)boot,
> > and the hash collision is also hiding the bug by zapping the stale entry.
> > 
> > Of course it's also possible this code is wrong, :-)  
> 
> Also, can you reproduce it with one vCPU?  This could (though not really
> 100%) distinguish a missing invalidation from a race condition.

That's a pretty big change; I'll give it a shot, but I'm not sure how
conclusive it would be.

> Do we even need the call to slot_handle_all_level?  The rmap update
> should be done already by kvm_mmu_prepare_zap_page (via
> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
> 
> Alex, can you replace it with just "flush = false;"?

Replace the continue w/ flush = false?  I'm not clear on this
suggestion.  Thanks,

Alex

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 21:14               ` Alex Williamson
@ 2019-08-13 21:15                 ` Paolo Bonzini
  2019-08-13 22:10                   ` Alex Williamson
  2019-08-15 14:46                 ` Sean Christopherson
  1 sibling, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2019-08-13 21:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Sean Christopherson, Radim Krčmář, kvm, Xiao Guangrong

On 13/08/19 23:14, Alex Williamson wrote:
>> Do we even need the call to slot_handle_all_level?  The rmap update
>> should be done already by kvm_mmu_prepare_zap_page (via
>> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
>>
>> Alex, can you replace it with just "flush = false;"?
> Replace the continue w/ flush = false?  I'm not clear on this
> suggestion.  Thanks,

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..382b3ee303e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm
 	if (list_empty(&kvm->arch.active_mmu_pages))
 		goto out_unlock;

-	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
+	flush = false;

 	for (i = 0; i < slot->npages; i++) {
 		gfn = slot->base_gfn + i;


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 21:15                 ` Paolo Bonzini
@ 2019-08-13 22:10                   ` Alex Williamson
  0 siblings, 0 replies; 19+ messages in thread
From: Alex Williamson @ 2019-08-13 22:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 23:15:38 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/08/19 23:14, Alex Williamson wrote:
> >> Do we even need the call to slot_handle_all_level?  The rmap update
> >> should be done already by kvm_mmu_prepare_zap_page (via
> >> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
> >>
> >> Alex, can you replace it with just "flush = false;"?  
> > Replace the continue w/ flush = false?  I'm not clear on this
> > suggestion.  Thanks,  
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24843cf49579..382b3ee303e3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm
>  	if (list_empty(&kvm->arch.active_mmu_pages))
>  		goto out_unlock;
> 
> -	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
> +	flush = false;
> 
>  	for (i = 0; i < slot->npages; i++) {
>  		gfn = slot->base_gfn + i;
> 

Things are not happy with that; it managed to crash the host:

[  871.244789] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  871.258680] #PF: supervisor read access in kernel mode
[  871.268939] #PF: error_code(0x0000) - not-present page
[  871.279202] PGD 0 P4D 0 
[  871.284264] Oops: 0000 [#1] SMP PTI
[  871.291232] CPU: 3 PID: 1954 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
[  871.304264] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[  871.322523] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
[  871.332085] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
[  871.369560] RSP: 0018:ffffc10300a0bb18 EFLAGS: 00010202
[  871.379995] RAX: 00000000000c1040 RBX: ffffc103007cd000 RCX: 0000000000000014
[  871.394243] RDX: ffff9b3bbff18130 RSI: 00000000000c1080 RDI: 0000000000000004
[  871.408491] RBP: ffff9b3bced15400 R08: ffff9b3b94a90008 R09: ffff9b3b94a90000
[  871.422739] R10: ffff9b3b94a90168 R11: 0000000000000004 R12: ffff9b3bbff18130
[  871.436986] R13: ffffc103007cd000 R14: 0000000000000000 R15: 00000000000c1000
[  871.451234] FS:  00007fc27b37d700(0000) GS:ffff9b3f9f4c0000(0000) knlGS:0000000000000000
[  871.467390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  871.478865] CR2: 0000000000000000 CR3: 000000003bb2a004 CR4: 00000000001626e0
[  871.493113] Call Trace:
[  871.498012]  drop_spte+0x77/0xa0 [kvm]
[  871.505500]  mmu_page_zap_pte+0xac/0xe0 [kvm]
[  871.514200]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
[  871.524809]  kvm_mmu_invalidate_zap_pages_in_memslot+0xb7/0x130 [kvm]
[  871.537670]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
[  871.547928]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
[  871.558191]  kvm_set_memory_region+0x26/0x40 [kvm]
[  871.567759]  kvm_vm_ioctl+0x59a/0x940 [kvm]
[  871.576106]  ? __seccomp_filter+0x7a/0x680
[  871.584287]  do_vfs_ioctl+0xa4/0x630
[  871.591430]  ? security_file_ioctl+0x32/0x50
[  871.599954]  ksys_ioctl+0x60/0x90
[  871.606575]  __x64_sys_ioctl+0x16/0x20
[  871.614066]  do_syscall_64+0x5f/0x1a0
[  871.621379]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

It also failed with the 1 vcpu test.  Thanks,

Alex

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 21:14               ` Alex Williamson
  2019-08-13 21:15                 ` Paolo Bonzini
@ 2019-08-15 14:46                 ` Sean Christopherson
  1 sibling, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2019-08-15 14:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

[-- Attachment #1: Type: text/plain, Size: 1208 bytes --]

On Tue, Aug 13, 2019 at 03:14:17PM -0600, Alex Williamson wrote:
> On Tue, 13 Aug 2019 22:37:14 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > On 13/08/19 22:19, Sean Christopherson wrote:
> > > Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> > > all entries for a given gfn.  The sp->gfn check is there to skip entries
> > > that hashed to the same list but for a completely different gfn.
> > > 
> > > Skipping the gfn check would be sort of a lightweight zap all in the
> > > sense that it would zap shadow pages that happened to collide with the
> > > target memslot/gfn but are otherwise unrelated.
> > > 
> > > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> > > 
> > > 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> > > 		continue;
> 
> Not having any luck with this yet.  Tried 0x80000, 0x8xxxxx, 0.

I've no idea if it would actually be interesting, but something to try
would be to zap only emulated mmio SPTEs (in addition to the memslot).
If that test passes then I think it might indicate a problem with device
enumeration as opposed to the mapping of the device itself ("think" and
"might" being the operative words).  Patch attached.

[-- Attachment #2: emulated_mmio_only.patch --]
[-- Type: text/x-diff, Size: 947 bytes --]

From df7d1bf7025c20703f71e126dc673ba58230605f Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson@intel.com>
Date: Thu, 15 Aug 2019 07:27:05 -0700
Subject: [PATCH] tmp

---
 arch/x86/kvm/mmu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..d5bbb7ed716f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5649,6 +5649,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	return alloc_mmu_pages(vcpu);
 }
 
+static void __kvm_mmu_zap_all(struct kvm *kvm, bool mmio_only);
+
 static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
@@ -5685,6 +5687,8 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
+
+	__kvm_mmu_zap_all(kvm, true);
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)
-- 
2.22.0



* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 20:19           ` Sean Christopherson
  2019-08-13 20:37             ` Paolo Bonzini
@ 2019-08-15 15:23             ` Alex Williamson
  2019-08-15 16:00               ` Sean Christopherson
  1 sibling, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2019-08-15 15:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 13:19:14 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote:
> > On Tue, 13 Aug 2019 11:57:37 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:  
> 
> > Could it be something with the gfn test:
> > 
> >                         if (sp->gfn != gfn)
> >                                 continue;
> > 
> > If I remove it, I can't trigger the misbehavior.  If I log it, I only
> > get hits on VM boot/reboot and some of the gfns look suspiciously like
> > they could be the assigned GPU BARs and maybe MSI mappings:
> > 
> >                (sp->gfn) != (gfn)  
> 
> Hits at boot/reboot makes sense, memslots get zapped when userspace
> removes a memory region/slot, e.g. remaps BARs and whatnot.
> 
> ...
>  
> > Is this gfn optimization correct?  Overzealous?  Doesn't account
> > correctly for something about MMIO mappings?  Thanks,  
> 
> Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> all entries for a given gfn.  The sp->gfn check is there to skip entries
> that hashed to the same list but for a completely different gfn.
> 
> Skipping the gfn check would be sort of a lightweight zap all in the
> sense that it would zap shadow pages that happened to collide with the
> target memslot/gfn but are otherwise unrelated.
> 
> What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> 
> 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> 		continue;
> 
> If that doesn't work, it might be worth trying other gfns to see if you
> can pinpoint which sp is being zapped as collateral damage.
> 
> It's possible there is a pre-existing bug somewhere else that was being
> hidden because KVM was effectively zapping all SPTEs during (re)boot,
> and the hash collision is also hiding the bug by zapping the stale entry.
> 
> Of course it's also possible this code is wrong, :-)
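
To make the collision concrete, here is a standalone toy model of the
gfn-hashed shadow page list.  Names loosely follow mmu.c, but this is
illustrative only: the real code uses hash_64() and hlists, not the
simple mask and singly linked list below.

	#include <stdio.h>

	#define HASH_BITS	10
	#define NR_BUCKETS	(1 << HASH_BITS)

	struct sp {
		unsigned long gfn;
		struct sp *next;
	};

	static struct sp *buckets[NR_BUCKETS];

	/* A plain mask instead of hash_64(); same collision effect. */
	static unsigned long sp_hash(unsigned long gfn)
	{
		return gfn & (NR_BUCKETS - 1);
	}

	static void sp_insert(struct sp *s)
	{
		unsigned long h = sp_hash(s->gfn);

		s->next = buckets[h];
		buckets[h] = s;
	}

	static void zap_gfn(unsigned long gfn)
	{
		struct sp *s;

		for (s = buckets[sp_hash(gfn)]; s; s = s->next) {
			if (s->gfn != gfn)	/* skip unrelated colliders */
				continue;
			printf("zapping sp for gfn 0x%lx\n", s->gfn);
		}
	}

	int main(void)
	{
		struct sp a = { .gfn = 0x80000 }, b = { .gfn = 0xc0000 };

		sp_insert(&a);
		sp_insert(&b);
		/* Both gfns mask to bucket 0; without the s->gfn check,
		 * zapping 0x80000 would also zap the sp for 0xc0000. */
		zap_gfn(0x80000);
		return 0;
	}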

Ok, fun day of trying to figure out which ranges are relevant; I've
narrowed it down to all of these:

0xffe00
0xfee00
0xfec00
0xc1000
0x80a000
0x800000
0x100000

i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
0x80000 can take the continue branch without seeing bad behavior in the
VM.

The assigned GPU has BARs at GPAs:

0xc0000000-0xc0ffffff
0x800000000-0x808000000
0x808000000-0x809ffffff

And the assigned companion audio function is at GPA:

0xc1080000-0xc1083fff

Only one of those seems to align very well with a gfn base involved
here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
otherwise I don't find any other I/O devices coincident with the gfns
above.

I'm running the VM with 2MB hugepages, but I believe the issue still
occurs with standard pages.  When run with standard pages I see more
hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
the set above that cannot take the continue branch.  I don't know if
that means anything.

Any further ideas what to look for?  Thanks,

Alex

PS - I see the posted workaround patch, I'll test that in the interim.


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-15 15:23             ` Alex Williamson
@ 2019-08-15 16:00               ` Sean Christopherson
  2019-08-15 18:16                 ` Alex Williamson
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-08-15 16:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote:
> Ok, fun day of trying to figure out which ranges are relevant; I've
> narrowed it down to all of these:
> 
> 0xffe00
> 0xfee00
> 0xfec00

APIC and I/O APIC stuff

> 0xc1000

Assigned audio

> 0x80a000

?

> 0x800000

GPU BAR

> 0x100000

?

The APIC ranges are puzzling; I wouldn't expect their mappings to change.

> i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> 0x80000 can take the continue branch without seeing bad behavior in the
> VM.
> 
> The assigned GPU has BARs at GPAs:
> 
> 0xc0000000-0xc0ffffff
> 0x800000000-0x808000000
> 0x808000000-0x809ffffff
> 
> And the assigned companion audio function is at GPA:
> 
> 0xc1080000-0xc1083fff
> 
> Only one of those seems to align very well with a gfn base involved
> here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> otherwise I don't find any other I/O devices coincident with the gfns
> above.
> 
> I'm running the VM with 2MB hugepages, but I believe the issue still
> occurs with standard pages.  When run with standard pages I see more
> hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> the set above that cannot take the continue branch.  I don't know if
> that means anything.
> 
> Any further ideas what to look for?  Thanks,

Maybe try isolating which memslot removal causes problems?  E.g. flush
the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush
only the memslot's gfns.  Based on the log you sent a while back for gfn
mismatches, I'm guessing the culprits are all GPU BARs, but it's
probably worth confirming.  That might also explain why gfn == 0x80000
can take the continue branch, i.e. if removing the corresponding memslot
is what's causing problems, then it's being flushed and not actually
taking the continue path.

One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
Maybe it's a case where there are no SPTEs for the memslot, but the TLB
flush is needed for some reason.
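
Roughly like this (an untested sketch against the same function, not a
real patch; XXX, YYY, ZZZ stand in for whichever suspect base_gfns
you're isolating, and the @@ lines are descriptions rather than actual
hunk headers):

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ near the top of kvm_mmu_invalidate_zap_pages_in_memslot():
+	/*
+	 * Experiment 1: fall back to a full zap for the suspect slots
+	 * only.  Done before taking mmu_lock, since kvm_mmu_zap_all()
+	 * acquires it itself.
+	 */
+	if (slot->base_gfn == XXX || slot->base_gfn == YYY ||
+	    slot->base_gfn == ZZZ) {
+		kvm_mmu_zap_all(kvm);
+		return;
+	}
+
 	spin_lock(&kvm->mmu_lock);
@@ and at the bottom of the function:
+	/* Experiment 2: force the remote TLB flush unconditionally. */
+	flush = true;
 	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);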


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-15 16:00               ` Sean Christopherson
@ 2019-08-15 18:16                 ` Alex Williamson
  2019-08-15 19:25                   ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2019-08-15 18:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Thu, 15 Aug 2019 09:00:06 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote:
> > Ok, fun day of trying to figure out which ranges are relevant; I've
> > narrowed it down to all of these:
> > 
> > 0xffe00
> > 0xfee00
> > 0xfec00  
> 
> APIC and I/O APIC stuff
> 
> > 0xc1000  
> 
> Assigned audio
> 
> > 0x80a000  
> 
> ?
> 
> > 0x800000  
> 
> GPU BAR
> 
> > 0x100000  
> 
> ?
> 
> The APIC ranges are puzzling; I wouldn't expect their mappings to change.
> 
> > i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> > 0x80000 can take the continue branch without seeing bad behavior in the
> > VM.
> > 
> > The assigned GPU has BARs at GPAs:
> > 
> > 0xc0000000-0xc0ffffff
> > 0x800000000-0x808000000
> > 0x808000000-0x809ffffff
> > 
> > And the assigned companion audio function is at GPA:
> > 
> > 0xc1080000-0xc1083fff
> > 
> > Only one of those seems to align very well with a gfn base involved
> > here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> > otherwise I don't find any other I/O devices coincident with the gfns
> > above.
> > 
> > I'm running the VM with 2MB hugepages, but I believe the issue still
> > occurs with standard pages.  When run with standard pages I see more
> > hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> > the set above that cannot take the continue branch.  I don't know if
> > that means anything.
> > 
> > Any further ideas what to look for?  Thanks,  
> 
> Maybe try isolating which memslot removal causes problems?  E.g. flush
> the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush
> only the memslot's gfns.  Based on the log you sent a while back for gfn
> mismatches, I'm guessing the culprits are all GPU BARs, but it's
> probably worth confirming.  That might also explain why gfn == 0x80000
> can take the continue branch, i.e. if removing the corresponding memslot
> is what's causing problems, then it's being flushed and not actually
> taking the continue path.

If I print out the memslot base_gfn, it seems pretty evident that only
the assigned device mappings are triggering this branch.  The base_gfns
exclusively include:

 0x800000
 0x808000
 0xc0089

Where the first two clearly match the 64bit BARs and the last is the
result of a page that we need to emulate within the BAR @0xc0000000 at
offset 0x88000, so the base_gfn is the remaining direct mapping.
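
(Sanity-checking that last one: the BAR at 0xc0000000 is gfn 0xc0000,
the emulated page at BAR offset 0x88000 is gfn 0xc0088, so the
direct-mapped slot covering the remainder of the BAR above that page
starts at base_gfn 0xc0089, which matches the third value above.)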

I don't know if this implies we're doing something wrong for assigned
device slots, but maybe a more targeted workaround would be if we could
specifically identify these slots, though there's no special
registration of them versus other slots.  Did you have any non-device
assignment test cases that took this branch when developing the series?

> One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
> e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
> Maybe it's a case where there are no SPTEs for the memslot, but the TLB
> flush is needed for some reason.

This doesn't work.  Thanks,

Alex


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-15 18:16                 ` Alex Williamson
@ 2019-08-15 19:25                   ` Sean Christopherson
  2019-08-15 20:11                     ` Alex Williamson
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-08-15 19:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote:
> On Thu, 15 Aug 2019 09:00:06 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> If I print out the memslot base_gfn, it seems pretty evident that only
> the assigned device mappings are triggering this branch.  The base_gfns
> exclusively include:
> 
>  0x800000
>  0x808000
>  0xc0089
> 
> Where the first two clearly match the 64bit BARs and the last is the
> result of a page that we need to emulate within the BAR @0xc0000000 at
> offset 0x88000, so the base_gfn is the remaining direct mapping.

That's consistent with my understanding of userspace, e.g. normal memory
regions aren't deleted until the VM is shut down (barring hot unplug).

> I don't know if this implies we're doing something wrong for assigned
> device slots, but maybe a more targeted workaround would be if we could
> specifically identify these slots, though there's no special
> registration of them versus other slots.

What is triggering the memslot removal/update?  Is it possible that
whatever action is occurring is modifying multiple memslots?  E.g. KVM's
memslot-only zapping is allowing the guest to access stale entries for
the unzapped-but-related memslots, whereas the full zap does not.

FYI, my VFIO/GPU/PCI knowledge is abysmal, please speak up if any of my
ideas are nonsensical.

> Did you have any non-device
> assignment test cases that took this branch when developing the series?

The primary testing was performance oriented, using a slightly modified
version of a synthetic benchmark[1] from a previous series[2] that touched
the memslot flushing flow.  From a functional perspective, I highly doubt
that test would have been able to expose an improper zapping bug.

We do have some amount of coverage via kvm-unit-tests, as an EPT test was
triggering a slab bug due to not actually zapping the collected SPTEs[3].

[1] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2
[2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
[3] https://patchwork.kernel.org/patch/10899283/

> > One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
> > e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
> > Maybe it's a case where there are no SPTEs for the memslot, but the TLB
> > flush is needed for some reason.
> 
> This doesn't work.  Thanks,
> 
> Alex


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-15 19:25                   ` Sean Christopherson
@ 2019-08-15 20:11                     ` Alex Williamson
  0 siblings, 0 replies; 19+ messages in thread
From: Alex Williamson @ 2019-08-15 20:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Thu, 15 Aug 2019 12:25:31 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote:
> > On Thu, 15 Aug 2019 09:00:06 -0700
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > If I print out the memslot base_gfn, it seems pretty evident that only
> > the assigned device mappings are triggering this branch.  The base_gfns
> > exclusively include:
> > 
> >  0x800000
> >  0x808000
> >  0xc0089
> > 
> > Where the first two clearly match the 64bit BARs and the last is the
> > result of a page that we need to emulate within the BAR @0xc0000000 at
> > offset 0x88000, so the base_gfn is the remaining direct mapping.  
> 
> That's consistent with my understanding of userspace, e.g. normal memory
> regions aren't deleted until the VM is shut down (barring hot unplug).
> 
> > I don't know if this implies we're doing something wrong for assigned
> > device slots, but maybe a more targeted workaround would be if we could
> > specifically identify these slots, though there's no special
> > registration of them versus other slots.  
> 
> What is triggering the memslot removal/update?  Is it possible that
> whatever action is occurring is modifying multiple memslots?  E.g. KVM's
> memslot-only zapping is allowing the guest to access stale entries for
> the unzapped-but-related memslots, whereas the full zap does not.
> 
> FYI, my VFIO/GPU/PCI knowledge is abysmal, please speak up if any of my
> ideas are nonsensical.

The memory bit in the PCI command register of config space for each
device controls whether the device decoders are active for the MMIO BAR
ranges.  These get flipped as both the guest firmware and guest OS
enumerate and assign resources to the PCI subsystem.  Generally these
are not further manipulated while the guest OS is running except for
hotplug operations.  The guest OS device driver will generally perform
the final enable of these ranges and they'll remain enabled until the
guest is rebooted.

I recall somewhere in this thread you referenced reading the ROM as
part of the performance testing of this series.  The ROM has its own
enable bit within the ROM BAR as the PCI spec allows devices to share
decoders between the standard BARs and the ROM BAR.  Enabling and
disabling the enable bit in the ROM BAR should be very similar in
memslot behavior to the overall memory enable bit for the other BARs
within the device.
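
For reference, these are the config space bits involved, offsets per
the PCI spec (cf. linux/pci_regs.h).  A minimal userspace sketch, with
a placeholder device path and error handling omitted, just to show
which bits get flipped; in the guest it's the emulated config space
rather than host sysfs, but the decode bits are the same:

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	#define PCI_COMMAND		0x04
	#define PCI_COMMAND_MEMORY	0x2	/* MMIO BAR decode enable */
	#define PCI_ROM_ADDRESS		0x30
	#define PCI_ROM_ADDRESS_ENABLE	0x1	/* ROM decode enable */

	int main(void)
	{
		/* Placeholder device address. */
		int fd = open("/sys/bus/pci/devices/0000:01:00.0/config",
			      O_RDWR);
		uint16_t cmd, off, on;

		pread(fd, &cmd, sizeof(cmd), PCI_COMMAND);
		off = cmd & ~PCI_COMMAND_MEMORY;
		on  = cmd | PCI_COMMAND_MEMORY;

		/* Each flip unmaps/remaps the BAR MMIO regions in the
		 * VMM, i.e. a KVM memslot update, which is what reaches
		 * kvm_mmu_invalidate_zap_pages_in_memslot(). */
		pwrite(fd, &off, sizeof(off), PCI_COMMAND);
		pwrite(fd, &on, sizeof(on), PCI_COMMAND);
		close(fd);
		return 0;
	}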

Note that the faults I'm seeing often occur a long time after BAR
mappings are finalized; usually (not always) the VM boots to a fully
functional desktop and it's only as I run various tests that the
glitches start to appear.  For instance, when I allowed sp->gfn
0xfec00 to take
the continue branch, I got an OpenCL error.  For either 0xffee00 or
0xc1000 I got graphics glitches, for example stray geometric artifacts
flashed on the screen.  For 0x100000 and 0x800000 I'd get a black
screen or blank 3D graphics window.  For 0x80a000 the VM froze
(apparently).  I can't say whether each of these is a consistent failure
mode, I only tested to the point of determining whether a range
generates an error.

> > Did you have any non-device
> > assignment test cases that took this branch when developing the series?  
> 
> The primary testing was performance oriented, using a slightly modified
> version of a synthetic benchmark[1] from a previous series[2] that touched
> the memslot flushing flow.  From a functional perspective, I highly doubt
> that test would have been able to expose an improper zapping bug.

:-\

It seems like there's probably some sort of inflection point where it
becomes faster to zap all pages versus the overhead of walking every
page in a memory slot; was that evaluated?  Not sure if that's relevant
here, but curious.  Thanks,

Alex


Thread overview: 19+ messages
2019-02-05 20:54 [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Sean Christopherson
2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson
2019-02-06  9:12   ` Cornelia Huck
2019-02-12 12:36 ` [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Paolo Bonzini
     [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com>
2019-08-13 16:04   ` [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot Alex Williamson
2019-08-13 17:04     ` Sean Christopherson
2019-08-13 17:57       ` Alex Williamson
2019-08-13 19:33         ` Alex Williamson
2019-08-13 20:19           ` Sean Christopherson
2019-08-13 20:37             ` Paolo Bonzini
2019-08-13 21:14               ` Alex Williamson
2019-08-13 21:15                 ` Paolo Bonzini
2019-08-13 22:10                   ` Alex Williamson
2019-08-15 14:46                 ` Sean Christopherson
2019-08-15 15:23             ` Alex Williamson
2019-08-15 16:00               ` Sean Christopherson
2019-08-15 18:16                 ` Alex Williamson
2019-08-15 19:25                   ` Sean Christopherson
2019-08-15 20:11                     ` Alex Williamson
