* [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism
@ 2019-02-05 20:54 Sean Christopherson
  2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Sean Christopherson @ 2019-02-05 20:54 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář,
	James Hogan, Paul Mackerras, Christian Borntraeger,
	Janosch Frank, Christoffer Dall, Marc Zyngier
  Cc: linux-s390, kvm, David Hildenbrand, Cornelia Huck, linux-mips,
	kvm-ppc, Xiao Guangrong, kvmarm, linux-arm-kernel

...and clean up the MMIO spte generation code.

=== Non-x86 Folks ===
Patch 01/27 (which you should also be cc'd on) changes the prototype for
a function that is defined by all architectures but only actually used
by x86.  Feel free to skip the rest of this cover letter.

=== x86 Folks ===
Though not explicitly stated, for all intents and purposes the fast
invalidate mechanism was added to speed up the scenario where removing
a memslot, e.g. when accessing PCI ROM, caused KVM to flush all shadow
entries[1].

The other use cases of "flush everything" are VM teardown and handling
MMIO generation overflow, neither of which is a performance critical
path (see "Other Uses" below).

For the memslot case, zapping all shadow entries is overkill, i.e. KVM
only needs to zap the entries associated with the memslot, but KVM has
historically used a big hammer because if all you have is a hammer,
everything looks like a nail.

Rather than zap all pages when removing a memslot, zap only the shadow
entries associated with said memslot.  I see a performance improvement
of ~5% when zapping only the pages for the deleted memslot when using a
slightly modified version of the original benchmark[2][3][4] (I don't
have easy access to a system with hundreds of gigs of memory).

$ cat shell.sh
        #!/bin/sh

        echo 3 > /proc/sys/vm/drop_caches
        ./mmtest -c 8 -m 2000 -e ./rom.sh

I.e. running 8 threads with 2gb of memory per thread; time in milliseconds:

Before: 89.117
After:  84.768
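
At its core, the new behavior (patch 11) boils down to the following
simplified sketch; the real code also bails early if there are no
active shadow pages, voluntarily reschedules, and batches TLB flushes:

        struct kvm_mmu_page *sp;
        LIST_HEAD(invalid_list);
        bool flush = false;
        unsigned long i;
        gfn_t gfn;

        spin_lock(&kvm->mmu_lock);
        for (i = 0; i < slot->npages; i++) {
                gfn = slot->base_gfn + i;
                /* Walk the hash bucket for this gfn, skipping entries
                 * that merely collided in the hash. */
                for_each_valid_sp(kvm, sp, gfn) {
                        if (sp->gfn != gfn)
                                continue;
                        flush |= kvm_mmu_prepare_zap_page(kvm, sp,
                                                          &invalid_list);
                }
        }
        kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
        spin_unlock(&kvm->mmu_lock);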


With the memslot use case gone, maintaining the fast invalidate path
adds a moderate amount of complexity but provides little to no value.
Furthermore, its existence may give the impression that it's "ok" to zap
all shadow pages.  Remove the fast invalidate mechanism to simplify the
code and to discourage future code from zapping all pages as using such
a big hammer should be a last resort.

Unfortunately, simply reverting the fast invalidate code re-introduces
shortcomings in the previous code, notably that KVM may hold a spinlock
and not schedule for an extended period of time.  Releasing the lock to
voluntarily reschedule is perfectly doable, and for the VM teardown case
of "zap everything" it can be added with no additional changes since KVM
doesn't need to provide any ordering guarantees about sptes in that case.
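
Roughly, the pattern used by the new zap flows (patches 14 and 23)
looks like this, with the flush bookkeeping varying per call site:

        if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
                /* Commit pending zaps and flush TLBs before yielding
                 * so no vCPU can observe a half-zapped page table. */
                kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
                flush = false;
                cond_resched_lock(&kvm->mmu_lock);
        }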

For the MMIO case, KVM needs to prevent vCPUs from consuming stale sptes
created with an old generation number.  This should be a non-issue as
KVM is supposed to prevent consuming stale MMIO sptes regardless of the
zapping mechanism employed, releasing mmu_lock while zapping simply
makes it much more likely to hit any bug leading to stale spte usage.
As luck would have it, there are a number of corner cases in the MMIO
spte invalidation flow that could theoretically result in reusing stale
sptes.  So before reworking memslot zapping and reverting to the slow
invalidate mechanism, fix a number of bugs related to MMIO generations
and opportunistically clean up the memslots/MMIO generation code.
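
For reference, the consumption-side check boils down to the following
(paraphrasing mmu.c; the generation bits are in reality scattered
within the spte rather than sitting behind a single mask):

        static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
        {
                u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_GEN_MASK;

                /* A cached MMIO spte is usable only if it was created
                 * in the current memslots generation. */
                return get_mmio_spte_generation(spte) == gen;
        }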


=== History ===
Flushing of shadow pages when removing a memslot was originally added
by commit 34d4cb8fca1f ("KVM: MMU: nuke shadowed pgtable pages and ptes
on memslot destruction"), and obviously emphasized functionality over
performance.  Commit 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")
added a path to allow flushing only the removed slot's shadow pages,
but x86 just redirected to the "zap all" flow.

Eventually, it became evident that zapping everything is slow, and so
x86 developed a more efficient hammer in the form of the fast invalidate
mechanism.

=== Other Uses ===
When a VM is being destroyed, either there are no active vcpus, i.e.
there's no lock contention, or the VM has ungracefully terminated, in
which case we want to reclaim its pages as quickly as possible, i.e.
not release the MMU lock if there are still CPUs executing in the VM.

The MMIO generation scenario is almost literally a one-in-a-million
occurrence, i.e. is not a performance sensitive scenario.
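
(For perspective: with 19 bits of the generation stashed in each MMIO
spte, see patch 01, a wrap requires on the order of 2^19 = 524,288
memslot updates, so "one-in-a-million" is off by barely a factor of two.)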

It's worth noting that prior to the "fast invalidate" series being
applied, there was actually another use case of kvm_mmu_zap_all() in
emulator_fix_hypercall(), whose existence may help explain why "zap all"
was optimized rather than avoided altogether.
But that usage was removed by the series itself in commit 758ccc89b83c
("KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall").

[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
[2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
[3] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277.html
[4] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2

v1: https://patchwork.kernel.org/cover/10756579/
v2:
  - Fix a zap vs flush priority bug in kvm_mmu_remote_flush_or_zap() [me]
  - Voluntarily reschedule and/or drop mmu_lock as needed when zapping
    all sptes or all MMIO sptes [Paolo]
  - Add patches to clean up the memslots/MMIO generation code and fix
    a variety of theoretical corner case bugs


Sean Christopherson (27):
  KVM: Call kvm_arch_memslots_updated() before updating memslots
  KVM: x86/mmu: Detect MMIO generation wrap in any address space
  KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
  KVM: Explicitly define the "memslot update in-progress" bit
  KVM: x86: Use a u64 when passing the MMIO gen around
  KVM: x86: Refactor the MMIO SPTE generation handling
  KVM: Remove the hack to trigger memslot generation wraparound
  KVM: Move the memslot update in-progress flag to bit 63
  KVM: x86/mmu: Move slot_level_*() helper functions up a few lines
  KVM: x86/mmu: Split remote_flush+zap case out of
    kvm_mmu_flush_or_zap()
  KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  Revert "KVM: MMU: document fast invalidate all pages"
  Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes"
  KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes
  KVM: x86/mmu: Remove is_obsolete() call
  Revert "KVM: MMU: reclaim the zapped-obsolete page first"
  Revert "KVM: MMU: collapse TLB flushes when zap all pages"
  Revert "KVM: MMU: zap pages in batch"
  Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages"
  Revert "KVM: MMU: show mmu_valid_gen in shadow page related
    tracepoints"
  Revert "KVM: x86: use the fast way to invalidate all pages"
  KVM: x86/mmu: skip over invalid root pages when zapping all sptes
  KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes
  Revert "KVM: MMU: fast invalidate all pages"
  KVM: x86/mmu: Differentiate between nr zapped and list unstable
  KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
  KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and
    kvm_mmu_zap_mmio_sptes()

 Documentation/virtual/kvm/mmu.txt   |  41 +--
 arch/mips/include/asm/kvm_host.h    |   2 +-
 arch/powerpc/include/asm/kvm_host.h |   2 +-
 arch/s390/include/asm/kvm_host.h    |   2 +-
 arch/x86/include/asm/kvm_host.h     |   9 +-
 arch/x86/kvm/mmu.c                  | 442 +++++++++++++---------------
 arch/x86/kvm/mmu.h                  |   1 -
 arch/x86/kvm/mmutrace.h             |  42 +--
 arch/x86/kvm/x86.c                  |   7 +-
 arch/x86/kvm/x86.h                  |   7 +-
 include/linux/kvm_host.h            |  23 +-
 virt/kvm/arm/mmu.c                  |   2 +-
 virt/kvm/kvm_main.c                 |  39 ++-
 13 files changed, 286 insertions(+), 333 deletions(-)

-- 
2.20.1


* [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots
  2019-02-05 20:54 [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Sean Christopherson
@ 2019-02-05 20:54 ` Sean Christopherson
  2019-02-06  9:12   ` Cornelia Huck
  2019-02-12 12:36 ` [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Paolo Bonzini
       [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com>
  2 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2019-02-05 20:54 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář,
	James Hogan, Paul Mackerras, Christian Borntraeger,
	Janosch Frank, Christoffer Dall, Marc Zyngier
  Cc: linux-s390, kvm, David Hildenbrand, Cornelia Huck, linux-mips,
	kvm-ppc, Xiao Guangrong, kvmarm, linux-arm-kernel

kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.

Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots.  Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.
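
Illustrative interleaving of the pre-patch window, assuming the
generation has just wrapped back to 0:

    vCPU (MMIO page fault)          install_new_memslots()
    ----------------------          ----------------------
                                    slots->generation = 0 (wrapped)
    reads generation == 0
    finds an ancient MMIO spte
      with embedded generation 0
      and happily reuses it
                                    kvm_arch_memslots_updated()
                                      -> invalidate MMIO sptes (too late)

Passing the new generation to kvm_arch_memslots_updated() before
storing it in slots->generation closes the window.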

Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/mips/include/asm/kvm_host.h    | 2 +-
 arch/powerpc/include/asm/kvm_host.h | 2 +-
 arch/s390/include/asm/kvm_host.h    | 2 +-
 arch/x86/include/asm/kvm_host.h     | 2 +-
 arch/x86/kvm/mmu.c                  | 4 ++--
 arch/x86/kvm/x86.c                  | 4 ++--
 include/linux/kvm_host.h            | 2 +-
 virt/kvm/arm/mmu.c                  | 2 +-
 virt/kvm/kvm_main.c                 | 7 +++++--
 9 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 71c3f21d80d5..f764a2de68e2 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -1131,7 +1131,7 @@ static inline void kvm_arch_hardware_unsetup(void) {}
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 		struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {}
-static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {}
+static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 0f98f00da2ea..19693b8add93 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -837,7 +837,7 @@ struct kvm_vcpu_arch {
 static inline void kvm_arch_hardware_disable(void) {}
 static inline void kvm_arch_hardware_unsetup(void) {}
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
-static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {}
+static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_exit(void) {}
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index d5d24889c3bc..c2b8c8c6c9be 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -878,7 +878,7 @@ static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 		struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {}
-static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {}
+static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
 static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 		struct kvm_memory_slot *slot) {}
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4660ce90de7f..1c7f726e3ad5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1253,7 +1253,7 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				   struct kvm_memory_slot *slot,
 				   gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
-void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots);
+void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce770b446238..fb040ea42454 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5890,13 +5890,13 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
 	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
 }
 
-void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots)
+void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 {
 	/*
 	 * The very rare case: if the generation-number is round,
 	 * zap all shadow pages.
 	 */
-	if (unlikely((slots->generation & MMIO_GEN_MASK) == 0)) {
+	if (unlikely((gen & MMIO_GEN_MASK) == 0)) {
 		kvm_debug_ratelimited("kvm: zapping shadow pages for mmio generation wraparound\n");
 		kvm_mmu_invalidate_zap_all_pages(kvm);
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9ce1b8e2d3b9..4d05fca04fe1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9337,13 +9337,13 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return -ENOMEM;
 }
 
-void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
+void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 {
 	/*
 	 * memslots->generation has been incremented.
 	 * mmio generation may have reached its maximum value.
 	 */
-	kvm_mmu_invalidate_mmio_sptes(kvm, slots);
+	kvm_mmu_invalidate_mmio_sptes(kvm, gen);
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c38cc5eb7e73..cf761ff58224 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -634,7 +634,7 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
 			   struct kvm_memory_slot *dont);
 int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 			    unsigned long npages);
-void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots);
+void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				struct kvm_memory_slot *memslot,
 				const struct kvm_userspace_memory_region *mem,
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 3053bf2584f8..3ab0daee3fea 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -2350,7 +2350,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return 0;
 }
 
-void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
+void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 {
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf7cc0554094..df5dc7224f02 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -878,6 +878,7 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
 		int as_id, struct kvm_memslots *slots)
 {
 	struct kvm_memslots *old_memslots = __kvm_memslots(kvm, as_id);
+	u64 gen;
 
 	/*
 	 * Set the low bit in the generation, which disables SPTE caching
@@ -900,9 +901,11 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
 	 * space 0 will use generations 0, 4, 8, ... while address space 1 will
 	 * use generations 2, 6, 10, 14, ...
 	 */
-	slots->generation += KVM_ADDRESS_SPACE_NUM * 2 - 1;
+	gen = slots->generation + KVM_ADDRESS_SPACE_NUM * 2 - 1;
 
-	kvm_arch_memslots_updated(kvm, slots);
+	kvm_arch_memslots_updated(kvm, gen);
+
+	slots->generation = gen;
 
 	return old_memslots;
 }
-- 
2.20.1


* Re: [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots
  2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson
@ 2019-02-06  9:12   ` Cornelia Huck
  0 siblings, 0 replies; 38+ messages in thread
From: Cornelia Huck @ 2019-02-06  9:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-s390, Janosch Frank, kvm, Marc Zyngier, James Hogan,
	David Hildenbrand, kvm-ppc, linux-mips, Paul Mackerras,
	Christian Borntraeger, Xiao Guangrong, Paolo Bonzini, kvmarm,
	linux-arm-kernel

On Tue,  5 Feb 2019 12:54:17 -0800
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> kvm_arch_memslots_updated() is at this point in time an x86-specific
> hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
> the memslots generation number in its MMIO sptes in order to avoid
> full page fault walks for repeat faults on emulated MMIO addresses.
> Because only 19 bits are used, wrapping the MMIO generation number is
> possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
> the generation has changed so that it can invalidate all MMIO sptes in
> case the effective MMIO generation has wrapped so as to avoid using a
> stale spte, e.g. a (very) old spte that was created with generation==0.
> 
> Given that the purpose of kvm_arch_memslots_updated() is to prevent
> consuming stale entries, it needs to be called before the new generation
> is propagated to memslots.  Invalidating the MMIO sptes after updating
> memslots means that there is a window where a vCPU could dereference
> the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
> spte that was created with (pre-wrap) generation==0.
> 
> Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/mips/include/asm/kvm_host.h    | 2 +-
>  arch/powerpc/include/asm/kvm_host.h | 2 +-
>  arch/s390/include/asm/kvm_host.h    | 2 +-
>  arch/x86/include/asm/kvm_host.h     | 2 +-
>  arch/x86/kvm/mmu.c                  | 4 ++--
>  arch/x86/kvm/x86.c                  | 4 ++--
>  include/linux/kvm_host.h            | 2 +-
>  virt/kvm/arm/mmu.c                  | 2 +-
>  virt/kvm/kvm_main.c                 | 7 +++++--
>  9 files changed, 15 insertions(+), 12 deletions(-)

Not an x86 person, but I think that makes sense.

Reviewed-by: Cornelia Huck <cohuck@redhat.com>


* Re: [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism
  2019-02-05 20:54 [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Sean Christopherson
  2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson
@ 2019-02-12 12:36 ` Paolo Bonzini
       [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com>
  2 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2019-02-12 12:36 UTC (permalink / raw)
  To: Sean Christopherson, Radim Krčmář,
	James Hogan, Paul Mackerras, Christian Borntraeger,
	Janosch Frank, Christoffer Dall, Marc Zyngier
  Cc: linux-s390, kvm, David Hildenbrand, Cornelia Huck, linux-mips,
	kvm-ppc, Xiao Guangrong, kvmarm, linux-arm-kernel

On 05/02/19 21:54, Sean Christopherson wrote:
> ...and clean up the MMIO spte generation code.
> 
> === Non-x86 Folks ===
> Patch 01/27 (which you should also be cc'd on) changes the prototype for
> a function that is defined by all architectures but only actually used
> by x86.  Feel free to skip the rest of this cover letter.
> 
> === x86 Folks ===
> Though not explicitly stated, for all intents and purposes the fast
> invalidate mechanism was added to speed up the scenario where removing
> a memslot, e.g. when accessing PCI ROM, caused KVM to flush all shadow
> entries[1].
> 
> The other use cases of "flush everything" are VM teardown and handling
> MMIO generation overflow, neither of which is a performance critical
> path (see "Other Uses" below).
> 
> For the memslot case, zapping all shadow entries is overkill, i.e. KVM
> only needs to zap the entries associated with the memslot, but KVM has
> historically used a big hammer because if all you have is a hammer,
> everything looks like a nail.
> 
> Rather than zap all pages when removing a memslot, zap only the shadow
> entries associated with said memslot.  I see a performance improvement
> of ~5% when zapping only the pages for the deleted memslot when using a
> slightly modified version of the original benchmark[2][3][4] (I don't
> have easy access to a system with hundreds of gigs of memory).
> 
> $ cat shell.sh
>         #!/bin/sh
> 
>         echo 3 > /proc/sys/vm/drop_caches
>         ./mmtest -c 8 -m 2000 -e ./rom.sh
> 
> I.e. running 8 threads with 2gb of memory per thread; time in milliseconds:
> 
> Before: 89.117
> After:  84.768
> 
> 
> With the memslot use case gone, maintaining the fast invalidate path
> adds a moderate amount of complexity but provides little to no value.
> Furthermore, its existence may give the impression that it's "ok" to zap
> all shadow pages.  Remove the fast invalidate mechanism to simplify the
> code and to discourage future code from zapping all pages as using such
> a big hammer should be a last resort.
> 
> Unfortunately, simply reverting the fast invalidate code re-introduces
> shortcomings in the previous code, notably that KVM may hold a spinlock
> and not schedule for an extended period of time.  Releasing the lock to
> voluntarily reschedule is perfectly doable, and for the VM teardown case
> of "zap everything" it can be added with no additional changes since KVM
> doesn't need to provide any ordering guarantees about sptes in that case.
> 
> For the MMIO case, KVM needs to prevent vCPUs from consuming stale sptes
> created with an old generation number.  This should be a non-issue as
> KVM is supposed to prevent consuming stale MMIO sptes regardless of the
> zapping mechanism employed, releasing mmu_lock while zapping simply
> makes it much more likely to hit any bug leading to stale spte usage.
> As luck would have it, there are a number of corner cases in the MMIO
> spte invalidation flow that could theoretically result in reusing stale
> sptes.  So before reworking memslot zapping and reverting to the slow
> invalidate mechanism, fix a number of bugs related to MMIO generations
> and opportunistically clean up the memslots/MMIO generation code.
> 
> 
> === History ===
> Flushing of shadow pages when removing a memslot was originally added
> by commit 34d4cb8fca1f ("KVM: MMU: nuke shadowed pgtable pages and ptes
> on memslot destruction"), and obviously emphasized functionality over
> performance.  Commit 2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")
> added a path to allow flushing only the removed slot's shadow pages,
> but x86 just redirected to the "zap all" flow.
> 
> Eventually, it became evident that zapping everything is slow, and so
> x86 developed a more efficient hammer in the form of the fast invalidate
> mechanism.
> 
> === Other Uses ===
> When a VM is being destroyed, either there are no active vcpus, i.e.
> there's no lock contention, or the VM has ungracefully terminated, in
> which case we want to reclaim its pages as quickly as possible, i.e.
> not release the MMU lock if there are still CPUs executing in the VM.
> 
> The MMIO generation scenario is almost literally a one-in-a-million
> occurrence, i.e. is not a performance sensitive scenario.
> 
> It's worth noting that prior to the "fast invalidate" series being
> applied, there was actually another use case of kvm_mmu_zap_all() in
> emulator_fix_hypercall(), whose existence may help explain why "zap all"
> was optimized rather than avoided altogether.
> But that usage was removed by the series itself in commit 758ccc89b83c
> ("KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall").
> 
> [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
> [2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
> [3] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277.html
> [4] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2
> 
> v1: https://patchwork.kernel.org/cover/10756579/
> v2:
>   - Fix a zap vs flush priority bug in kvm_mmu_remote_flush_or_zap() [me]
>   - Voluntarily reschedule and/or drop mmu_lock as needed when zapping
>     all sptes or all MMIO sptes [Paolo]
>   - Add patches to clean up the memslots/MMIO generation code and fix
>     a variety of theoretical corner case bugs
> 
> 
> Sean Christopherson (27):
>   KVM: Call kvm_arch_memslots_updated() before updating memslots
>   KVM: x86/mmu: Detect MMIO generation wrap in any address space
>   KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
>   KVM: Explicitly define the "memslot update in-progress" bit
>   KVM: x86: Use a u64 when passing the MMIO gen around
>   KVM: x86: Refactor the MMIO SPTE generation handling
>   KVM: Remove the hack to trigger memslot generation wraparound
>   KVM: Move the memslot update in-progress flag to bit 63
>   KVM: x86/mmu: Move slot_level_*() helper functions up a few lines
>   KVM: x86/mmu: Split remote_flush+zap case out of
>     kvm_mmu_flush_or_zap()
>   KVM: x86/mmu: Zap only the relevant pages when removing a memslot
>   Revert "KVM: MMU: document fast invalidate all pages"
>   Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes"
>   KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes
>   KVM: x86/mmu: Remove is_obsolete() call
>   Revert "KVM: MMU: reclaim the zapped-obsolete page first"
>   Revert "KVM: MMU: collapse TLB flushes when zap all pages"
>   Revert "KVM: MMU: zap pages in batch"
>   Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages"
>   Revert "KVM: MMU: show mmu_valid_gen in shadow page related
>     tracepoints"
>   Revert "KVM: x86: use the fast way to invalidate all pages"
>   KVM: x86/mmu: skip over invalid root pages when zapping all sptes
>   KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes
>   Revert "KVM: MMU: fast invalidate all pages"
>   KVM: x86/mmu: Differentiate between nr zapped and list unstable
>   KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
>   KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and
>     kvm_mmu_zap_mmio_sptes()
> 
>  Documentation/virtual/kvm/mmu.txt   |  41 +--
>  arch/mips/include/asm/kvm_host.h    |   2 +-
>  arch/powerpc/include/asm/kvm_host.h |   2 +-
>  arch/s390/include/asm/kvm_host.h    |   2 +-
>  arch/x86/include/asm/kvm_host.h     |   9 +-
>  arch/x86/kvm/mmu.c                  | 442 +++++++++++++---------------
>  arch/x86/kvm/mmu.h                  |   1 -
>  arch/x86/kvm/mmutrace.h             |  42 +--
>  arch/x86/kvm/x86.c                  |   7 +-
>  arch/x86/kvm/x86.h                  |   7 +-
>  include/linux/kvm_host.h            |  23 +-
>  virt/kvm/arm/mmu.c                  |   2 +-
>  virt/kvm/kvm_main.c                 |  39 ++-
>  13 files changed, 286 insertions(+), 333 deletions(-)
> 

Queued, thanks.

Paolo


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
       [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com>
@ 2019-08-13 16:04   ` Alex Williamson
  2019-08-13 17:04     ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Alex Williamson @ 2019-08-13 16:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue,  5 Feb 2019 13:01:21 -0800
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> that actually belong to the memslot being removed.  This improves
> performance, especially when the deleted memslot has only a few shadow
> entries, or even no entries.  E.g. a microbenchmark to access regular
> memory while concurrently reading PCI ROM to trigger memslot deletion
> showed a 5% improvement in throughput.
> 
> Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)

A number of vfio users are reporting VM instability issues since v5.1,
some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
only the relevant pages when removing a memslot"), which I've confirmed
via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
re-verified on current 5.3-rc4 using the below patch to toggle the
broken behavior.

My reproducer is a Windows 10 VM with assigned GeForce GPU running a
variety of tests, including FurMark and PassMark Performance Test.
With the code enabled as exists in upstream currently, PassMark will
generally introduce graphics glitches or hangs.  Sometimes it's
necessary to reboot the VM to see these issues.  Flipping the 0/1 in
the below patch appears to resolve the issue.

I'd appreciate any insights into further debugging this block of code
so that we can fix this regression.  Thanks,

Alex

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..6a77306940f7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
+#if 0
+	kvm_mmu_zap_all(kvm);
+#else
 	struct kvm_mmu_page *sp;
 	LIST_HEAD(invalid_list);
 	unsigned long i;
@@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
+#endif
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 16:04   ` [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot Alex Williamson
@ 2019-08-13 17:04     ` Sean Christopherson
  2019-08-13 17:57       ` Alex Williamson
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2019-08-13 17:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:
> On Tue,  5 Feb 2019 13:01:21 -0800
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > that actually belong to the memslot being removed.  This improves
> > performance, especially when the deleted memslot has only a few shadow
> > entries, or even no entries.  E.g. a microbenchmark to access regular
> > memory while concurrently reading PCI ROM to trigger memslot deletion
> > showed a 5% improvement in throughput.
> > 
> > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> A number of vfio users are reporting VM instability issues since v5.1,
> some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> only the relevant pages when removing a memslot"), which I've confirmed
> via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> re-verified on current 5.3-rc4 using the below patch to toggle the
> broken behavior.
> 
> My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> variety of tests, including FurMark and PassMark Performance Test.
> With the code enabled as exists in upstream currently, PassMark will
> generally introduce graphics glitches or hangs.  Sometimes it's
> necessary to reboot the VM to see these issues.

As in, the issue only shows up when the VM is rebooted?  Just want to
double check that that's not a typo.

> Flipping the 0/1 in the below patch appears to resolve the issue.
> 
> I'd appreciate any insights into further debugging this block of code
> so that we can fix this regression.  Thanks,

If it's not too painful to reproduce, I'd say start by determining whether
it's a problem with the basic logic or if the cond_resched_lock() handling
is wrong.  I.e. comment/ifdef out this chunk:

		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
			flush = false;
			cond_resched_lock(&kvm->mmu_lock);
		}


> 
> Alex
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24843cf49579..6a77306940f7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5653,6 +5653,9 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  			struct kvm_memory_slot *slot,
>  			struct kvm_page_track_notifier_node *node)
>  {
> +#if 0
> +	kvm_mmu_zap_all(kvm);
> +#else
>  	struct kvm_mmu_page *sp;
>  	LIST_HEAD(invalid_list);
>  	unsigned long i;
> @@ -5685,6 +5688,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  
>  out_unlock:
>  	spin_unlock(&kvm->mmu_lock);
> +#endif
>  }
>  
>  void kvm_mmu_init_vm(struct kvm *kvm)


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 17:04     ` Sean Christopherson
@ 2019-08-13 17:57       ` Alex Williamson
  2019-08-13 19:33         ` Alex Williamson
  0 siblings, 1 reply; 38+ messages in thread
From: Alex Williamson @ 2019-08-13 17:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 10:04:41 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:
> > On Tue,  5 Feb 2019 13:01:21 -0800
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >   
> > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > > that actually belong to the memslot being removed.  This improves
> > > performance, especially when the deleted memslot has only a few shadow
> > > entries, or even no entries.  E.g. a microbenchmark to access regular
> > > memory while concurrently reading PCI ROM to trigger memslot deletion
> > > showed a 5% improvement in throughput.
> > > 
> > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > ---
> > >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> > >  1 file changed, 32 insertions(+), 1 deletion(-)  
> > 
> > A number of vfio users are reporting VM instability issues since v5.1,
> > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> > only the relevant pages when removing a memslot"), which I've confirmed
> > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> > re-verified on current 5.3-rc4 using the below patch to toggle the
> > broken behavior.
> > 
> > My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> > variety of tests, including FurMark and PassMark Performance Test.
> > With the code enabled as exists in upstream currently, PassMark will
> > generally introduce graphics glitches or hangs.  Sometimes it's
> > necessary to reboot the VM to see these issues.  
> 
> As in, the issue only shows up when the VM is rebooted?  Just want to
> double check that that's not a typo.

No, it can occur on the first boot as well, it's just that the recipe
to induce a failure is not well understood and manifests itself in
different ways.  I generally run the tests, then if it still hasn't
reproduced, I reboot the VM a couple times, running a couple apps in
between to try to trigger/notice bad behavior.

> > Flipping the 0/1 in the below patch appears to resolve the issue.
> > 
> > I'd appreciate any insights into further debugging this block of code
> > so that we can fix this regression.  Thanks,  
> 
> If it's not too painful to reproduce, I'd say start by determining whether
> it's a problem with the basic logic or if the cond_resched_lock() handling
> is wrong.  I.e. comment/ifdef out this chunk:
> 
> 		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> 			flush = false;
> 			cond_resched_lock(&kvm->mmu_lock);
> 		}

If anything, removing this chunk seems to make things worse.  Thanks,

Alex


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 17:57       ` Alex Williamson
@ 2019-08-13 19:33         ` Alex Williamson
  2019-08-13 20:19           ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Alex Williamson @ 2019-08-13 19:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 11:57:37 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 13 Aug 2019 10:04:41 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > On Tue, Aug 13, 2019 at 10:04:58AM -0600, Alex Williamson wrote:  
> > > On Tue,  5 Feb 2019 13:01:21 -0800
> > > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > >     
> > > > Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
> > > > handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
> > > > that actually belong to the memslot being removed.  This improves
> > > > performance, especially when the deleted memslot has only a few shadow
> > > > entries, or even no entries.  E.g. a microbenchmark to access regular
> > > > memory while concurrently reading PCI ROM to trigger memslot deletion
> > > > showed a 5% improvement in throughput.
> > > > 
> > > > Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
> > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > ---
> > > >  arch/x86/kvm/mmu.c | 33 ++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 32 insertions(+), 1 deletion(-)    
> > > 
> > > A number of vfio users are reporting VM instability issues since v5.1,
> > > some have traced it back to this commit 4e103134b862 ("KVM: x86/mmu: Zap
> > > only the relevant pages when removing a memslot"), which I've confirmed
> > > via bisection of the 5.1 merge window KVM pull (636deed6c0bc) and
> > > re-verified on current 5.3-rc4 using the below patch to toggle the
> > > broken behavior.
> > > 
> > > My reproducer is a Windows 10 VM with assigned GeForce GPU running a
> > > variety of tests, including FurMark and PassMark Performance Test.
> > > With the code enabled as exists in upstream currently, PassMark will
> > > generally introduce graphics glitches or hangs.  Sometimes it's
> > > necessary to reboot the VM to see these issues.    
> > 
> > As in, the issue only shows up when the VM is rebooted?  Just want to
> > double check that that's not a typo.  
> 
> No, it can occur on the first boot as well, it's just that the recipe
> to induce a failure is not well understood and manifests itself in
> different ways.  I generally run the tests, then if it still hasn't
> reproduced, I reboot the VM a couple times, running a couple apps in
> between to try to trigger/notice bad behavior.
> 
> > > Flipping the 0/1 in the below patch appears to resolve the issue.
> > > 
> > > I'd appreciate any insights into further debugging this block of code
> > > so that we can fix this regression.  Thanks,    
> > 
> > If it's not too painful to reproduce, I'd say start by determining whether
> > it's a problem with the basic logic or if the cond_resched_lock() handling
> > is wrong.  I.e. comment/ifdef out this chunk:
> > 
> > 		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> > 			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> > 			flush = false;
> > 			cond_resched_lock(&kvm->mmu_lock);
> > 		}  
> 
> If anything, removing this chunk seems to make things worse.

Could it be something with the gfn test:

                        if (sp->gfn != gfn)
                                continue;

If I remove it, I can't trigger the misbehavior.  If I log it, I only
get hits on VM boot/reboot and some of the gfns look suspiciously like
they could be the assigned GPU BARs and maybe MSI mappings:

               (sp->gfn) != (gfn)
[   71.829450] gfn fec00 != c02c4
[   71.835554] gfn ffe00 != c046f
[   71.841664] gfn 0 != c0720
[   71.847084] gfn 0 != c0720
[   71.852489] gfn 0 != c0720
[   71.857899] gfn 0 != c0720
[   71.863306] gfn 0 != c0720
[   71.868717] gfn 0 != c0720
[   71.874122] gfn 0 != c0720
[   71.879531] gfn 0 != c0720
[   71.884937] gfn 0 != c0720
[   71.890349] gfn 0 != c0720
[   71.895757] gfn 0 != c0720
[   71.901163] gfn 0 != c0720
[   71.906569] gfn 0 != c0720
[   71.911980] gfn 0 != c0720
[   71.917387] gfn 0 != c0720
[   71.922808] gfn fee00 != c0edc
[   71.928915] gfn fee00 != c0edc
[   71.935018] gfn fee00 != c0edc
[   71.941730] gfn c1000 != 8002d7
[   71.948039] gfn 80000 != 8006d5
[   71.954328] gfn 80000 != 8006d5
[   71.960600] gfn 80000 != 8006d5
[   71.966874] gfn 80000 != 8006d5
[   71.992272] gfn 0 != c0720
[   71.997683] gfn 0 != c0720
[   72.003725] gfn 80000 != 8006d5
[   72.044333] gfn 0 != c0720
[   72.049743] gfn 0 != c0720
[   72.055846] gfn 80000 != 8006d5
[   72.177341] gfn ffe00 != c046f
[   72.183453] gfn 0 != c0720
[   72.188864] gfn 0 != c0720
[   72.194290] gfn 0 != c0720
[   72.200308] gfn 80000 != 8006d5
[   82.539023] gfn fec00 != c02c4
[   82.545142] gfn 40000 != c0377
[   82.551343] gfn ffe00 != c046f
[   82.557466] gfn 100000 != c066f
[   82.563839] gfn 800000 != c06ec
[   82.570133] gfn 800000 != c06ec
[   82.576408] gfn 0 != c0720
[   82.581850] gfn 0 != c0720
[   82.587275] gfn 0 != c0720
[   82.592685] gfn 0 != c0720
[   82.598131] gfn 0 != c0720
[   82.603552] gfn 0 != c0720
[   82.608978] gfn 0 != c0720
[   82.614419] gfn 0 != c0720
[   82.619844] gfn 0 != c0720
[   82.625291] gfn 0 != c0720
[   82.630791] gfn 0 != c0720
[   82.636208] gfn 0 != c0720
[   82.641635] gfn 80a000 != c085e
[   82.647929] gfn fee00 != c0edc
[   82.654062] gfn fee00 != c0edc
[   82.660504] gfn 100000 != c066f
[   82.666800] gfn 0 != c0720
[   82.672211] gfn 0 != c0720
[   82.677635] gfn 0 != c0720
[   82.683060] gfn 0 != c0720
[   82.689209] gfn c1000 != 8002d7
[   82.695501] gfn 80000 != 8006d5
[   82.701796] gfn 80000 != 8006d5
[   82.708092] gfn 100000 != 80099b
[   82.714547] gfn 0 != 800a4c
[   82.720154] gfn 0 != 800a4c
[   82.725752] gfn 0 != 800a4c
[   82.731370] gfn 0 != 800a4c
[   82.738705] gfn 100000 != 80099b
[   82.745201] gfn 0 != 800a4c
[   82.750793] gfn 0 != 800a4c
[   82.756381] gfn 0 != 800a4c
[   82.761979] gfn 0 != 800a4c
[   82.768122] gfn 100000 != 8083a4
[   82.774605] gfn 0 != 8094aa
[   82.780196] gfn 0 != 8094aa
[   82.785796] gfn 0 != 8094aa
[   82.791412] gfn 0 != 8094aa
[   82.797523] gfn 100000 != 8083a4
[   82.803977] gfn 0 != 8094aa
[   82.809576] gfn 0 != 8094aa
[   82.815193] gfn 0 != 8094aa
[   82.820809] gfn 0 != 8094aa

(GPU has a BAR mapped at 0x80000000)

Is this gfn optimization correct?  Overzealous?  Doesn't account
correctly for something about MMIO mappings?  Thanks,

Alex


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 19:33         ` Alex Williamson
@ 2019-08-13 20:19           ` Sean Christopherson
  2019-08-13 20:37             ` Paolo Bonzini
  2019-08-15 15:23             ` Alex Williamson
  0 siblings, 2 replies; 38+ messages in thread
From: Sean Christopherson @ 2019-08-13 20:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote:
> On Tue, 13 Aug 2019 11:57:37 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:

> Could it be something with the gfn test:
> 
>                         if (sp->gfn != gfn)
>                                 continue;
> 
> If I remove it, I can't trigger the misbehavior.  If I log it, I only
> get hits on VM boot/reboot and some of the gfns look suspiciously like
> they could be the assigned GPU BARs and maybe MSI mappings:
> 
>                (sp->gfn) != (gfn)

Hits at boot/reboot make sense; memslots get zapped when userspace
removes a memory region/slot, e.g. remaps BARs and whatnot.

...
 
> Is this gfn optimization correct?  Overzealous?  Doesn't account
> correctly for something about MMIO mappings?  Thanks,

Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
all entries for a given gfn.  The sp->gfn check is there to skip entries
that hashed to the same list but for a completely different gfn.

Skipping the gfn check would be sort of a lightweight zap all in the
sense that it would zap shadow pages that happened to collide with the
target memslot/gfn but are otherwise unrelated.

What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:

	if (sp->gfn != gfn && sp->gfn != 0x80000)
		continue;

If that doesn't work, it might be worth trying other gfns to see if you
can pinpoint which sp is being zapped as collateral damage.

It's possible there is a pre-existing bug somewhere else that was being
hidden because KVM was effectively zapping all SPTEs during (re)boot,
and the hash collision is also hiding the bug by zapping the stale entry.

Of course it's also possible this code is wrong, :-)


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 20:19           ` Sean Christopherson
@ 2019-08-13 20:37             ` Paolo Bonzini
  2019-08-13 21:14               ` Alex Williamson
  2019-08-15 15:23             ` Alex Williamson
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2019-08-13 20:37 UTC (permalink / raw)
  To: Sean Christopherson, Alex Williamson
  Cc: Radim Krčmář, kvm, Xiao Guangrong

On 13/08/19 22:19, Sean Christopherson wrote:
> Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> all entries for a given gfn.  The sp->gfn check is there to skip entries
> that hashed to the same list but for a completely different gfn.
> 
> Skipping the gfn check would be sort of a lightweight zap all in the
> sense that it would zap shadow pages that happend to collide with the
> target memslot/gfn but are otherwise unrelated.
> 
> What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> 
> 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> 		continue;
> 
> If that doesn't work, it might be worth trying other gfns to see if you
> can pinpoint which sp is being zapped as collateral damage.
> 
> It's possible there is a pre-existing bug somewhere else that was being
> hidden because KVM was effectively zapping all SPTEs during (re)boot,
> and the hash collision is also hiding the bug by zapping the stale entry.
> 
> Of course it's also possible this code is wrong, :-)

Also, can you reproduce it with one vCPU?  This could (though not really
100%) distinguish a missing invalidation from a race condition.

Do we even need the call to slot_handle_all_level?  The rmap update
should be done already by kvm_mmu_prepare_zap_page (via
kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).

Alex, can you replace it with just "flush = false;"?

Thanks,

Paolo


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 20:37             ` Paolo Bonzini
@ 2019-08-13 21:14               ` Alex Williamson
  2019-08-13 21:15                 ` Paolo Bonzini
  2019-08-15 14:46                 ` Sean Christopherson
  0 siblings, 2 replies; 38+ messages in thread
From: Alex Williamson @ 2019-08-13 21:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 22:37:14 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/08/19 22:19, Sean Christopherson wrote:
> > Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> > all entries for a given gfn.  The sp->gfn check is there to skip entries
> > that hashed to the same list but for a completely different gfn.
> > 
> > Skipping the gfn check would be sort of a lightweight zap all in the
> > sense that it would zap shadow pages that happened to collide with the
> > target memslot/gfn but are otherwise unrelated.
> > 
> > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> > 
> > 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> > 		continue;

Not having any luck with this yet.  Tried 0x80000, 0x8xxxxx, 0.
 
> > If that doesn't work, it might be worth trying other gfns to see if you
> > can pinpoint which sp is being zapped as collateral damage.
> > 
> > It's possible there is a pre-existing bug somewhere else that was being
> > hidden because KVM was effectively zapping all SPTEs during (re)boot,
> > and the hash collision is also hiding the bug by zapping the stale entry.
> > 
> > Of course it's also possible this code is wrong, :-)  
> 
> Also, can you reproduce it with one vCPU?  This could (though not really
> 100%) distinguish a missing invalidation from a race condition.

That's a pretty big change; I'll give it a shot, but I'm not sure how
conclusive it would be.

> Do we even need the call to slot_handle_all_level?  The rmap update
> should be done already by kvm_mmu_prepare_zap_page (via
> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
> 
> Alex, can you replace it with just "flush = false;"?

Replace the continue w/ flush = false?  I'm not clear on this
suggestion.  Thanks,

Alex


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 21:14               ` Alex Williamson
@ 2019-08-13 21:15                 ` Paolo Bonzini
  2019-08-13 22:10                   ` Alex Williamson
  2019-08-15 14:46                 ` Sean Christopherson
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2019-08-13 21:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Sean Christopherson, Radim Krčmář, kvm, Xiao Guangrong

On 13/08/19 23:14, Alex Williamson wrote:
>> Do we even need the call to slot_handle_all_level?  The rmap update
>> should be done already by kvm_mmu_prepare_zap_page (via
>> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
>>
>> Alex, can you replace it with just "flush = false;"?
> Replace the continue w/ flush = false?  I'm not clear on this
> suggestion.  Thanks,

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..382b3ee303e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm
 	if (list_empty(&kvm->arch.active_mmu_pages))
 		goto out_unlock;

-	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
+	flush = false;

 	for (i = 0; i < slot->npages; i++) {
 		gfn = slot->base_gfn + i;



* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 21:15                 ` Paolo Bonzini
@ 2019-08-13 22:10                   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2019-08-13 22:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 23:15:38 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/08/19 23:14, Alex Williamson wrote:
> >> Do we even need the call to slot_handle_all_level?  The rmap update
> >> should be done already by kvm_mmu_prepare_zap_page (via
> >> kvm_mmu_page_unlink_children -> mmu_page_zap_pte -> drop_spte).
> >>
> >> Alex, can you replace it with just "flush = false;"?  
> > Replace the continue w/ flush = false?  I'm not clear on this
> > suggestion.  Thanks,  
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 24843cf49579..382b3ee303e3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5664,7 +5668,7 @@ kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm
>  	if (list_empty(&kvm->arch.active_mmu_pages))
>  		goto out_unlock;
> 
> -	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
> +	flush = false;
> 
>  	for (i = 0; i < slot->npages; i++) {
>  		gfn = slot->base_gfn + i;
> 

Things are not happy with that; it managed to crash the host:

[  871.244789] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  871.258680] #PF: supervisor read access in kernel mode
[  871.268939] #PF: error_code(0x0000) - not-present page
[  871.279202] PGD 0 P4D 0 
[  871.284264] Oops: 0000 [#1] SMP PTI
[  871.291232] CPU: 3 PID: 1954 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
[  871.304264] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[  871.322523] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
[  871.332085] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
[  871.369560] RSP: 0018:ffffc10300a0bb18 EFLAGS: 00010202
[  871.379995] RAX: 00000000000c1040 RBX: ffffc103007cd000 RCX: 0000000000000014
[  871.394243] RDX: ffff9b3bbff18130 RSI: 00000000000c1080 RDI: 0000000000000004
[  871.408491] RBP: ffff9b3bced15400 R08: ffff9b3b94a90008 R09: ffff9b3b94a90000
[  871.422739] R10: ffff9b3b94a90168 R11: 0000000000000004 R12: ffff9b3bbff18130
[  871.436986] R13: ffffc103007cd000 R14: 0000000000000000 R15: 00000000000c1000
[  871.451234] FS:  00007fc27b37d700(0000) GS:ffff9b3f9f4c0000(0000) knlGS:0000000000000000
[  871.467390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  871.478865] CR2: 0000000000000000 CR3: 000000003bb2a004 CR4: 00000000001626e0
[  871.493113] Call Trace:
[  871.498012]  drop_spte+0x77/0xa0 [kvm]
[  871.505500]  mmu_page_zap_pte+0xac/0xe0 [kvm]
[  871.514200]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
[  871.524809]  kvm_mmu_invalidate_zap_pages_in_memslot+0xb7/0x130 [kvm]
[  871.537670]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
[  871.547928]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
[  871.558191]  kvm_set_memory_region+0x26/0x40 [kvm]
[  871.567759]  kvm_vm_ioctl+0x59a/0x940 [kvm]
[  871.576106]  ? __seccomp_filter+0x7a/0x680
[  871.584287]  do_vfs_ioctl+0xa4/0x630
[  871.591430]  ? security_file_ioctl+0x32/0x50
[  871.599954]  ksys_ioctl+0x60/0x90
[  871.606575]  __x64_sys_ioctl+0x16/0x20
[  871.614066]  do_syscall_64+0x5f/0x1a0
[  871.621379]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

It also failed with the 1 vcpu test.  Thanks,

Alex


* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 21:14               ` Alex Williamson
  2019-08-13 21:15                 ` Paolo Bonzini
@ 2019-08-15 14:46                 ` Sean Christopherson
  1 sibling, 0 replies; 38+ messages in thread
From: Sean Christopherson @ 2019-08-15 14:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

[-- Attachment #1: Type: text/plain, Size: 1208 bytes --]

On Tue, Aug 13, 2019 at 03:14:17PM -0600, Alex Williamson wrote:
> On Tue, 13 Aug 2019 22:37:14 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > On 13/08/19 22:19, Sean Christopherson wrote:
> > > Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> > > all entries for a given gfn.  The sp->gfn check is there to skip entries
> > > that hashed to the same list but for a completely different gfn.
> > > 
> > > Skipping the gfn check would be sort of a lightweight zap all in the
> > > sense that it would zap shadow pages that happened to collide with the
> > > target memslot/gfn but are otherwise unrelated.
> > > 
> > > What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> > > 
> > > 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> > > 		continue;
> 
> Not having any luck with this yet.  Tried 0x80000, 0x8xxxxx, 0.

I've no idea if it would actually be interesting, but something to try
would be to zap only emulated mmio SPTEs (in addition to the memslot).
If that test passes then I think it might indicate a problem with device
enumeration as opposed to the mapping of the device itself ("think" and
"might" being the operative words).  Patch attached.

[-- Attachment #2: emulated_mmio_only.patch --]
[-- Type: text/x-diff, Size: 947 bytes --]

From df7d1bf7025c20703f71e126dc673ba58230605f Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson@intel.com>
Date: Thu, 15 Aug 2019 07:27:05 -0700
Subject: [PATCH] tmp

---
 arch/x86/kvm/mmu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..d5bbb7ed716f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5649,6 +5649,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	return alloc_mmu_pages(vcpu);
 }
 
+static void __kvm_mmu_zap_all(struct kvm *kvm, bool mmio_only);
+
 static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
@@ -5685,6 +5687,8 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
+
+	__kvm_mmu_zap_all(kvm, true);
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-13 20:19           ` Sean Christopherson
  2019-08-13 20:37             ` Paolo Bonzini
@ 2019-08-15 15:23             ` Alex Williamson
  2019-08-15 16:00               ` Sean Christopherson
  2019-08-19 16:03               ` Paolo Bonzini
  1 sibling, 2 replies; 38+ messages in thread
From: Alex Williamson @ 2019-08-15 15:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 13 Aug 2019 13:19:14 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote:
> > On Tue, 13 Aug 2019 11:57:37 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:  
> 
> > Could it be something with the gfn test:
> > 
> >                         if (sp->gfn != gfn)
> >                                 continue;
> > 
> > If I remove it, I can't trigger the misbehavior.  If I log it, I only
> > get hits on VM boot/reboot and some of the gfns look suspiciously like
> > they could be the assigned GPU BARs and maybe MSI mappings:
> > 
> >                (sp->gfn) != (gfn)  
> 
> Hits at boot/reboot makes sense, memslots get zapped when userspace
> removes a memory region/slot, e.g. remaps BARs and whatnot.
> 
> ...
>  
> > Is this gfn optimization correct?  Overzealous?  Doesn't account
> > correctly for something about MMIO mappings?  Thanks,  
> 
> Yes?  Shadow pages are stored in a hash table, for_each_valid_sp() walks
> all entries for a given gfn.  The sp->gfn check is there to skip entries
> that hashed to the same list but for a completely different gfn.
> 
> Skipping the gfn check would be sort of a lightweight zap all in the
> sense that it would zap shadow pages that happened to collide with the
> target memslot/gfn but are otherwise unrelated.
> 
> What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:
> 
> 	if (sp->gfn != gfn && sp->gfn != 0x80000)
> 		continue;
> 
> If that doesn't work, it might be worth trying other gfns to see if you
> can pinpoint which sp is being zapped as collateral damage.
> 
> It's possible there is a pre-existing bug somewhere else that was being
> hidden because KVM was effectively zapping all SPTEs during (re)boot,
> and the hash collision is also hiding the bug by zapping the stale entry.
> 
> Of course it's also possible this code is wrong, :-)

Ok, fun day of trying to figure out which ranges are relevant, I've
narrowed it down to all of these:

0xffe00
0xfee00
0xfec00
0xc1000
0x80a000
0x800000
0x100000

i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
0x80000 can take the continue branch without seeing bad behavior in the
VM.

The assigned GPU has BARs at GPAs:

0xc0000000-0xc0ffffff
0x800000000-0x808000000
0x808000000-0x809ffffff

And the assigned companion audio function is at GPA:

0xc1080000-0xc1083fff

Only one of those seems to align very well with a gfn base involved
here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
otherwise I don't find any other I/O devices coincident with the gfns
above.

I'm running the VM with 2MB hugepages, but I believe the issue still
occurs with standard pages.  When run with standard pages I see more
hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
the set above that cannot take the continue branch.  I don't know if
that means anything.

Any further ideas what to look for?  Thanks,

Alex

PS - I see the posted workaround patch, I'll test that in the interim.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-15 15:23             ` Alex Williamson
@ 2019-08-15 16:00               ` Sean Christopherson
  2019-08-15 18:16                 ` Alex Williamson
  2019-08-19 16:03               ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2019-08-15 16:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote:
> Ok, fun day of trying to figure out which ranges are relevant, I've
> narrowed it down to all of these:
> 
> 0xffe00
> 0xfee00
> 0xfec00

APIC and I/O APIC stuff

> 0xc1000

Assigned audio

> 0x80a000

?

> 0x800000

GPU BAR

> 0x100000

?

The APIC ranges are puzzling; I wouldn't expect their mappings to change.

> i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> 0x80000 can take the continue branch without seeing bad behavior in the
> VM.
> 
> The assigned GPU has BARs at GPAs:
> 
> 0xc0000000-0xc0ffffff
> 0x800000000-0x808000000
> 0x808000000-0x809ffffff
> 
> And the assigned companion audio function is at GPA:
> 
> 0xc1080000-0xc1083fff
> 
> Only one of those seems to align very well with a gfn base involved
> here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> otherwise I don't find any other I/O devices coincident with the gfns
> above.
> 
> I'm running the VM with 2MB hugepages, but I believe the issue still
> occurs with standard pages.  When run with standard pages I see more
> hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> the set above that cannot take the continue branch.  I don't know if
> that means anything.
> 
> Any further ideas what to look for?  Thanks,

Maybe try isolating which memslot removal causes problems?  E.g. flush
the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush
only the memslot's gfns.  Based on the log you sent a while back for gfn
mismatches, I'm guessing the culprits are all GPU BARs, but it's
probably worth confirming.  That might also explain why gfn == 0x80000
can take the continue branch, i.e. if removing the corresponding memslot
is what's causing problems, then it's being flushed and not actually
taking the continue path.
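
A crude way to wire up that experiment (untested sketch; for
illustration the suspect base gfns are the ones Alex reports later in
the thread -- substitute whatever tracing turns up):

	static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
				struct kvm_memory_slot *slot,
				struct kvm_page_track_notifier_node *node)
	{
		/* example suspects only, i.e. the xxx/yyy/zzz above */
		if (slot->base_gfn == 0x800000 || slot->base_gfn == 0x808000 ||
		    slot->base_gfn == 0xc0089) {
			kvm_mmu_zap_all(kvm);	/* big hammer for suspects */
			return;
		}

		/* ... existing memslot-only zap ... */
	}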

One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
Maybe it's a case where there are no SPTEs for the memslot, but the TLB
flush is needed for some reason.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-15 16:00               ` Sean Christopherson
@ 2019-08-15 18:16                 ` Alex Williamson
  2019-08-15 19:25                   ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Alex Williamson @ 2019-08-15 18:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Thu, 15 Aug 2019 09:00:06 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Thu, Aug 15, 2019 at 09:23:24AM -0600, Alex Williamson wrote:
> > Ok, fun day of trying to figure out which ranges are relevant, I've
> > narrowed it down to all of these:
> > 
> > 0xffe00
> > 0xfee00
> > 0xfec00  
> 
> APIC and I/O APIC stuff
> 
> > 0xc1000  
> 
> Assigned audio
> 
> > 0x80a000  
> 
> ?
> 
> > 0x800000  
> 
> GPU BAR
> 
> > 0x100000  
> 
> ?
> 
> The APIC ranges are puzzling; I wouldn't expect their mappings to change.
> 
> > i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> > 0x80000 can take the continue branch without seeing bad behavior in the
> > VM.
> > 
> > The assigned GPU has BARs at GPAs:
> > 
> > 0xc0000000-0xc0ffffff
> > 0x800000000-0x808000000
> > 0x808000000-0x809ffffff
> > 
> > And the assigned companion audio function is at GPA:
> > 
> > 0xc1080000-0xc1083fff
> > 
> > Only one of those seems to align very well with a gfn base involved
> > here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> > otherwise I don't find any other I/O devices coincident with the gfns
> > above.
> > 
> > I'm running the VM with 2MB hugepages, but I believe the issue still
> > occurs with standard pages.  When run with standard pages I see more
> > hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> > the set above that cannot take the continue branch.  I don't know if
> > that means anything.
> > 
> > Any further ideas what to look for?  Thanks,  
> 
> Maybe try isolating which memslot removal causes problems?  E.g. flush
> the affected ranges if base_gfn == (xxx || yyy || zzz), otherwise flush
> only the memslot's gfns.  Based on the log you sent a while back for gfn
> mismatches, I'm guessing the culprits are all GPU BARs, but it's
> probably worth confirming.  That might also explain why gfn == 0x80000
> can take the continue branch, i.e. if removing the corresponding memslot
> is what's causing problems, then it's being flushed and not actually
> taking the continue path.

If I print out the memslot base_gfn, it seems pretty evident that only
the assigned device mappings are triggering this branch.  The base_gfns
exclusively include:

 0x800000
 0x808000
 0xc0089

Where the first two clearly match the 64bit BARs and the last is the
result of a page that we need to emulate within the BAR @0xc0000000 at
offset 0x88000, so the base_gfn is the remaining direct mapping.
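
Spelled out, with gfn = GPA >> 12:

	0xc0000000 >> 12          = 0xc0000   /* BAR base gfn */
	0xc0000 + (0x88000 >> 12) = 0xc0088   /* emulated page's gfn */
	0xc0088 + 1               = 0xc0089   /* base_gfn of the remaining slot */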

I don't know if this implies we're doing something wrong for assigned
device slots, but maybe a more targeted workaround would be if we could
specifically identify these slots, though there's no special
registration of them versus other slots.  Did you have any non-device
assignment test cases that took this branch when developing the series?

> One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
> e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
> Maybe it's a case where there are no SPTEs for the memslot, but the TLB
> flush is needed for some reason.

This doesn't work.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-15 18:16                 ` Alex Williamson
@ 2019-08-15 19:25                   ` Sean Christopherson
  2019-08-15 20:11                     ` Alex Williamson
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2019-08-15 19:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote:
> On Thu, 15 Aug 2019 09:00:06 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> If I print out the memslot base_gfn, it seems pretty evident that only
> the assigned device mappings are triggering this branch.  The base_gfns
> exclusively include:
> 
>  0x800000
>  0x808000
>  0xc0089
> 
> Where the first two clearly match the 64bit BARs and the last is the
> result of a page that we need to emulate within the BAR @0xc0000000 at
> offset 0x88000, so the base_gfn is the remaining direct mapping.

That's consistent with my understanding of userspace, e.g. normal memory
regions aren't deleted until the VM is shut down (barring hot unplug).

> I don't know if this implies we're doing something wrong for assigned
> device slots, but maybe a more targeted workaround would be if we could
> specifically identify these slots, though there's no special
> registration of them versus other slots.

What is triggering the memslot removal/update?  Is it possible that
whatever action is occurring is modifying multiple memslots?  E.g. KVM's
memslot-only zapping is allowing the guest to access stale entries for
the unzapped-but-related memslots, whereas the full zap does not.

FYI, my VFIO/GPU/PCI knowledge is abysmal, please speak up if any of my
ideas are nonsensical.

> Did you have any non-device
> assignment test cases that took this branch when developing the series?

The primary testing was performance oriented, using a slightly modified
version of a synthetic benchmark[1] from a previous series[2] that touched
the memslot flushing flow.  From a functional perspective, I highly doubt
that test would have been able to expose an improper zapping bug.

We do have some amount of coverage via kvm-unit-tests, as an EPT test was
triggering a slab bug due to not actually zapping the collected SPTEs[3].

[1] http://lkml.iu.edu/hypermail/linux/kernel/1305.2/00277/mmtest.tar.bz2
[2] https://lkml.kernel.org/r/1368706673-8530-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
[3] https://patchwork.kernel.org/patch/10899283/

> > One other thought would be to force a call to kvm_flush_remote_tlbs(kvm),
> > e.g. set flush=true just before the final kvm_mmu_remote_flush_or_zap().
> > Maybe it's a case where there are no SPTEs for the memslot, but the TLB
> > flush is needed for some reason.
> 
> This doesn't work.  Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-15 19:25                   ` Sean Christopherson
@ 2019-08-15 20:11                     ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2019-08-15 20:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Thu, 15 Aug 2019 12:25:31 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote:
> > On Thu, 15 Aug 2019 09:00:06 -0700
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > If I print out the memslot base_gfn, it seems pretty evident that only
> > the assigned device mappings are triggering this branch.  The base_gfns
> > exclusively include:
> > 
> >  0x800000
> >  0x808000
> >  0xc0089
> > 
> > Where the first two clearly match the 64bit BARs and the last is the
> > result of a page that we need to emulate within the BAR @0xc0000000 at
> > offset 0x88000, so the base_gfn is the remaining direct mapping.  
> 
> That's consistent with my understanding of userspace, e.g. normal memory
> regions aren't deleted until the VM is shut down (barring hot unplug).
> 
> > I don't know if this implies we're doing something wrong for assigned
> > device slots, but maybe a more targeted workaround would be if we could
> > specifically identify these slots, though there's no special
> > registration of them versus other slots.  
> 
> What is triggering the memslot removal/update?  Is it possible that
> whatever action is occurring is modifying multiple memslots?  E.g. KVM's
> memslot-only zapping is allowing the guest to access stale entries for
> the unzapped-but-related memslots, whereas the full zap does not.
> 
> FYI, my VFIO/GPU/PCI knowledge is abysmal, please speak up if any of my
> ideas are nonsensical.

The memory bit in the PCI command register of config space for each
device controls whether the device decoders are active for the MMIO BAR
ranges.  These get flipped as both the guest firmware and guest OS
enumerate and assign resources to the PCI subsystem.  Generally these
are not further manipulated while the guest OS is running except for
hotplug operations.  The guest OS device driver will generally perform
the final enable of these ranges and they'll remain enabled until the
guest is rebooted.

I recall somewhere in this thread you referenced reading the ROM as
part of the performance testing of this series.  The ROM has its own
enable bit within the ROM BAR as the PCI spec allows devices to share
decoders between the standard BARs and the ROM BAR.  Enabling and
disabling the enable bit in the ROM BAR should be very similar in
memslot behavior to the overall memory enable bit for the other BARs
within the device.
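
For reference, both enable bits are visible from userspace via sysfs
config space -- a minimal sketch (offsets are from the PCI spec; the
device address is made up):

	/* print the memory decode and ROM enable bits for one device */
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* hypothetical device address */
		const char *cfg = "/sys/bus/pci/devices/0000:01:00.0/config";
		uint16_t cmd = 0;
		uint32_t rom = 0;
		int fd = open(cfg, O_RDONLY);

		if (fd < 0)
			return 1;
		pread(fd, &cmd, sizeof(cmd), 0x04);	/* PCI_COMMAND */
		pread(fd, &rom, sizeof(rom), 0x30);	/* PCI_ROM_ADDRESS */
		printf("memory decode: %d, ROM enable: %d\n",
		       !!(cmd & 0x2), !!(rom & 0x1));
		close(fd);
		return 0;
	}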

Note that often the faults that I'm seeing occur a long time after BAR
mappings are finalized, usually (not always) the VM boots to a fully
functional desktop and it's only as I run various tests do the glitches
start to appear.  For instance, when I allowed sp->gfn 0xfec00 to take
the continue branch, I got an OpenCL error.  For either 0xffee00 or
0xc1000 I got graphics glitches, for example stray geometric artifacts
flashed on the screen.  For 0x100000 and 0x800000 I'd get a black
screen or blank 3D graphics window.  For 0x80a000 the VM froze
(apparently).  I can't say whether each of these is a consistent failure
mode, I only tested to the point of determining whether a range
generates an error.

> > Did you have any non-device
> > assignment test cases that took this branch when developing the series?  
> 
> The primary testing was performance oriented, using a slightly modified
> version of a synthetic benchmark[1] from a previous series[2] that touched
> the memslot flushing flow.  From a functional perspective, I highly doubt
> that test would have been able to expose an improper zapping bug.

:-\

It seems like there's probably some sort of inflection point where it
becomes faster to zap all pages versus the overhead of walking every
page in a memory slot; was that evaluated?  Not sure if that's relevant
here, but curious.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-15 15:23             ` Alex Williamson
  2019-08-15 16:00               ` Sean Christopherson
@ 2019-08-19 16:03               ` Paolo Bonzini
  2019-08-20 20:03                 ` Sean Christopherson
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2019-08-19 16:03 UTC (permalink / raw)
  To: Alex Williamson, Sean Christopherson
  Cc: Radim Krčmář, kvm, Xiao Guangrong

On 15/08/19 17:23, Alex Williamson wrote:
> 0xffe00
> 0xfee00
> 0xfec00
> 0xc1000
> 0x80a000
> 0x800000
> 0x100000
> 
> i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> 0x80000 can take the continue branch without seeing bad behavior in the
> VM.
> 
> The assigned GPU has BARs at GPAs:
> 
> 0xc0000000-0xc0ffffff
> 0x800000000-0x808000000
> 0x808000000-0x809ffffff
> 
> And the assigned companion audio function is at GPA:
> 
> 0xc1080000-0xc1083fff
> 
> Only one of those seems to align very well with a gfn base involved
> here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> otherwise I don't find any other I/O devices coincident with the gfns
> above.

The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00.  The
audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000
includes both userspace-MMIO and device-MMIO memory.  The virtio-net BAR
is also userspace-MMIO.

It seems like the problem occurs when the sp->gfn you "continue over"
includes a userspace-MMIO gfn.  But since I have no better ideas right
now, I'm going to apply the revert (we don't know for sure that it only
happens with assigned devices).

Paolo

> I'm running the VM with 2MB hugepages, but I believe the issue still
> occurs with standard pages.  When run with standard pages I see more
> hits to gfn values 0, 0x40000, 0x80000, but the same number of hits to
> the set above that cannot take the continue branch.  I don't know if
> that means anything.
> 
> Any further ideas what to look for?  Thanks,
> 
> Alex
> 
> PS - I see the posted workaround patch, I'll test that in the interim.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-19 16:03               ` Paolo Bonzini
@ 2019-08-20 20:03                 ` Sean Christopherson
  2019-08-20 20:42                   ` Alex Williamson
  2020-06-26 17:32                   ` Sean Christopherson
  0 siblings, 2 replies; 38+ messages in thread
From: Sean Christopherson @ 2019-08-20 20:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong

[-- Attachment #1: Type: text/plain, Size: 2853 bytes --]

On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote:
> On 15/08/19 17:23, Alex Williamson wrote:
> > 0xffe00
> > 0xfee00
> > 0xfec00
> > 0xc1000
> > 0x80a000
> > 0x800000
> > 0x100000
> > 
> > i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> > 0x80000 can take the continue branch without seeing bad behavior in the
> > VM.
> > 
> > The assigned GPU has BARs at GPAs:
> > 
> > 0xc0000000-0xc0ffffff
> > 0x800000000-0x808000000
> > 0x808000000-0x809ffffff
> > 
> > And the assigned companion audio function is at GPA:
> > 
> > 0xc1080000-0xc1083fff
> > 
> > Only one of those seems to align very well with a gfn base involved
> > here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> > otherwise I don't find any other I/O devices coincident with the gfns
> > above.
> 
> The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00.  The
> audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000
> includes both userspace-MMIO and device-MMIO memory.  The virtio-net BAR
> is also userspace-MMIO.
> 
> It seems like the problem occurs when the sp->gfn you "continue over"
> includes a userspace-MMIO gfn.  But since I have no better ideas right
> now, I'm going to apply the revert (we don't know for sure that it only
> happens with assigned devices).

After many hours of staring, I've come to the conclusion that
kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e.
a revert is definitely in order for 5.3 and stable.

mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of
the shadow ptes, e.g. for_each_valid_sp() will find page tables, page
directories, etc...  Passing in the raw gfns of the memslot doesn't work
because the gfn isn't properly adjusted/aligned to match how KVM tracks
gfns for shadow pages, e.g. removal of the companion audio memslot that
occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find
the shadow page table containing the relevant sptes.
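
A toy illustration of the alignment (not actual KVM code): a level 1 sp
covers 512 gfns, so its sp->gfn is the 2MB-aligned base of that range:

	gfn_t gfn    = 0xc1080;		/* audio BAR gfn */
	gfn_t sp_gfn = gfn & ~0x1ffULL;	/* = 0xc1000, the sp's hash key */
	/* querying the hash with the raw 0xc1080 never finds this sp */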

This is why Paolo's suggestion to remove slot_handle_all_level() on
kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's
accounting without actually zapping the relevant sptes.

All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
were sensitive to (lack of) zapping.  My theory is that zapping what were
effectively random-but-interesting shadow pages cleaned things up enough
to avoid noticeable badness.


Alex,

Can you please test the attached patch?  It implements a very slimmed down
version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
pointing at the memslot being removed, which was the original intent of
kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
crashes the host.  I'm hopeful it's correct, but given how broken the
previous version was, I'm not exactly confident.

[-- Attachment #2: 0001-KVM-x86-MMU-Rewrite-zapping-pages-in-memslot-to-fix-.patch --]
[-- Type: text/x-diff, Size: 2936 bytes --]

From 90e4b3dcb33a52da267f1099289ec7b71ad6a975 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson@intel.com>
Date: Tue, 20 Aug 2019 12:00:13 -0700
Subject: [PATCH] KVM: x86/MMU: Rewrite zapping pages in memslot to fix major
 flaws

TODO: write a proper changelog if this actually works

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu.c | 58 +++++++++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..c91c3472821b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5653,37 +5653,47 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
-	struct kvm_mmu_page *sp;
+	struct kvm_mmu_page *sp, *tmp;
 	LIST_HEAD(invalid_list);
-	unsigned long i;
-	bool flush;
-	gfn_t gfn;
+	int max_level, ign;
+	gfn_t gfn_mask;
+
+	/*
+	 * Zapping upper level shadow pages isn't required, worst case scenario
+	 * we'll have unused shadow pages with no children that aren't zapped
+	 * until they're recycled due to age or when the VM is destroyed.  Skip
+	 * shadow pages that can't point directly at the memslot.
+	 */
+	max_level = kvm_largepages_enabled() ? kvm_x86_ops->get_lpage_level() :
+					       PT_PAGE_TABLE_LEVEL;
+	while (slot->npages < KVM_PAGES_PER_HPAGE(max_level))
+		--max_level;
 
 	spin_lock(&kvm->mmu_lock);
+restart:
+	list_for_each_entry_safe(sp, tmp, &kvm->arch.active_mmu_pages, link) {
+		if (sp->role.level > max_level)
+			continue;
+		if (sp->role.invalid && sp->root_count)
+			continue;
 
-	if (list_empty(&kvm->arch.active_mmu_pages))
-		goto out_unlock;
+		/*
+		 * Note, the mask is calculated using level+1.  We're looking
+		 * for shadow pages with sptes that point at the to-be-removed
+		 * memslot, not just for shadow pages residing in the memslot.
+		 */
+		gfn_mask = ~(KVM_PAGES_PER_HPAGE(sp->role.level + 1) - 1);
+		if (sp->gfn < (slot->base_gfn & gfn_mask) ||
+		    sp->gfn > ((slot->base_gfn + slot->npages - 1) & gfn_mask))
+			continue;
 
-	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
+		if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
+			goto restart;
 
-	for (i = 0; i < slot->npages; i++) {
-		gfn = slot->base_gfn + i;
-
-		for_each_valid_sp(kvm, sp, gfn) {
-			if (sp->gfn != gfn)
-				continue;
-
-			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
-		}
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
-			flush = false;
-			cond_resched_lock(&kvm->mmu_lock);
-		}
+		if (cond_resched_lock(&kvm->mmu_lock))
+			goto restart;
 	}
-	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
-
-out_unlock:
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 	spin_unlock(&kvm->mmu_lock);
 }
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-20 20:03                 ` Sean Christopherson
@ 2019-08-20 20:42                   ` Alex Williamson
  2019-08-20 21:02                     ` Sean Christopherson
  2020-06-26 17:32                   ` Sean Christopherson
  1 sibling, 1 reply; 38+ messages in thread
From: Alex Williamson @ 2019-08-20 20:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 20 Aug 2019 13:03:19 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote:
> > On 15/08/19 17:23, Alex Williamson wrote:
> > > 0xffe00
> > > 0xfee00
> > > 0xfec00
> > > 0xc1000
> > > 0x80a000
> > > 0x800000
> > > 0x100000
> > > 
> > > i.e. I can effectively only say that sp->gfn values of 0x0, 0x40000, and
> > > 0x80000 can take the continue branch without seeing bad behavior in the
> > > VM.
> > > 
> > > The assigned GPU has BARs at GPAs:
> > > 
> > > 0xc0000000-0xc0ffffff
> > > 0x800000000-0x808000000
> > > 0x808000000-0x809ffffff
> > > 
> > > And the assigned companion audio function is at GPA:
> > > 
> > > 0xc1080000-0xc1083fff
> > > 
> > > Only one of those seems to align very well with a gfn base involved
> > > here.  The virtio ethernet has an mmio range at GPA 0x80a000000,
> > > otherwise I don't find any other I/O devices coincident with the gfns
> > > above.
> > 
> > The IOAPIC and LAPIC are respectively gfn 0xfec00 and 0xfee00.  The
> > audio function BAR is only 16 KiB, so the 2 MiB PDE starting at 0xc1000
> > includes both userspace-MMIO and device-MMIO memory.  The virtio-net BAR
> > is also userspace-MMIO.
> > 
> > It seems like the problem occurs when the sp->gfn you "continue over"
> > includes a userspace-MMIO gfn.  But since I have no better ideas right
> > now, I'm going to apply the revert (we don't know for sure that it only
> > happens with assigned devices).
> 
> After many hours of staring, I've come to the conclusion that
> kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e.
> a revert is definitely in order for 5.3 and stable.
> 
> mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of
> the shadow ptes, e.g. for_each_valid_sp() will find page tables, page
> directories, etc...  Passing in the raw gfns of the memslot doesn't work
> because the gfn isn't properly adjusted/aligned to match how KVM tracks
> gfns for shadow pages, e.g. removal of the companion audio memslot that
> occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find
> the shadow page table containing the relevant sptes.
> 
> This is why Paolo's suggestion to remove slot_handle_all_level() on
> kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's
> accounting without actually zapping the relevant sptes.
> 
> All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
> were sensitive to (lack of) zapping.  My theory is that zapping what were
> effectively random-but-interesting shadow pages cleaned things up enough
> to avoid noticeable badness.
> 
> 
> Alex,
> 
> Can you please test the attached patch?  It implements a very slimmed down
> version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
> pointing at the memslot being removed, which was the original intent of
> kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
> crashes the host.  I'm hopeful it's correct, but given how broken the
> previous version was, I'm not exactly confident.

It doesn't crash the host, but the guest is not happy, failing to boot
the desktop in one case and triggering errors in the guest w/o even
running test programs in another case.  Seems like it might be worse
than previous.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-20 20:42                   ` Alex Williamson
@ 2019-08-20 21:02                     ` Sean Christopherson
  2019-08-21 19:08                       ` Alex Williamson
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2019-08-20 21:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote:
> On Tue, 20 Aug 2019 13:03:19 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
> > were sensitive to (lack of) zapping.  My theory is that zapping what were
> > effectively random-but-interesting shadow pages cleaned things up enough
> > to avoid noticeable badness.
> > 
> > 
> > Alex,
> > 
> > Can you please test the attached patch?  It implements a very slimmed down
> > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
> > pointing at the memslot being removed, which was the original intent of
> > kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
> > crashes the host.  I'm hopeful it's correct, but given how broken the
> > previous version was, I'm not exactly confident.
> 
> It doesn't crash the host, but the guest is not happy, failing to boot
> the desktop in one case and triggering errors in the guest w/o even
> running test programs in another case.  Seems like it might be worse
> than previous.  Thanks,

Hrm, I'm back to being completely flummoxed.

Would you be able to generate a trace of all events/kvmmmu, using the
latest patch?  I'd like to rule out a stupid code bug if it's not too
much trouble.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-20 21:02                     ` Sean Christopherson
@ 2019-08-21 19:08                       ` Alex Williamson
  2019-08-21 19:35                         ` Alex Williamson
  2019-08-21 20:10                         ` Sean Christopherson
  0 siblings, 2 replies; 38+ messages in thread
From: Alex Williamson @ 2019-08-21 19:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Tue, 20 Aug 2019 14:02:45 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote:
> > On Tue, 20 Aug 2019 13:03:19 -0700
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:  
> > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
> > > were sensitive to (lack of) zapping.  My theory is that zapping what were
> > > effectively random-but-interesting shadow pages cleaned things up enough
> > > to avoid noticeable badness.
> > > 
> > > 
> > > Alex,
> > > 
> > > Can you please test the attached patch?  It implements a very slimmed down
> > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
> > > pointing at the memslot being removed, which was the original intent of
> > > kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
> > > crashes the host.  I'm hopeful it's correct, but given how broken the
> > > previous version was, I'm not exactly confident.  
> > 
> > It doesn't crash the host, but the guest is not happy, failing to boot
> > the desktop in one case and triggering errors in the guest w/o even
> > running test programs in another case.  Seems like it might be worse
> > than previous.  Thanks,  
> 
> Hrm, I'm back to being completely flummoxed.
> 
> Would you be able to generate a trace of all events/kvmmmu, using the
> latest patch?  I'd like to rule out a stupid code bug if it's not too
> much trouble.

I tried to simplify the patch, making it closer to zap_all, so I
removed the max_level calculation and exclusion based on sp->role.level,
as well as the gfn range filtering.  For good measure I even removed
the sp->root_count test, so any sp not marked invalid is zapped.  This
works, and I can also add back the sp->root_count test and things
remain working.

From there I added back the gfn range test, but I left out the gfn_mask
because I'm not doing the level filtering and I think(?) this is just
another optimization.  So essentially I only add:

	if (sp->gfn < slot->base_gfn ||
            sp->gfn > (slot->base_gfn + slot->npages - 1))
		continue;

Not only does this not work, the host will sometimes oops:

[  808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  808.555065] #PF: supervisor read access in kernel mode
[  808.565326] #PF: error_code(0x0000) - not-present page
[  808.575588] PGD 0 P4D 0 
[  808.580649] Oops: 0000 [#1] SMP PTI
[  808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
[  808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[  808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
[  808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
[  808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202
[  808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014
[  808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004
[  808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000
[  808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260
[  808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004
[  808.747620] FS:  00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000
[  808.763776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0
[  808.789499] Call Trace:
[  808.794399]  drop_spte+0x77/0xa0 [kvm]
[  808.801885]  mmu_page_zap_pte+0xac/0xe0 [kvm]
[  808.810587]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
[  808.821196]  kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm]
[  808.833881]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
[  808.844140]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
[  808.854402]  kvm_set_memory_region+0x26/0x40 [kvm]
[  808.863971]  kvm_vm_ioctl+0x59a/0x940 [kvm]
[  808.872318]  ? pagevec_lru_move_fn+0xb8/0xd0
[  808.880846]  ? __seccomp_filter+0x7a/0x680
[  808.889028]  do_vfs_ioctl+0xa4/0x630
[  808.896168]  ? security_file_ioctl+0x32/0x50
[  808.904695]  ksys_ioctl+0x60/0x90
[  808.911316]  __x64_sys_ioctl+0x16/0x20
[  808.918807]  do_syscall_64+0x5f/0x1a0
[  808.926121]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  808.936209] RIP: 0033:0x7f28ebf2b0fb

Does this suggest something is still fundamentally wrong with the
premise of this change or have I done something stupid?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-21 19:08                       ` Alex Williamson
@ 2019-08-21 19:35                         ` Alex Williamson
  2019-08-21 20:30                           ` Sean Christopherson
  2019-08-21 20:10                         ` Sean Christopherson
  1 sibling, 1 reply; 38+ messages in thread
From: Alex Williamson @ 2019-08-21 19:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Wed, 21 Aug 2019 13:08:59 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 20 Aug 2019 14:02:45 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > On Tue, Aug 20, 2019 at 02:42:04PM -0600, Alex Williamson wrote:  
> > > On Tue, 20 Aug 2019 13:03:19 -0700
> > > Sean Christopherson <sean.j.christopherson@intel.com> wrote:    
> > > > All that being said, it doesn't explain why gfns like 0xfec00 and 0xfee00
> > > > were sensitive to (lack of) zapping.  My theory is that zapping what were
> > > > effectively random-but-interesting shadow pages cleaned things up enough
> > > > to avoid noticeable badness.
> > > > 
> > > > 
> > > > Alex,
> > > > 
> > > > Can you please test the attached patch?  It implements a very slimmed down
> > > > version of kvm_mmu_zap_all() to zap only shadow pages that can hold sptes
> > > > pointing at the memslot being removed, which was the original intent of
> > > > kvm_mmu_invalidate_zap_pages_in_memslot().  I apologize in advance if it
> > > > crashes the host.  I'm hopeful it's correct, but given how broken the
> > > > previous version was, I'm not exactly confident.    
> > > 
> > > It doesn't crash the host, but the guest is not happy, failing to boot
> > > the desktop in one case and triggering errors in the guest w/o even
> > > running test programs in another case.  Seems like it might be worse
> > > than previous.  Thanks,    
> > 
> > Hrm, I'm back to being completely flummoxed.
> > 
> > Would you be able to generate a trace of all events/kvmmmu, using the
> > latest patch?  I'd like to rule out a stupid code bug if it's not too
> > much trouble.  
> 
> I tried to simplify the patch, making it closer to zap_all, so I
> removed the max_level calculation and exclusion based on sp->role.level,
> as well as the gfn range filtering.  For good measure I even removed
> the sp->root_count test, so any sp not marked invalid is zapped.  This
> works, and I can also add back the sp->root_count test and things
> remain working.
> 
> From there I added back the gfn range test, but I left out the gfn_mask
> because I'm not doing the level filtering and I think(?) this is just
> another optimization.  So essentially I only add:
> 
> 	if (sp->gfn < slot->base_gfn ||
>             sp->gfn > (slot->base_gfn + slot->npages - 1))
> 		continue;
> 
> Not only does this not work, the host will sometimes oops:
> 
> [  808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [  808.555065] #PF: supervisor read access in kernel mode
> [  808.565326] #PF: error_code(0x0000) - not-present page
> [  808.575588] PGD 0 P4D 0 
> [  808.580649] Oops: 0000 [#1] SMP PTI
> [  808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
> [  808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> [  808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
> [  808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
> [  808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202
> [  808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014
> [  808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004
> [  808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000
> [  808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260
> [  808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004
> [  808.747620] FS:  00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000
> [  808.763776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0
> [  808.789499] Call Trace:
> [  808.794399]  drop_spte+0x77/0xa0 [kvm]
> [  808.801885]  mmu_page_zap_pte+0xac/0xe0 [kvm]
> [  808.810587]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
> [  808.821196]  kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm]
> [  808.833881]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
> [  808.844140]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
> [  808.854402]  kvm_set_memory_region+0x26/0x40 [kvm]
> [  808.863971]  kvm_vm_ioctl+0x59a/0x940 [kvm]
> [  808.872318]  ? pagevec_lru_move_fn+0xb8/0xd0
> [  808.880846]  ? __seccomp_filter+0x7a/0x680
> [  808.889028]  do_vfs_ioctl+0xa4/0x630
> [  808.896168]  ? security_file_ioctl+0x32/0x50
> [  808.904695]  ksys_ioctl+0x60/0x90
> [  808.911316]  __x64_sys_ioctl+0x16/0x20
> [  808.918807]  do_syscall_64+0x5f/0x1a0
> [  808.926121]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  808.936209] RIP: 0033:0x7f28ebf2b0fb
> 
> Does this suggest something is still fundamentally wrong with the
> premise of this change or have I done something stupid?

Seems the latter, particularly your comment that we're looking for
pages pointing to the gfn range to be removed, not just those in the
range.  Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or
c0000-c0000, zapping zero or c0000, and I think one of the ones you
were looking for c1080-c1083 is reduce to c1000-c1000 and therefore
zaps sp->gfn c1000.  I'll keep looking.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-21 19:08                       ` Alex Williamson
  2019-08-21 19:35                         ` Alex Williamson
@ 2019-08-21 20:10                         ` Sean Christopherson
  2019-08-26  7:36                           ` Tian, Kevin
  2019-08-26 14:56                           ` Sean Christopherson
  1 sibling, 2 replies; 38+ messages in thread
From: Sean Christopherson @ 2019-08-21 20:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Wed, Aug 21, 2019 at 01:08:59PM -0600, Alex Williamson wrote:
> On Tue, 20 Aug 2019 14:02:45 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > Hrm, I'm back to being completely flummoxed.
> > 
> > Would you be able to generate a trace of all events/kvmmmu, using the
> > latest patch?  I'd like to rule out a stupid code bug if it's not too
> > much trouble.
> 
> I tried to simplify the patch, making it closer to zap_all, so I
> removed the max_level calculation and exclusion based on sp->role.level,
> as well as the gfn range filtering.  For good measure I even removed
> the sp->root_count test, so any sp not marked invalid is zapped.  This
> works, and I can also add back the sp->root_count test and things
> remain working.

The sp->root_count test is necessary to avoid an infinite loop when not
using tdp, e.g. EPT.  The root sp can't be moved off active_mmu_pages and
thus freed until all references to the root are gone.  When not using tdp,
the loop can restart in response to __kvm_mmu_prepare_zap_page(), even if
the root sp page was already invalid.

Functionally, skipping an invalid sp *shouldn't* change anything, as the
actual ptes have been zapped; it's only the kernel resources that aren't
yet freed.

TL;DR: completely expected that removing the sp->root_count test didn't
affect the result.
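
For reference, the guard in question, as it appears in the earlier
patch:

	/* an invalid root sp stays on active_mmu_pages until its last
	 * reference is dropped, so skip it instead of re-zapping (and
	 * restarting the walk) forever */
	if (sp->role.invalid && sp->root_count)
		continue;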

> From there I added back the gfn range test, but I left out the gfn_mask
> because I'm not doing the level filtering and I think(?) this is just
> another optimization.  So essentially I only add:
> 
> 	if (sp->gfn < slot->base_gfn ||
>             sp->gfn > (slot->base_gfn + slot->npages - 1))
> 		continue;

This doesn't work because of what sp->gfn contains.  The shadow page is
the page/table holding the spt entries, e.g. a level 1 sp refers to a page
table for entries defining 4k pages.  sp->gfn holds the base gfn of the
range of gfns covered by the sp, which allows finding all parent sps for a
given gfn.  E.g. for level 1 sp, sp->gfn is found by masking the gfn of a
given page by 0xFFFFFFFFFFFFFE00, level 2 sp by 0xFFFFFFFFFFFC0000, and so
on up the chain.
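
Equivalently, as a made-up helper (9 gfn bits per paging level with 4k
pages):

	static gfn_t sp_base_gfn(gfn_t gfn, int level)
	{
		/* level 1 masks the low 9 bits (0x...FE00), level 2 the
		 * low 18 bits (0x...C0000), matching the masks above */
		return gfn & ~((1ULL << (9 * level)) - 1);
	}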

This is what the original code botched, as it would only find all sps for
a memslot if the base of the memslot were 2mb aligned for 4k pages, 1gb
aligned for 2mb pages, etc...  This just happened to mostly work since
memslots are usually aligned to such requirements.  In the original code,
removing the "sp->gfn != gfn" check caused KVM to zap random sps that just
happened to hash to the same entry.  Likewise, omitting the gfn filter in
this code means everything gets zapped.

What I don't understand is why zapping everything, or at least userspace
MMIO addresses, is necessary when removing the GPU BAR memslot.  The
IOAPIC sp in particular makes no sense.  AIUI, the IOAPIC is always
emulated, i.e. not passed through, and its base/size is static (or at
least static after init).  Zapping an IOAPIC sp will send KVM down
different paths but AFAIK the final outcome should be unaffected.

> Not only does this not work, the host will sometimes oops:
> 
> [  808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [  808.555065] #PF: supervisor read access in kernel mode
> [  808.565326] #PF: error_code(0x0000) - not-present page
> [  808.575588] PGD 0 P4D 0 
> [  808.580649] Oops: 0000 [#1] SMP PTI
> [  808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
> [  808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> [  808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
> [  808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
> [  808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202
> [  808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014
> [  808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004
> [  808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000
> [  808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260
> [  808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004
> [  808.747620] FS:  00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000
> [  808.763776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0
> [  808.789499] Call Trace:
> [  808.794399]  drop_spte+0x77/0xa0 [kvm]
> [  808.801885]  mmu_page_zap_pte+0xac/0xe0 [kvm]
> [  808.810587]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
> [  808.821196]  kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm]
> [  808.833881]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
> [  808.844140]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
> [  808.854402]  kvm_set_memory_region+0x26/0x40 [kvm]
> [  808.863971]  kvm_vm_ioctl+0x59a/0x940 [kvm]
> [  808.872318]  ? pagevec_lru_move_fn+0xb8/0xd0
> [  808.880846]  ? __seccomp_filter+0x7a/0x680
> [  808.889028]  do_vfs_ioctl+0xa4/0x630
> [  808.896168]  ? security_file_ioctl+0x32/0x50
> [  808.904695]  ksys_ioctl+0x60/0x90
> [  808.911316]  __x64_sys_ioctl+0x16/0x20
> [  808.918807]  do_syscall_64+0x5f/0x1a0
> [  808.926121]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  808.936209] RIP: 0033:0x7f28ebf2b0fb
> 
> Does this suggest something is still fundamentally wrong with the
> premise of this change or have I done something stupid?  Thanks,

The NULL pointer thing is unexpected; it means we have a spte, i.e. the
actual entry seen/used by hardware, that KVM thinks is present but doesn't
have the expected KVM tracking.  I'll take a look, my understanding is
that zapping shadow pages at random shouldn't cause problems.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-21 19:35                         ` Alex Williamson
@ 2019-08-21 20:30                           ` Sean Christopherson
  2019-08-23  2:25                             ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2019-08-21 20:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote:
> On Wed, 21 Aug 2019 13:08:59 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> > Does this suggest something is still fundamentally wrong with the
> > premise of this change or have I done something stupid?
> 
> Seems the latter, particularly your comment that we're looking for
> pages pointing to the gfn range to be removed, not just those in the
> range.  Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or
> c0000-c0000, zapping zero or c0000, and I think one of the ones you
> were looking for c1080-c1083 is reduced to c1000-c1000 and therefore
> zaps sp->gfn c1000.  I'll keep looking.  Thanks,

Ya.  As far as where to look, at this point I don't think it's an issue of
incorrect zapping.  Not because I'm 100% confident the zapping logic is
correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and
not being able to exclude APIC/IOAPIC ranges, suggest that the badness is
'fixed' by zapping seemingly unrelated sps.

In other words, it may be fundamentally wrong to zap only the memslot
being removed, but I really want to know why.  History isn't helpful as
KVM has always zapped all pages when removing a memslot (on x86), and the
introduction of the per-memslot flush hook in commit

  2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")

was all about refactoring generic code, and doesn't have any information
on whether per-memslot flushing was actually tried for x86.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-21 20:30                           ` Sean Christopherson
@ 2019-08-23  2:25                             ` Sean Christopherson
  2019-08-23 22:05                               ` Alex Williamson
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2019-08-23  2:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Wed, Aug 21, 2019 at 01:30:41PM -0700, Sean Christopherson wrote:
> On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote:
> > On Wed, 21 Aug 2019 13:08:59 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > Does this suggest something is still fundamentally wrong with the
> > > premise of this change or have I done something stupid?
> > 
> > Seems the latter, particularly your comment that we're looking for
> > pages pointing to the gfn range to be removed, not just those in the
> > range.  Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or
> > c0000-c0000, zapping zero or c0000, and I think one of the ones you
> > were looking for c1080-c1083 is reduced to c1000-c1000 and therefore
> > zaps sp->gfn c1000.  I'll keep looking.  Thanks,
> 
> Ya.  As far as where to look, at this point I don't think it's an issue of
> incorrect zapping.  Not because I'm 100% confident the zapping logic is
> correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and
> not being able to exclude APIC/IOAPIC ranges, suggest that the badness is
> 'fixed' by zapping seemingly unrelated sps.
> 
> In other words, it may be fundamentally wrong to zap only the memslot
> being removed, but I really want to know why.  History isn't helpful as
> KVM has always zapped all pages when removing a memslot (on x86), and the
> introduction of the per-memslot flush hook in commit
> 
>   2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")
> 
> was all about refactoring generic code, and doesn't have any information
> on whether per-memslot flushing was actually tried for x86.

One semi-random idea would be to zap mmio pages, i.e. don't skip pages
for which sp->mmio_cached is true, regardless of their gfn or level.  I
don't expect it to make a difference, but it would shrink the haystack on
the off chance it does "fix" the issues.
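
On top of the earlier patch, that would be something like (untested):

	/* keep the sp->role.invalid/root_count check as-is; only bypass
	 * the gfn/level filters when the sp caches an MMIO spte */
	if (!sp->mmio_cached) {
		if (sp->role.level > max_level)
			continue;

		gfn_mask = ~(KVM_PAGES_PER_HPAGE(sp->role.level + 1) - 1);
		if (sp->gfn < (slot->base_gfn & gfn_mask) ||
		    sp->gfn > ((slot->base_gfn + slot->npages - 1) & gfn_mask))
			continue;
	}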

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-23  2:25                             ` Sean Christopherson
@ 2019-08-23 22:05                               ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2019-08-23 22:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Thu, 22 Aug 2019 19:25:02 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Wed, Aug 21, 2019 at 01:30:41PM -0700, Sean Christopherson wrote:
> > On Wed, Aug 21, 2019 at 01:35:04PM -0600, Alex Williamson wrote:  
> > > On Wed, 21 Aug 2019 13:08:59 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:  
> > > > Does this suggest something is still fundamentally wrong with the
> > > > premise of this change or have I done something stupid?  
> > > 
> > > Seems the latter, particularly your comment that we're looking for
> > > pages pointing to the gfn range to be removed, not just those in the
> > > range.  Slot gfn ranges like ffe00-ffe1f are getting reduced to 0-0 or
> > > c0000-c0000, zapping zero or c0000, and I think one of the ones you
> > > were looking for c1080-c1083 is reduced to c1000-c1000 and therefore
> > > zaps sp->gfn c1000.  I'll keep looking.  Thanks,  
> > 
> > Ya.  As far as where to look, at this point I don't think it's an issue of
> > incorrect zapping.  Not because I'm 100% confident the zapping logic is
> > correct, but because many of the tests, e.g. removing 'sp->gfn != gfn' and
> > not being able to exclude APIC/IOAPIC ranges, suggest that the badness is
> > 'fixed' by zapping seemingly unrelated sps.
> > 
> > In other words, it may be fundamentally wrong to zap only the memslot
> > being removed, but I really want to know why.  History isn't helpful as
> > KVM has always zapped all pages when removing a memslot (on x86), and the
> > introduction of the per-memslot flush hook in commit
> > 
> >   2df72e9bc4c5 ("KVM: split kvm_arch_flush_shadow")
> > 
> > was all about refactoring generic code, and doesn't have any information
> > on whether per-memslot flushing was actually tried for x86.  
> 
> One semi-random idea would be to zap mmio pages, i.e. don't skip pages
> for which sp->mmio_cached is true, regardless of their gfn or level.  I
> don't expect it to make a difference, but it would shrink the haystack on
> the off chance it does "fix" the issues.

You're right, it doesn't fix it.  All of the logging I've been staring
at suggests your patch does exactly what it's intended to do, but it
still breaks GPU assignment in weird ways.  I have no idea why.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-21 20:10                         ` Sean Christopherson
@ 2019-08-26  7:36                           ` Tian, Kevin
  2019-08-26 14:56                           ` Sean Christopherson
  1 sibling, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2019-08-26  7:36 UTC (permalink / raw)
  To: Christopherson, Sean J, Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

> From: Sean Christopherson
> Sent: Thursday, August 22, 2019 4:11 AM
> 
[...]
> 
> > From there I added back the gfn range test, but I left out the gfn_mask
> > because I'm not doing the level filtering and I think(?) this is just
> > another optimization.  So essentially I only add:
> >
> > 	if (sp->gfn < slot->base_gfn ||
> >             sp->gfn > (slot->base_gfn + slot->npages - 1))
> > 		continue;
> 
> This doesn't work because of what sp->gfn contains.  The shadow page is
> the page/table holding the spt entries, e.g. a level 1 sp refers to a page
> table for entries defining 4k pages.  sp->gfn holds the base gfn of the
> range of gfns covered by the sp, which allows finding all parent sps for a
> given gfn.  E.g. for level 1 sp, sp->gfn is found by masking the gfn of a
> given page by 0xFFFFFFFFFFFFFE00, level 2 sp by 0xFFFFFFFFFFFC0000, and so
> on up the chain.
> 
> This is what the original code botched, as it would only find all sps for
> a memslot if the base of the memslot were 2mb aligned for 4k pages, 1gb
> aligned for 2mb pages, etc...  This just happened to mostly work since
> memslots are usually aligned to such requirements.  In the original code,
> removing the "sp->gfn != gfn" check caused KVM to zap random sps that just
> happened to hash to the same entry.  Likewise, omitting the gfn filter in
> this code means everything gets zapped.
> 
> What I don't understand is why zapping everything, or at least userspace
> MMIO addresses, is necessary when removing the GPU BAR memslot.  The
> IOAPIC sp in particular makes no sense.  AIUI, the IOAPIC is always
> emulated, i.e. not passed through, and its base/size is static (or at
> least static after init).  Zapping an IOAPIC sp will send KVM down
> different paths but AFAIK the final outcome should be unaffected.
> 

I have the same feeling. the base address of IOAPIC is described by
ACPI MADT thus static. It is possible to reprogram LAPIC base but no
commodity OS does it today.

btw I found Alex mentioned gfn 0x100000 in an earlier post. This one is
usually the start of high memory. If true, it becomes even more weird,
i.e. not just an MMIO thing.
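
For reference, the masking Sean describes above works out to roughly the
following (illustrative sketch only, assuming 9 index bits per shadow
level; sp_base_gfn is a made-up name, not an upstream helper):

	/* Base gfn of the range covered by a shadow page at @level,
	 * i.e. mask by ~0x1ffull at level 1 and ~0x3ffffull at level 2,
	 * matching the masks quoted above. */
	static inline gfn_t sp_base_gfn(gfn_t gfn, int level)
	{
		return gfn & ~((1ULL << (9 * level)) - 1);
	}

E.g. sp_base_gfn(0xc1080, 1) == 0xc1000, which is why zapping a slot
spanning c1080-c1083 has to query gfn 0xc1000 to find its level 1 sp.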

Thanks
Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-21 20:10                         ` Sean Christopherson
  2019-08-26  7:36                           ` Tian, Kevin
@ 2019-08-26 14:56                           ` Sean Christopherson
  1 sibling, 0 replies; 38+ messages in thread
From: Sean Christopherson @ 2019-08-26 14:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Radim Krčmář, kvm, Xiao Guangrong

On Wed, Aug 21, 2019 at 01:10:43PM -0700, Sean Christopherson wrote:
> On Wed, Aug 21, 2019 at 01:08:59PM -0600, Alex Williamson wrote:
> > Not only does this not work, the host will sometimes oops:
> > 
> > [  808.541168] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [  808.555065] #PF: supervisor read access in kernel mode
> > [  808.565326] #PF: error_code(0x0000) - not-present page
> > [  808.575588] PGD 0 P4D 0 
> > [  808.580649] Oops: 0000 [#1] SMP PTI
> > [  808.587617] CPU: 3 PID: 1965 Comm: CPU 0/KVM Not tainted 5.3.0-rc4+ #4
> > [  808.600652] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> > [  808.618907] RIP: 0010:gfn_to_rmap+0xd9/0x120 [kvm]
> > [  808.628472] Code: c7 48 8d 0c 80 48 8d 04 48 4d 8d 14 c0 49 8b 02 48 39 c6 72 15 49 03 42 08 48 39 c6 73 0c 41 89 b9 08 b4 00 00 49 8b 3a eb 0b <48> 8b 3c 25 00 00 00 00 45 31 d2 0f b6 42 24 83 e0 0f 83 e8 01 8d
> > [  808.665945] RSP: 0018:ffffa888009a3b20 EFLAGS: 00010202
> > [  808.676381] RAX: 00000000000c1040 RBX: ffffa888007d5000 RCX: 0000000000000014
> > [  808.690628] RDX: ffff8eadd0708260 RSI: 00000000000c1080 RDI: 0000000000000004
> > [  808.704877] RBP: ffff8eadc3d11400 R08: ffff8ead97cf0008 R09: ffff8ead97cf0000
> > [  808.719124] R10: ffff8ead97cf0168 R11: 0000000000000004 R12: ffff8eadd0708260
> > [  808.733374] R13: ffffa888007d5000 R14: 0000000000000000 R15: 0000000000000004
> > [  808.747620] FS:  00007f28dab7c700(0000) GS:ffff8eb19f4c0000(0000) knlGS:0000000000000000
> > [  808.763776] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  808.775249] CR2: 0000000000000000 CR3: 000000003f508006 CR4: 00000000001626e0
> > [  808.789499] Call Trace:
> > [  808.794399]  drop_spte+0x77/0xa0 [kvm]
> > [  808.801885]  mmu_page_zap_pte+0xac/0xe0 [kvm]
> > [  808.810587]  __kvm_mmu_prepare_zap_page+0x69/0x350 [kvm]
> > [  808.821196]  kvm_mmu_invalidate_zap_pages_in_memslot+0x87/0xf0 [kvm]
> > [  808.833881]  kvm_page_track_flush_slot+0x55/0x80 [kvm]
> > [  808.844140]  __kvm_set_memory_region+0x821/0xaa0 [kvm]
> > [  808.854402]  kvm_set_memory_region+0x26/0x40 [kvm]
> > [  808.863971]  kvm_vm_ioctl+0x59a/0x940 [kvm]
> > [  808.872318]  ? pagevec_lru_move_fn+0xb8/0xd0
> > [  808.880846]  ? __seccomp_filter+0x7a/0x680
> > [  808.889028]  do_vfs_ioctl+0xa4/0x630
> > [  808.896168]  ? security_file_ioctl+0x32/0x50
> > [  808.904695]  ksys_ioctl+0x60/0x90
> > [  808.911316]  __x64_sys_ioctl+0x16/0x20
> > [  808.918807]  do_syscall_64+0x5f/0x1a0
> > [  808.926121]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  808.936209] RIP: 0033:0x7f28ebf2b0fb
> > 
> > Does this suggest something is still fundamentally wrong with the
> > premise of this change or have I done something stupid?  Thanks,
> 
> The NULL pointer thing is unexpected; it means we have a spte, i.e. the
> actual entry seen/used by hardware, that KVM thinks is present but doesn't
> have the expected KVM tracking.  I'll take a look, my understanding is
> that zapping shadow pages at random shouldn't cause problems.

The NULL pointer dereference is expected given the flawed implementation,
i.e. there isn't another bug lurking for that particular problem.  The
issue isn't zapping random sptes, but rather that the flawed logic leaves
dangling sptes.  When a different action, e.g. zapping all memslots,
triggers zapping of the dangling spte(s), gfn_to_rmap() attempts to find
the corresponding memslot and hits the above BUG because the memslot no
longer exists.
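
For reference, the lookup that blows up, roughly as it reads in the
5.3-era arch/x86/kvm/mmu.c (paraphrased and simplified, so take the
details with a grain of salt):

	static struct kvm_rmap_head *gfn_to_rmap(struct kvm *kvm, gfn_t gfn,
						 struct kvm_mmu_page *sp)
	{
		struct kvm_memslots *slots;
		struct kvm_memory_slot *slot;

		slots = kvm_memslots_for_spte_role(kvm, sp->role);
		/* The memslot is already gone, so @slot comes back NULL... */
		slot = __gfn_to_memslot(slots, gfn);
		/* ...and computing the rmap index off slot->base_gfn in
		 * __gfn_to_rmap() is the NULL pointer dereference above. */
		return __gfn_to_rmap(gfn, sp->role.level, slot);
	}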

On the flip side, not hitting that condition provides additional confidence
in the reworked flow, i.e. proves to some degree that it's zapping all
sptes in the to-be-removed memslot.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2019-08-20 20:03                 ` Sean Christopherson
  2019-08-20 20:42                   ` Alex Williamson
@ 2020-06-26 17:32                   ` Sean Christopherson
  2022-10-20 18:31                     ` Alexander Graf
  1 sibling, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2020-06-26 17:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alex Williamson, Radim Krčmář, kvm, Xiao Guangrong

/cast <thread necromancy>

On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote:
> > It seems like the problem occurs when the sp->gfn you "continue over"
> > includes a userspace-MMIO gfn.  But since I have no better ideas right
> > now, I'm going to apply the revert (we don't know for sure that it only
> > happens with assigned devices).
> 
> After many hours of staring, I've come to the conclusion that
> kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e.
> a revert is definitely in order for 5.3 and stable.

Whelp, past me was wrong.  The patch wasn't completely broken, as the rmap
zapping piece of kvm_mmu_invalidate_zap_pages_in_memslot() was sufficient
to satisfy removal of the memslot.  I.e. zapping leaf PTEs (including large
pages) should prevent referencing the old memslot, the fact that zapping
upper level shadow pages was broken was irrelevant because there should be
no need to zap non-leaf PTEs.

The one thing that sticks out as potentially concerning is passing %false
for @lock_flush_tlb.  Dropping mmu_lock during slot_handle_level_range()
without flushing would allow a vCPU to create and use a new entry while a
different vCPU has the old entry cached in its TLB.  I think that could
even happen with a single vCPU if the memslot is deleted by a helper task,
and the zapped page was a large page that was fractured into small pages
when inserted into the TLB.

TL;DR: Assuming no other bugs in the kernel, this should be sufficient if
the goal is simply to prevent usage of a memslot:

static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
			struct kvm_memory_slot *slot,
			struct kvm_page_track_notifier_node *node)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, true);
	if (flush)
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

> mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of
> the shadow ptes, e.g. for_each_valid_sp() will find page tables, page
> directories, etc...  Passing in the raw gfns of the memslot doesn't work
> because the gfn isn't properly adjusted/aligned to match how KVM tracks
> gfns for shadow pages, e.g. removal of the companion audio memslot that
> occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find
> the shadow page table containing the relevant sptes.
> 
> This is why Paolo's suggestion to remove slot_handle_all_level() on
> kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's
> accounting without actually zapping the relevant sptes.

This is just straight up wrong, zapping the rmaps does zap the leaf sptes.
The BUG() occurs because gfn_to_rmap() works on the _new_ memslots instance,
and if a memslot is being deleted there is no memslot for the gfn, hence the
NULL pointer bug when mmu_page_zap_pte() attempts to zap a PTE.  Zapping the
rmaps (leaf/last PTEs) first "fixes" the issue by making it so that
mmu_page_zap_pte() will never see a present PTE for a non-existent memslot.

I don't think any of this explains the pass-through GPU issue.  But, we
have a few use cases where zapping the entire MMU is undesirable, so I'm
going to retry upstreaming this patch with a per-VM opt-in.  I wanted to
set the record straight for posterity before doing so.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2020-06-26 17:32                   ` Sean Christopherson
@ 2022-10-20 18:31                     ` Alexander Graf
  2022-10-20 20:37                       ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Graf @ 2022-10-20 18:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Alex Williamson, Radim Krčmář,
	kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini

On 26.06.20 19:32, Sean Christopherson wrote:
> /cast <thread necromancy>
> 
> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:

[...]

> I don't think any of this explains the pass-through GPU issue.  But, we
> have a few use cases where zapping the entire MMU is undesirable, so I'm
> going to retry upstreaming this patch with a per-VM opt-in.  I wanted to
> set the record straight for posterity before doing so.

Hey Sean,

Did you ever get around to upstreaming or reworking the zap optimization? The 
way I read current upstream, a memslot change still always wipes all 
SPTEs, not only the ones that were changed.


Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2022-10-20 18:31                     ` Alexander Graf
@ 2022-10-20 20:37                       ` Sean Christopherson
  2022-10-20 21:06                         ` Alexander Graf
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2022-10-20 20:37 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Alex Williamson, Radim Krčmář,
	kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini

On Thu, Oct 20, 2022, Alexander Graf wrote:
> On 26.06.20 19:32, Sean Christopherson wrote:
> > /cast <thread necromancy>
> > 
> > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> 
> [...]
> 
> > I don't think any of this explains the pass-through GPU issue.  But, we
> > have a few use cases where zapping the entire MMU is undesirable, so I'm
> > going to retry upstreaming this patch with a per-VM opt-in.  I wanted to
> > set the record straight for posterity before doing so.
> 
> Hey Sean,
> 
> Did you ever get around to upstreaming or reworking the zap optimization? The way
> I read current upstream, a memslot change still always wipes all SPTEs, not
> only the ones that were changed.

Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
much a special case.  

Do you have a use case and/or issue that doesn't play nice with the "zap all" behavior?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2022-10-20 20:37                       ` Sean Christopherson
@ 2022-10-20 21:06                         ` Alexander Graf
  2022-10-21 19:40                           ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Graf @ 2022-10-20 21:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Alex Williamson, Radim Krčmář,
	kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini


On 20.10.22 22:37, Sean Christopherson wrote:
> On Thu, Oct 20, 2022, Alexander Graf wrote:
>> On 26.06.20 19:32, Sean Christopherson wrote:
>>> /cast <thread necromancy>
>>>
>>> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
>> [...]
>>
>>> I don't think any of this explains the pass-through GPU issue.  But, we
>>> have a few use cases where zapping the entire MMU is undesirable, so I'm
>>> going to retry upstreaming this patch with a per-VM opt-in.  I wanted to
>>> set the record straight for posterity before doing so.
>> Hey Sean,
>>
>> Did you ever get around to upstreaming or reworking the zap optimization? The way
>> I read current upstream, a memslot change still always wipes all SPTEs, not
>> only the ones that were changed.
> Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
> TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
> much a special case.
>
> Do you have a use case and/or issue that doesn't play nice with the "zap all" behavior?


Yeah, we're looking at adding support for the Hyper-V VSM extensions 
which Windows uses to implement Credential Guard. With that, the guest 
gets access to hypercalls that allow it to set reduced permissions for 
arbitrary gfns. To ensure that user space has full visibility into those 
for live migration, using memory slots to model access would be a great fit. 
But it means we'd do ~100k memslot modifications on boot.


Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2022-10-20 21:06                         ` Alexander Graf
@ 2022-10-21 19:40                           ` Sean Christopherson
  2022-10-24  6:12                             ` Alexander Graf
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2022-10-21 19:40 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Alex Williamson, Radim Krčmář,
	kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini

On Thu, Oct 20, 2022, Alexander Graf wrote:
> 
> On 20.10.22 22:37, Sean Christopherson wrote:
> > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > On 26.06.20 19:32, Sean Christopherson wrote:
> > > > /cast <thread necromancy>
> > > > 
> > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> > > [...]
> > > 
> > > > I don't think any of this explains the pass-through GPU issue.  But, we
> > > > have a few use cases where zapping the entire MMU is undesirable, so I'm
> > > going to retry upstreaming this patch with a per-VM opt-in.  I wanted to
> > > > set the record straight for posterity before doing so.
> > > Hey Sean,
> > > 
> > > Did you ever get around to upstreaming or reworking the zap optimization? The way
> > > I read current upstream, a memslot change still always wipes all SPTEs, not
> > > only the ones that were changed.
> > Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
> > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
> > much a special case.
> > 
> > Do you have a use case and/or issue that doesn't play nice with the "zap all" behavior?
> 
> 
> Yeah, we're looking at adding support for the Hyper-V VSM extensions which
> Windows uses to implement Credential Guard. With that, the guest gets access
> to hypercalls that allow it to set reduced permissions for arbitrary gfns.
> To ensure that user space has full visibility into those for live migration,
> using memory slots to model access would be a great fit. But it means we'd do
> ~100k memslot modifications on boot.

Oof.  100k memslot updates is going to be painful irrespective of flushing.  And
memslots (in their current form) won't work if the guest can drop executable
permissions.

Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve
the "KVM flushes everything on memslot deletion", I think we should instead
properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing
userspace to delete the memslot.  Commit 75d61fbcf563 ("KVM: set_memory_region:
Disallow changing read-only attribute later") was just a quick-and-dirty fix,
there's no fundamental problem that makes it impossible (or even all that difficult)
to support toggling permissions.

The ABI would be that KVM only guarantees the new permissions take effect when
the ioctl() returns, i.e. KVM doesn't need to ensure there are no writable SPTEs
when the memslot is installed, just that there are no writable SPTEs before
userspace regains control.

E.g. sans sanity checking and whatnot, I think x86 support would be something like:

@@ -12669,9 +12667,16 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
         * MOVE/DELETE: The old mappings will already have been cleaned up by
         *              kvm_arch_flush_shadow_memslot().
         */
-       if ((change != KVM_MR_FLAGS_ONLY) || (new_flags & KVM_MEM_READONLY))
+       if (change != KVM_MR_FLAGS_ONLY)
                return;
 
+       if ((old_flags ^ new_flags) & KVM_MEM_READONLY) {
+               if ((new_flags & KVM_MEM_READONLY) &&
+                   kvm_mmu_slot_write_protect(kvm, new))
+                       kvm_arch_flush_remote_tlbs_memslot(kvm, new);
+               return;
+       }
+
        /*
         * READONLY and non-flags changes were filtered out above, and the only
         * other flag is LOG_DIRTY_PAGES, i.e. something is wrong if dirty


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2022-10-21 19:40                           ` Sean Christopherson
@ 2022-10-24  6:12                             ` Alexander Graf
  2022-10-24 15:55                               ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Graf @ 2022-10-24  6:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Alex Williamson, Radim Krčmář,
	kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini

Hey Sean,

On 21.10.22 21:40, Sean Christopherson wrote:
>
> On Thu, Oct 20, 2022, Alexander Graf wrote:
>> On 20.10.22 22:37, Sean Christopherson wrote:
>>> On Thu, Oct 20, 2022, Alexander Graf wrote:
>>>> On 26.06.20 19:32, Sean Christopherson wrote:
>>>>> /cast <thread necromancy>
>>>>>
>>>>> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
>>>> [...]
>>>>
>>>>> I don't think any of this explains the pass-through GPU issue.  But, we
>>>>> have a few use cases where zapping the entire MMU is undesirable, so I'm
>>>>> going to retry upstreaming this patch with a per-VM opt-in.  I wanted to
>>>>> set the record straight for posterity before doing so.
>>>> Hey Sean,
>>>>
>>>> Did you ever get around to upstreaming or reworking the zap optimization? The way
>>>> I read current upstream, a memslot change still always wipes all SPTEs, not
>>>> only the ones that were changed.
>>> Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
>>> TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
>>> much a special case.
>>>
>>> Do you have a use case and/or issue that doesn't play nice with the "zap all" behavior?
>>
>> Yeah, we're looking at adding support for the Hyper-V VSM extensions which
>> Windows uses to implement Credential Guard. With that, the guest gets access
>> to hypercalls that allow it to set reduced permissions for arbitrary gfns.
>> To ensure that user space has full visibility into those for live migration,
>> using memory slots to model access would be a great fit. But it means we'd do
>> ~100k memslot modifications on boot.
> Oof.  100k memslot updates is going to be painful irrespective of flushing.  And
> memslots (in their current form) won't work if the guest can drop executable
> permissions.
>
> Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve
> the "KVM flushes everything on memslot deletion", I think we should instead
> properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing
> userspace to delete the memslot.  Commit 75d61fbcf563 ("KVM: set_memory_region:


That would be a cute acceleration for the case where we have to change 
permissions for a full slot. Unfortunately, the bulk of the changes are 
slot splits. Let me explain with numbers from a 1 vcpu, 8GB Windows 
Server 2019 boot:

GFN permission modification requests: 46294
Unique GFNs: 21200

That means on boot, we start off with a few huge memslots for guest RAM. 
Then down the road, we need to change permissions for individual pages 
inside these larger regions. The obvious option for that is a memslot 
split - delete, create, create, create. Now we have 2 large memslots and 
1 that only spans a single page.
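
For concreteness, with the stock UAPI one such split looks roughly like
this (sketch only; vm_fd, slot_id and the slot layout are made up, error
handling omitted):

	/* Deleting a slot means setting memory_size to 0... */
	struct kvm_userspace_memory_region del = {
		.slot = slot_id,
		.memory_size = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &del);
	/* ...followed by three creates via the same ioctl: [start, gfn),
	 * the single page with KVM_MEM_READONLY in .flags, and (gfn, end).
	 * Today it's the delete that zaps every SPTE in the VM. */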

Later in the boot process, Windows then sometimes also toggles 
permissions for pages that it already split off earlier. That's the case 
we can optimize with the modify optimization you described in the 
previous email. But that's only about half the requests. The other half 
are memslot split requests.

We already built a prototype implementation of an atomic memslot update 
ioctl that allows us to keep other vCPUs running while we do the 
delete/create/create/create operation. But even with that, we see up to 
30 min boot times for larger guests that spend most of that time stuck 
zapping pages.

I guess we have 2 options to make this viable:

   1) Optimize memslot splits + modifications to a point where they're 
fast enough
   2) Add a different, faster mechanism on top of memslots for page 
granular permission bits

Also sorry for not posting the underlying credguard and atomic memslot 
patches yet. I wanted to kick off this conversation before sending them 
out - they're still too raw for upstream review atm :).


Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
  2022-10-24  6:12                             ` Alexander Graf
@ 2022-10-24 15:55                               ` Sean Christopherson
  0 siblings, 0 replies; 38+ messages in thread
From: Sean Christopherson @ 2022-10-24 15:55 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Alex Williamson, Radim Krčmář,
	kvm, Xiao Guangrong, Chandrasekaran, Siddharth, Paolo Bonzini

On Mon, Oct 24, 2022, Alexander Graf wrote:
> Hey Sean,
> 
> On 21.10.22 21:40, Sean Christopherson wrote:
> > 
> > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > On 20.10.22 22:37, Sean Christopherson wrote:
> > > > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > > > On 26.06.20 19:32, Sean Christopherson wrote:
> > > > > > /cast <thread necromancy>
> > > > > > 
> > > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> > > > > [...]
> > > > > 
> > > > > > I don't think any of this explains the pass-through GPU issue.  But, we
> > > > > > have a few use cases where zapping the entire MMU is undesirable, so I'm
> > > > > > going to retry upstreaming this patch with a per-VM opt-in.  I wanted to
> > > > > > set the record straight for posterity before doing so.
> > > > > Hey Sean,
> > > > > 
> > > > > Did you ever get around to upstreaming or reworking the zap optimization? The way
> > > > > I read current upstream, a memslot change still always wipes all SPTEs, not
> > > > > only the ones that were changed.
> > > > Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
> > > > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
> > > > much a special case.
> > > > 
> > > > Do you have a use case and/or issue that doesn't play nice with the "zap all" behavior?
> > > 
> > > Yeah, we're looking at adding support for the Hyper-V VSM extensions which
> > > Windows uses to implement Credential Guard. With that, the guest gets access
> > > to hypercalls that allow it to set reduced permissions for arbitrary gfns.
> > > To ensure that user space has full visibility into those for live migration,
> > > using memory slots to model access would be a great fit. But it means we'd do
> > > ~100k memslot modifications on boot.
> > Oof.  100k memslot updates is going to be painful irrespective of flushing.  And
> > memslots (in their current form) won't work if the guest can drop executable
> > permissions.
> > 
> > Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve
> > the "KVM flushes everything on memslot deletion", I think we should instead
> > properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing
> > userspace to delete the memslot.  Commit 75d61fbcf563 ("KVM: set_memory_region:
> 
> 
> That would be a cute acceleration for the case where we have to change
> permissions for a full slot. Unfortunately, the bulk of the changes are slot
> splits.

Ah, right, the guest will be operating at per-page granularity.

> We already built a prototype implementation of an atomic memslot update
> ioctl that allows us to keep other vCPUs running while we do the
> delete/create/create/create operation.

Please weigh in with your use case on a relevant upstream discussion regarding
"atomic" memslot updates[*].  I suspect we'll end up with a different solution
for this use case (see below), but we should at least capture all potential use
cases and ideas for modifying memslots without pausing vCPUs.

[*] https://lore.kernel.org/all/20220909104506.738478-1-eesposit@redhat.com

> But even with that, we see up to 30 min boot times for larger guests that
> most of the time are stuck in zapping pages.

Out of curiosity, did you measure runtime performance?  I would expect some amount
of runtime overhead as well due to fragmenting memslots to that degree.

> I guess we have 2 options to make this viable:
> 
>   1) Optimize memslot splits + modifications to a point where they're fast
> enough
>   2) Add a different, faster mechanism on top of memslots for page granular
> permission bits

#2 crossed my mind as well.  This is actually nearly identical to the confidential
VM use case, where KVM needs to handle guest-initiated conversions of memory between
"private" and "shared" on a per-page granularity.  The proposed solution for that
is indeed a layer on top of memslots[*], which we arrived at in no small part because
splitting memslots was going to be a bottleneck.

Extending the proposed mem_attr_array to support additional state should be quite
easy.  The framework is all there, KVM just needs a few extra flags values, e.g.

	KVM_MEM_ATTR_SHARED	BIT(0)
	KVM_MEM_ATTR_READONLY	BIT(1)
	KVM_MEM_ATTR_NOEXEC	BIT(2)

and then new ioctls to expose the functionality to userspace.  Actually, if we
want to go this route, it might even make sense to define a new generic MEM_ATTR
ioctl() right away instead of repurposing KVM_MEMORY_ENCRYPT_(UN)REG_REGION for
the private vs. shared use case.

[*] https://lore.kernel.org/all/20220915142913.2213336-6-chao.p.peng@linux.intel.com
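
For illustration, such a generic ioctl() might take a gfn range plus the
attributes (purely hypothetical sketch, not an actual or proposed UAPI;
every name below is made up):

	/* Set attributes for a range of gfns without touching memslots. */
	struct kvm_mem_attr_range {
		__u64 gfn_start;
		__u64 nr_pages;
		__u64 attrs;		/* KVM_MEM_ATTR_* flags from above */
	};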

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2022-10-24 17:24 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-05 20:54 [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Sean Christopherson
2019-02-05 20:54 ` [PATCH v2 01/27] KVM: Call kvm_arch_memslots_updated() before updating memslots Sean Christopherson
2019-02-06  9:12   ` Cornelia Huck
2019-02-12 12:36 ` [PATCH v2 00/27] KVM: x86/mmu: Remove fast invalidate mechanism Paolo Bonzini
     [not found] ` <20190205210137.1377-11-sean.j.christopherson@intel.com>
2019-08-13 16:04   ` [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot Alex Williamson
2019-08-13 17:04     ` Sean Christopherson
2019-08-13 17:57       ` Alex Williamson
2019-08-13 19:33         ` Alex Williamson
2019-08-13 20:19           ` Sean Christopherson
2019-08-13 20:37             ` Paolo Bonzini
2019-08-13 21:14               ` Alex Williamson
2019-08-13 21:15                 ` Paolo Bonzini
2019-08-13 22:10                   ` Alex Williamson
2019-08-15 14:46                 ` Sean Christopherson
2019-08-15 15:23             ` Alex Williamson
2019-08-15 16:00               ` Sean Christopherson
2019-08-15 18:16                 ` Alex Williamson
2019-08-15 19:25                   ` Sean Christopherson
2019-08-15 20:11                     ` Alex Williamson
2019-08-19 16:03               ` Paolo Bonzini
2019-08-20 20:03                 ` Sean Christopherson
2019-08-20 20:42                   ` Alex Williamson
2019-08-20 21:02                     ` Sean Christopherson
2019-08-21 19:08                       ` Alex Williamson
2019-08-21 19:35                         ` Alex Williamson
2019-08-21 20:30                           ` Sean Christopherson
2019-08-23  2:25                             ` Sean Christopherson
2019-08-23 22:05                               ` Alex Williamson
2019-08-21 20:10                         ` Sean Christopherson
2019-08-26  7:36                           ` Tian, Kevin
2019-08-26 14:56                           ` Sean Christopherson
2020-06-26 17:32                   ` Sean Christopherson
2022-10-20 18:31                     ` Alexander Graf
2022-10-20 20:37                       ` Sean Christopherson
2022-10-20 21:06                         ` Alexander Graf
2022-10-21 19:40                           ` Sean Christopherson
2022-10-24  6:12                             ` Alexander Graf
2022-10-24 15:55                               ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).