+Alex, whom I completely spaced on Cc'ing.

Alex, this is related to the dreaded VFIO memslot zapping issue from last
year.  Start of thread: https://patchwork.kernel.org/patch/11640719/.

The TL;DR of below: can you try the attached patch with your reproducer
from the original bug[*]?  I honestly don't know whether it has a legitimate
chance of working, but it's the one thing in all of this that I know was
definitely a bug.  I'd like to test it out if only to sate my curiosity.
Absolutely no rush.

[*] https://patchwork.kernel.org/patch/10798453/#22817321

On Fri, Jul 10, 2020 at 12:18:17AM +0200, Paolo Bonzini wrote:
> On 09/07/20 23:12, Sean Christopherson wrote:
> >> It's bad that we have no clue what's causing the bad behavior, but I
> >> don't think it's wise to have a bug that is known to happen when you
> >> enable the capability. :/
> 
> (Note that this wasn't a NACK, though subtly so).
> 
> > I don't necessarily disagree, but at the same time it's entirely possible
> > it's a Qemu bug.
> 
> No, it cannot be.  QEMU is not doing anything but
> KVM_SET_USER_MEMORY_REGION, and it's doing that synchronously with
> writes to the PCI configuration space BARs.

I'm not saying it's likely, but it's certainly possible.  The failure
went away when KVM zapped SPTEs for seemingly unrelated addresses, i.e. the
error likely goes beyond just the memslot aspect.

> > Even if this is a kernel bug, I'm fairly confident at this point that it's
> > not a KVM bug.  Or rather, if it's a KVM "bug", then there's a fundamental
> > dependency in memslot management that needs to be rooted out and documented.
> 
> Heh, here my surmise is that  it cannot be anything but a KVM bug,
> because  Memslots are not used by anything outside KVM...  But maybe I'm
> missing something.

As above, it's not really a memslot issue, it's more of a paging/TLB issue,
or possibly none of the above.  E.g. it could be a timing bug that goes away
simply because zapping and rebuilding slows things down to the point where
the timing window is closed.

I should have qualified "fairly confident ... that it's not a KVM bug" as
"not a KVM bug related to removing SPTEs for the deleted/moved memslot _as
implemented in this patch_".

Digging back through the old thread, I don't think we ever tried passing
%true for @lock_flush_tlb when zapping rmaps.  And a comment from Alex also
caught my eye, where he said of the following: "If anything, removing this
chunk seems to make things worse."

	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
		kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
		flush = false;
		cond_resched_lock(&kvm->mmu_lock);
	}

A somewhat far fetched theory is that passing %false to kvm_zap_rmapp()
via slot_handle_all_level() created a window where a vCPU could have both
the old stale entry and the new memslot entry in its TLB if the equivalent
to above lock dropping in slot_handle_level_range() triggered.

Removing the above intermediate flush would exacerbate the theoretical
problem by further delaying the flush, i.e. would create a bigger window
for the guest to access the stale SPTE.

Where things get really far fetched is how zapping seemingly random SPTEs
fits in.  Best crazy guess is that zapping enough random things while holding
MMU lock would eventually zap a SPTE that caused the guest to block in the
EPT violation handler.

I'm not exactly confident that the correct zapping approach will actually
resolve the VFIO issue, but I think it's worth trying since not flushing
during rmap zapping was definitely a bug.

> > And we're kind of in a catch-22; it'll be extremely difficult to narrow down
> > exactly who is breaking what without being able to easily test the optimized
> > zapping with other VMMs and/or setups.
> 
> I agree with this, and we could have a config symbol that depends on
> BROKEN and enables it unconditionally.  However a capability is the
> wrong tool.

Ya, a capability is a bad idea.  I was coming at it from the angle that, if
there is a fundamental requirement with e.g. GPU passthrough that requires
zapping all SPTEs, then enabling the precise capability on a per-VM basis
would make sense.  But adding something to the ABI on pure speculation is
silly.