+Alex, whom I completely spaced on Cc'ing. Alex, this is related to the dreaded VFIO memslot zapping issue from last year. Start of thread: https://patchwork.kernel.org/patch/11640719/. The TL;DR of below: can you try the attached patch with your reproducer from the original bug[*]? I honestly don't know whether it has a legitimate chance of working, but it's the one thing in all of this that I know was definitely a bug. I'd like to test it out if only to sate my curiosity. Absolutely no rush. [*] https://patchwork.kernel.org/patch/10798453/#22817321 On Fri, Jul 10, 2020 at 12:18:17AM +0200, Paolo Bonzini wrote: > On 09/07/20 23:12, Sean Christopherson wrote: > >> It's bad that we have no clue what's causing the bad behavior, but I > >> don't think it's wise to have a bug that is known to happen when you > >> enable the capability. :/ > > (Note that this wasn't a NACK, though subtly so). > > > I don't necessarily disagree, but at the same time it's entirely possible > > it's a Qemu bug. > > No, it cannot be. QEMU is not doing anything but > KVM_SET_USER_MEMORY_REGION, and it's doing that synchronously with > writes to the PCI configuration space BARs. I'm not saying it's likely, but it's certainly possible. The failure went away when KVM zapped SPTEs for seemingly unrelated addresses, i.e. the error likely goes beyond just the memslot aspect. > > Even if this is a kernel bug, I'm fairly confident at this point that it's > > not a KVM bug. Or rather, if it's a KVM "bug", then there's a fundamental > > dependency in memslot management that needs to be rooted out and documented. > > Heh, here my surmise is that it cannot be anything but a KVM bug, > because Memslots are not used by anything outside KVM... But maybe I'm > missing something. As above, it's not really a memslot issue, it's more of a paging/TLB issue, or possibly none of the above. E.g. it could be a timing bug that goes away simply because zapping and rebuilding slows things down to the point where the timing window is closed. I should have qualified "fairly confident ... that it's not a KVM bug" as "not a KVM bug related to removing SPTEs for the deleted/moved memslot _as implemented in this patch_". Digging back through the old thread, I don't think we ever tried passing %true for @lock_flush_tlb when zapping rmaps. And a comment from Alex also caught my eye, where he said of the following: "If anything, removing this chunk seems to make things worse." if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); flush = false; cond_resched_lock(&kvm->mmu_lock); } A somewhat far fetched theory is that passing %false to kvm_zap_rmapp() via slot_handle_all_level() created a window where a vCPU could have both the old stale entry and the new memslot entry in its TLB if the equivalent to above lock dropping in slot_handle_level_range() triggered. Removing the above intermediate flush would exacerbate the theoretical problem by further delaying the flush, i.e. would create a bigger window for the guest to access the stale SPTE. Where things get really far fetched is how zapping seemingly random SPTEs fits in. Best crazy guess is that zapping enough random things while holding MMU lock would eventually zap a SPTE that caused the guest to block in the EPT violation handler. I'm not exactly confident that the correct zapping approach will actually resolve the VFIO issue, but I think it's worth trying since not flushing during rmap zapping was definitely a bug. > > And we're kind of in a catch-22; it'll be extremely difficult to narrow down > > exactly who is breaking what without being able to easily test the optimized > > zapping with other VMMs and/or setups. > > I agree with this, and we could have a config symbol that depends on > BROKEN and enables it unconditionally. However a capability is the > wrong tool. Ya, a capability is a bad idea. I was coming at it from the angle that, if there is a fundamental requirement with e.g. GPU passthrough that requires zapping all SPTEs, then enabling the precise capability on a per-VM basis would make sense. But adding something to the ABI on pure speculation is silly.