On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote: > > NAK, at least as a bug fix. OK. Not a bug fix. > > The contract with the gfn_to_pfn_cache, or rather the lack thereof, > is all kinds of screwed up. Well yes, that's my point. How's this for a reworked commit message which doesn't claim it's a bug fix.... KVM: pfncache: clean up and simplify locking mess The locking on the gfn_to_pfn_cache is... interesting. And awful. There is a rwlock in ->lock which readers take to ensure protection against concurrent changes. But __kvm_gpc_refresh() makes assumptions that certain fields will not change even while it drops the write lock and performs MM operations to revalidate the target PFN and kernel mapping. Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") partly addressed that — not by fixing it, but by adding a new mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh() calls on a given gfn_to_pfn_cache, but is still only a partial solution. There is still a theoretical race where __kvm_gpc_refresh() runs in parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has dropped the write lock, kvm_gpc_deactivate() clears the ->active flag and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous ->pfn and ->khva are still valid, and reinstalls those values into the structure. This leaves the gfn_to_pfn_cache with the ->valid bit set, but ->active clear. And a ->khva which looks like a reasonable kernel address but is actually unmapped. All it takes is a subsequent reactivation to cause that ->khva to be dereferenced. This would theoretically cause an oops which would look something like this: [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0 [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0 I say "theoretically" because theoretically, that oops that was seen in production cannot happen. The code which uses the gfn_to_pfn_cache is supposed to have its *own* locking, to further paper over the fact that the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own rwlock abuse is not sufficient. For the Xen vcpu_info that external lock is the vcpu->mutex, and for the shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all but the cases where the vcpu or kvm object is being *destroyed*, in which case the subsequent reactivation should never happen. Theoretically. Nevertheless, this locking abuse is awful and should be fixed, even if no clear explanation can be found for how the oops happened. So... Clean up the semantics of hva_to_pfn_retry() so that it no longer does any locking gymnastics because it no longer operates on the gpc object at all. It is now called with a uhva and simply returns the corresponding pfn (pinned), and a mapped khva for it. Its caller __kvm_gpc_refresh() now sets gpc->uhva and clears gpc->valid before dropping ->lock, calling hva_to_pfn_retry() and retaking ->lock for write. If hva_to_pfn_retry() fails, *or* if the ->uhva or ->active fields in the gpc changed while the lock was dropped, the new mapping is discarded and the gpc is not modified. On success with an unchanged gpc, the new mapping is installed and the current ->pfn and ->uhva are taken into the local old_pfn and old_khva variables to be unmapped once the locks are all released. This simplification means that ->refresh_lock is no longer needed for correctness, but it does still provide a minor optimisation because it will prevent two concurrent __kvm_gpc_refresh() calls from mapping a given PFN, only for one of them to lose the race and discard its mapping. The optimisation in hva_to_pfn_retry() where it attempts to use the old mapping if the pfn doesn't change is dropped, since it makes the pinning more complex. It's a pointless optimisation anyway, since the odds of the pfn ending up the same when the uhva has changed (i.e. the odds of the two userspace addresses both pointing to the same underlying physical page) are negligible, The 'hva_changed' local variable in __kvm_gpc_refresh() is also removed, since it's simpler just to clear gpc->valid if the uhva changed. Likewise the unmap_old variable is dropped because it's just as easy to check the old_pfn variable for KVM_PFN_ERR_FAULT. Signed-off-by: David Woodhouse Reviewed-by: Paul Durrant