On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> 
> NAK, at least as a bug fix.

OK. Not a bug fix.

> 
> The contract with the gfn_to_pfn_cache, or rather the lack thereof,
> is all kinds of screwed up.

Well yes, that's my point. How's this for a reworked commit message
which doesn't claim it's a bug fix....


    KVM: pfncache: clean up and simplify locking mess
    
    The locking on the gfn_to_pfn_cache is... interesting. And awful.
    
    There is a rwlock in ->lock which readers take to ensure protection
    against concurrent changes. But __kvm_gpc_refresh() makes assumptions
    that certain fields will not change even while it drops the write lock
    and performs MM operations to revalidate the target PFN and kernel
    mapping.
    
    Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via
    mutex") partly addressed that — not by fixing it, but by adding a new
    mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh()
    calls on a given gfn_to_pfn_cache, but is still only a partial solution.
    
    There is still a theoretical race where __kvm_gpc_refresh() runs in
    parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has
    dropped the write lock, kvm_gpc_deactivate() clears the ->active flag
    and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous
    ->pfn and ->khva are still valid, and reinstalls those values into the
    structure. This leaves the gfn_to_pfn_cache with the ->valid bit set,
    but ->active clear. And a ->khva which looks like a reasonable kernel
    address but is actually unmapped.
    
    All it takes is a subsequent reactivation to cause that ->khva to be
    dereferenced. This would theoretically cause an oops which would look
    something like this:
    
    [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0
    [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0
    
    I say "theoretically" because theoretically, that oops that was seen in
    production cannot happen. The code which uses the gfn_to_pfn_cache is
    supposed to have its *own* locking, to further paper over the fact that
    the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own
    rwlock abuse is not sufficient.
    
    For the Xen vcpu_info that external lock is the vcpu->mutex, and for the
    shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect
    the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all
    but the cases where the vcpu or kvm object is being *destroyed*, in
    which case the subsequent reactivation should never happen.
    
    Theoretically.
    
    Nevertheless, this locking abuse is awful and should be fixed, even if
    no clear explanation can be found for how the oops happened. So...
    
    Clean up the semantics of hva_to_pfn_retry() so that it no longer does
    any locking gymnastics because it no longer operates on the gpc object
    at all. It is now called with a uhva and simply returns the
    corresponding pfn (pinned), and a mapped khva for it.
    
    Its caller __kvm_gpc_refresh() now sets gpc->uhva and clears gpc->valid
    before dropping ->lock, calling hva_to_pfn_retry() and retaking ->lock
    for write.
    
    If hva_to_pfn_retry() fails, *or* if the ->uhva or ->active fields in
    the gpc changed while the lock was dropped, the new mapping is discarded
    and the gpc is not modified. On success with an unchanged gpc, the new
    mapping is installed and the current ->pfn and ->uhva are taken into the
    local old_pfn and old_khva variables to be unmapped once the locks are
    all released.
    
    This simplification means that ->refresh_lock is no longer needed for
    correctness, but it does still provide a minor optimisation because it
    will prevent two concurrent __kvm_gpc_refresh() calls from mapping a
    given PFN, only for one of them to lose the race and discard its
    mapping.
    
    The optimisation in hva_to_pfn_retry() where it attempts to use the old
    mapping if the pfn doesn't change is dropped, since it makes the pinning
    more complex. It's a pointless optimisation anyway, since the odds of
    the pfn ending up the same when the uhva has changed (i.e. the odds of
    the two userspace addresses both pointing to the same underlying
    physical page) are negligible,
    
    The 'hva_changed' local variable in __kvm_gpc_refresh() is also removed,
    since it's simpler just to clear gpc->valid if the uhva changed.
    Likewise the unmap_old variable is dropped because it's just as easy to
    check the old_pfn variable for KVM_PFN_ERR_FAULT.
    
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Reviewed-by: Paul Durrant <pdurrant@amazon.com>