On Mon, 2021-11-15 at 19:50 +0100, Paolo Bonzini wrote:
> On 11/15/21 17:47, David Woodhouse wrote:
> > So... a user of this must check the validity after setting its mode to
> > IN_GUEST_MODE, and the invalidation must make a request and wake any
> > vCPU(s) which might be using it.
>
> Yes, though the check is implicit in the existing call to
> kvm_vcpu_exit_request(vcpu).

Right, though *handling* the request isn't (and I'm still not sure
whether to use a single new KVM_REQ_GPC_INVALIDATE or let the user of
the cache specify the req to use). I don't really want generic code
refreshing these caches even when they aren't going to be used (e.g.
vmcs02 for a vCPU that isn't even in L2 guest mode right now).
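Roughly, what I have in mind for the vCPU side is something like the
below. It's only a sketch and completely untested; kvm_gpc_check(),
kvm_gpc_refresh() and the vcpu->arch.gpc field are made-up names rather
than a real API, and the memory barriers are elided:

/*
 * Sketch only: kvm_gpc_check()/kvm_gpc_refresh() and vcpu->arch.gpc
 * are placeholders, and KVM_REQ_GPC_INVALIDATE may end up being a
 * per-user request instead of a single generic one.
 */
static int gpc_enter_guest(struct kvm_vcpu *vcpu)
{
	/* Refresh the cached mapping while sleeping is still allowed. */
	if (!kvm_gpc_check(vcpu->kvm, &vcpu->arch.gpc))
		kvm_gpc_refresh(vcpu->kvm, &vcpu->arch.gpc);

	vcpu->mode = IN_GUEST_MODE;
	/* (barriers between the mode write and the request check elided) */

	/*
	 * An invalidation which raced with the refresh above will have
	 * set the request and kicked this vCPU, so the existing
	 * kvm_vcpu_exit_request() check catches it here.
	 */
	if (kvm_vcpu_exit_request(vcpu)) {
		vcpu->mode = OUTSIDE_GUEST_MODE;
		return -EAGAIN;		/* go round again and refresh */
	}

	/* ... actually enter the guest ... */
	return 0;
}

The invalidation side then only has to make the request and kick any
vCPU which has marked the cache as in use.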
> > I moved the invalidation to the invalidate_range MMU notifier, as
> > discussed. But that's where the plan falls down a little bit because
> > IIUC, that one can't sleep at all.
>
> Which is a problem in the existing code, too. It hasn't broken yet
> because invalidate_range() is _usually_ called with no spinlocks taken
> (the only caller that does call with a spinlock taken seems to be
> hugetlb_cow).
>
> Once the dust settles, we need to add non_block_start/end around calls
> to ops->invalidate_range.
>
> > I need to move it *back* to
> > invalidate_range_start() where I had it before, if I want to let it
> > wait for vCPUs to exit. Which means... that the cache 'refresh' call
> > must wait until the mmu_notifier_count reaches zero? Am I allowed to
> > do that, and make the "There can be only one waiter" comment in
> > kvm_mmu_notifier_invalidate_range_end() no longer true?
>
> You can also update the cache while taking the mmu_lock for read, and
> retry if mmu_notifier_retry_hva tells you to do so. Looking at the
> scenario from commit e649b3f0188 you would have:
>
>    (Invalidator) kvm_mmu_notifier_invalidate_range_start()
>    (Invalidator) write_lock(mmu_lock)
>    (Invalidator) increment mmu_notifier_count
>    (Invalidator) write_unlock(mmu_lock)
>    (Invalidator) request KVM_REQ_APIC_PAGE_RELOAD
>    (KVM VCPU) vcpu_enter_guest()
>    (KVM VCPU) kvm_vcpu_reload_apic_access_page()
> +  (KVM VCPU) read_lock(mmu_lock)
> +  (KVM VCPU) mmu_notifier_retry_hva()
> +  (KVM VCPU) read_unlock(mmu_lock)
> +  (KVM VCPU) retry! (mmu_notifier_count>1)

But unless we do start using a waitq, it can just spin and spin and
spin here can't it?

 +  (KVM VCPU) read_lock(mmu_lock)
 +  (KVM VCPU) mmu_notifier_retry_hva()
 +  (KVM VCPU) read_unlock(mmu_lock)
 +  (KVM VCPU) retry! (mmu_notifier_count>1)

 +  (KVM VCPU) read_lock(mmu_lock)
 +  (KVM VCPU) mmu_notifier_retry_hva()
 +  (KVM VCPU) read_unlock(mmu_lock)
 +  (KVM VCPU) retry! (mmu_notifier_count>1)

 +  (KVM VCPU) read_lock(mmu_lock)
 +  (KVM VCPU) mmu_notifier_retry_hva()
 +  (KVM VCPU) read_unlock(mmu_lock)
 +  (KVM VCPU) retry! (mmu_notifier_count>1)

>    (Invalidator) actually unmap page

 +  (KVM VCPU) read_lock(mmu_lock)
 +  (KVM VCPU) mmu_notifier_retry_hva()
 +  (KVM VCPU) read_unlock(mmu_lock)
 +  (KVM VCPU) retry! (mmu_notifier_count>1)

 +  (KVM VCPU) read_lock(mmu_lock)
 +  (KVM VCPU) mmu_notifier_retry_hva()
 +  (KVM VCPU) read_unlock(mmu_lock)
 +  (KVM VCPU) retry! (mmu_notifier_count>1)

> +  (Invalidator) kvm_mmu_notifier_invalidate_range_end()
> +  (Invalidator) write_lock(mmu_lock)
> +  (Invalidator) decrement mmu_notifier_count
> +  (Invalidator) write_unlock(mmu_lock)
> +  (KVM VCPU) vcpu_enter_guest()
> +  (KVM VCPU) kvm_vcpu_reload_apic_access_page()
> +  (KVM VCPU) mmu_notifier_retry_hva()
>
> Changing mn_memslots_update_rcuwait to a waitq (and renaming it to
> mn_invalidate_waitq) is of course also a possibility.

I suspect that's the answer.

I think the actual *invalidation* of the cache still lives in the
invalidate_range() callback where I have it at the moment.

But making the req to the affected vCPUs can live in
invalidate_range_start(). And then the code which *handles* that req
can wait for the mmu_notifier_count to reach zero before it proceeds.
Atomic users of the cache (like the Xen event channel code) don't have
to get involved with that.

> Also, for the small requests: since you are at it, can you add the code
> in a new file under virt/kvm/?

Hm... only if I can make hva_to_pfn() and probably a handful of other
things non-static?
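Coming back to the invalidate_range_start() idea above, the request
handler might end up looking vaguely like the below, assuming
mn_memslots_update_rcuwait really does become a waitq
(mn_invalidate_waitq). Again this is just an untested sketch, and
kvm_gpc_handle_invalidate_req() / kvm_gpc_refresh() / vcpu->arch.gpc
are made-up names:

static void kvm_gpc_handle_invalidate_req(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;

	/*
	 * invalidate_range_start() made the request after bumping
	 * mmu_notifier_count; sleep here until invalidate_range_end()
	 * has dropped it back to zero and woken us, instead of
	 * spinning on mmu_notifier_retry_hva().
	 */
	wait_event(kvm->mn_invalidate_waitq,
		   READ_ONCE(kvm->mmu_notifier_count) == 0);

	/* Now it's safe to remap and revalidate the cache. */
	kvm_gpc_refresh(kvm, &vcpu->arch.gpc);
}

The invalidate_range_end() side would then do a wake_up() after it
decrements mmu_notifier_count, and the atomic users of the cache (the
Xen event channel code) never need to touch any of this.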