On Tue, 2021-11-16 at 18:42 +0100, Paolo Bonzini wrote:
> On 11/16/21 17:06, David Woodhouse wrote:
> > On Tue, 2021-11-16 at 16:49 +0100, Paolo Bonzini wrote:
> > > On 11/16/21 16:09, David Woodhouse wrote:
> > > > On Tue, 2021-11-16 at 15:57 +0100, Paolo Bonzini wrote:
> > > > > This should not be needed, should it? As long as the gfn-to-pfn
> > > > > cache's vcpu field is handled properly, the request will just cause
> > > > > the vCPU not to enter.
> > > >
> > > > If the MMU mappings never change, the request never happens. But the
> > > > memslots *can* change, so it does need to be revalidated each time
> > > > through I think?
> > >
> > > That needs to be done on KVM_SET_USER_MEMORY_REGION, using the same
> > > request (or even the same list walking code) as the MMU notifiers.
> >
> > Hm.... kvm_arch_memslots_updated() is already kicking every vCPU after
> > the update, and although that was asynchronous it was actually OK
> > because unlike in the MMU notifier case, that page wasn't actually
> > going away — and if that HVA *did* subsequently go away, our HVA-based
> > notifier check would still catch that and kill it synchronously.
>
> Right, so it only needs to change the kvm_vcpu_kick into a
> kvm_make_all_cpus_request without KVM_WAIT.

Yeah, I think that works.

> > > > Hm, in my head that was never going to *change* for a given gpc; it
> > > > *belongs* to that vCPU for ever (and was even part of vmx->nested. for
> > > > that vCPU, to replace e.g. vmx->nested.pi_desc_map).
> > >
> > > Ah okay, I thought it would be set in nested vmentry and cleared in
> > > nested vmexit.
> >
> > I don't think it needs to be proactively cleared; we just don't
> > *refresh* it until we need it again.
>
> True, but if it's cleared the vCPU won't be kicked, which is nice.

The vCPU will only be kicked once when it becomes invalid anyway. It's
a trade-off: either we leave it valid for the next time that L2 vCPU
is entered, or we actively clear it. Not sure I lose much sleep either
way?

> > If we *know* the GPA and size haven't changed, and if we know that
> > gpc->valid becoming false would have been handled differently, then we
> > could optimise that whole thing away quite effectively to a single
> > check on ->generations?
>
> I wonder if we need a per-gpc memslot generation... Can it be global?

Theoretically, maybe. It's kind of mathematically equivalent to the
highest of the per-gpc memslot generations, and any gpc which *isn't*
at that maximum is by definition invalid. But I'm not sure I see how
to implement it without actively going and clearing the 'valid' bit on
all GPCs when it gets bumped... which was your previous suggestion of
basically running the same code as in the MMU notifier?

> > This one actually compiles. Not sure we have any test cases that will
> > actually exercise it though, do we?
>
> I'll try to spend some time writing testcases.

> > +	read_lock(&gpc->lock);
> > +	if (!kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, PAGE_SIZE)) {
> > +		read_unlock(&gpc->lock);
> >  		goto mmio_needed;
> > +	}
> > +
> > +	vapic_page = gpc->khva;
>
> If we know this gpc is of the synchronous kind, I think we can skip the
> read_lock/read_unlock here?!?

Er... this one was OUTSIDE_GUEST_MODE and is the atomic kind, which
means it needs to hold the lock for the duration of the access in
order to prevent (preemption and) racing with the invalidate? It's the
IN_GUEST_MODE one (in my check_guest_maps()) where we might get away
without the lock, perhaps?
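For the avoidance of doubt, the pattern I have in mind for the atomic
kind is something like the sketch below. Untested, and the refresh
signature is only what my current draft has, so don't hold me to it;
the hunk above just bails straight out to mmio_needed rather than
retrying the refresh inline.

	read_lock(&gpc->lock);
	while (!kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa,
					   PAGE_SIZE)) {
		read_unlock(&gpc->lock);

		/* Refresh may sleep; fine here, we're OUTSIDE_GUEST_MODE. */
		if (kvm_gfn_to_pfn_cache_refresh(vcpu->kvm, gpc, gpc->gpa,
						 PAGE_SIZE))
			goto mmio_needed;

		read_lock(&gpc->lock);
	}

	/* No sleeping from here until read_unlock(). */
	vapic_page = gpc->khva;
	__kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page, &max_irr);
	read_unlock(&gpc->lock);

The point being that nothing between the successful check and the
read_unlock() can sleep, so the invalidation side can wait on the
rwlock and know that nobody is still dereferencing the stale mapping
once it has taken the write lock.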
> >  	__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
> >  				vapic_page, &max_irr);
> > @@ -3749,6 +3783,7 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
> >  		status |= (u8)max_irr;
> >  		vmcs_write16(GUEST_INTR_STATUS, status);
> >  	}
> > +	read_unlock(&gpc->lock);
> >  }

I just realised that the mark_page_dirty() on invalidation, and when
the irqfd workqueue refreshes the gpc, might fall foul of the same
dirty_ring problem that I belatedly just spotted with the Xen shinfo
clock write. I'll fix it up to *always* require a vcpu (to be
associated with the writes), and reinstate the guest_uses_pa flag
since that can no longer be implicit in (vcpu != NULL). See the
sketch at the end of this mail.

I may leave the actual task of fixing nesting to you, if that's OK, as
long as we consider the new gfn_to_pfn_cache sufficient to address the
problem? I think it's mostly down to how we *use* it now, rather than
the fundamental design of the cache itself?
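Roughly this, for the init function; the naming and parameter list are
from my in-flight draft, so treat it as illustrative rather than final:

/*
 * A vcpu is now always required, so that the dirty ring has a vCPU to
 * attribute the writes to. The guest_uses_pa flag comes back as an
 * explicit parameter because it can no longer be inferred from
 * (vcpu != NULL).
 */
int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
			      struct kvm_vcpu *vcpu, bool guest_uses_pa,
			      bool kernel_map, gpa_t gpa, unsigned long len,
			      bool dirty);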