On Fri, 2021-11-12 at 10:31 +0100, Paolo Bonzini wrote:
> On 11/12/21 09:28, David Woodhouse wrote:
> > I do not recall that we'd actually reached a conclusion that we *will*
> > make the gfn_to_pfn cache generally usable in that fashion. The latest
> > I knew of that discussion was my message at
> > https://lore.kernel.org/kvm/55a5d4e3fbd29dd55e276b97eeaefd0411b3290b.camel@infradead.org/
> > 
> > in which I said I'd be a whole lot happier with that if we could do it
> > with RCU instead of an rwlock — but I don't think we can because we'd
> > need to call synchronize_srcu() in the MMU notifier callback that might
> > not be permitted to sleep?
> 
> Why do you have a problem with the rwlock? If it's per-cache, and it's
> mostly taken within vCPU context (with the exception of Xen), contention
> should be nonexistent.

My problem with using the rwlock instead of RCU is not the contention,
it's...

> > I'm also slightly less comfortable with having the MMU notifier work
> > through an arbitrary *list* of gfn_to_pfn caches that it potentially
> > needs to invalidate, but that is very much a minor concern compared
> > with the first.
> > 
> > I started looking through the nested code which is the big user of this
> > facility.
> 
> Yes, that's also where I got stuck in my first attempt a few months ago.
> I agree that it can be changed to use gfn-to-hva caches, except for
> the vmcs12->posted_intr_desc_addr and vmcs12->virtual_apic_page_addr.
> 

... that anything accessing these will *still* need to do so in atomic
context. There's an atomic access which might fail, and then you fall
back to a context in which you can sleep to refresh the mapping, and
you *still* need to perform the actual access with the spinlock held to
protect against concurrent invalidation.

So let's take a look... for posted_intr_desc_addr, that host physical
address is actually written to the VMCS02, isn't it? Thinking about the
case where the target page is being invalidated while the vCPU is
running... surely in that case the only 'correct' solution is that the
vCPU needs to be kicked out of non-root mode before the
invalidate_range() notifier completes? That would have worked nicely if
the MMU notifier could call synchronize_srcu() on invalidation. Can it
kick the vCPU and wait for it to exit though? Or maybe there's a
variant where we only have to ensure that no posted interrupts will
actually be *delivered* after the invalidation?

Don't get me wrong, a big part of me *loves* the idea that the hairiest
part of my Xen event channel delivery is actually a bug fix that we
need in the kernel anyway, and then the rest of it is simple and
uncontentious. I'm just not quite sure I see how to provide a generic
mechanism that actually *fixes* the bugs that already exist elsewhere —
at least not without them having their own special cases for
invalidation anyway.

(ISTR the virtual APIC page is a bit different because it's only an
*address* and it doesn't even have to be backed by real memory at the
corresponding HPA? Otherwise it's basically the same issue?)
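
To be concrete about the access pattern I mean above, here's a rough
sketch. None of these names are real; struct gfn_to_pfn_cache as shown
and gpc_refresh() are purely hypothetical, and it assumes the fallback
path runs somewhere we're actually allowed to sleep:

/*
 * Illustrative only: hypothetical structure/helper names, not real
 * KVM code.
 */
struct gfn_to_pfn_cache {
	rwlock_t lock;		/* taken for read around each access */
	bool valid;		/* cleared by the MMU notifier on invalidation */
	void *khva;		/* kernel mapping of the cached guest page */
};

/* Hypothetical: revalidate the mapping; may sleep, takes lock for write. */
int gpc_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc);

static int post_interrupt(struct kvm_vcpu *vcpu,
			  struct gfn_to_pfn_cache *gpc, int vector)
{
	read_lock(&gpc->lock);
	while (!gpc->valid) {
		read_unlock(&gpc->lock);

		/*
		 * The fast path failed because the mapping was invalidated.
		 * Refresh from a context where sleeping is allowed, then
		 * retry; the write itself must still happen under the lock
		 * so a concurrent invalidation can't pull the mapping away.
		 */
		if (gpc_refresh(vcpu->kvm, gpc))
			return -EFAULT;

		read_lock(&gpc->lock);
	}

	set_bit(vector, (unsigned long *)gpc->khva);
	read_unlock(&gpc->lock);
	return 0;
}

Even with the refresh done from a sleepable context, the write only
stays safe because the invalidation path takes the same lock for write
first; and that's precisely the part which can't help once the HPA has
been handed straight to the VMCS02 for the hardware to write to.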