On Sat, Feb 09, 2019 at 10:41:38AM +0100, Cédric Le Goater wrote:
> On 2/8/19 10:53 PM, Paul Mackerras wrote:
> > On Fri, Feb 08, 2019 at 08:58:14AM +0100, Cédric Le Goater wrote:
> >> On 2/8/19 6:15 AM, David Gibson wrote:
> >>> On Thu, Feb 07, 2019 at 10:03:15AM +0100, Cédric Le Goater wrote:
> >>>> That's the plan I have in mind, as suggested by Paul, if I
> >>>> understood it well.  The mechanics are more complex than the patch
> >>>> zapping the PTEs from the VMA, but it's also safer.
> >>>
> >>> Well, yes, where "safer" means "has the possibility to be correct".
> >>
> >> Well, the only problem with the kernel approach is keeping a pointer
> >> to the VMA.  If we could call find_vma(), it would be perfectly safe
> >> and much simpler.
> >
> > You seem to be assuming that the kernel can easily work out a single
> > virtual address which will be the only place where a given set of
> > interrupt pages are mapped.  But that is really not possible in the
> > general case, because userspace could have mapped the fd at many
> > different offsets in many different places.
> >
> > QEMU doesn't do that; in QEMU, the mmaps are sufficiently limited
> > that it can work out a single virtual address that needs to be
> > changed.  The way for QEMU to tell the kernel what that address is,
> > and what the mapping should be changed to, is via the existing
> > munmap()/mmap() interface.
> 
> Yes, we agreed on that.  QEMU should handle these mappings somewhere
> in VFIO.  It's me grumbling, that's all.
> 
> The discussion has moved to the mmap() interface of the KVM device.
> The current proposal adds controls on the device creating fds to
> mmap() the TIMA pages and the ESB pages.  David is proposing to mmap()
> these pages directly from the fd of the KVM device, with a different
> offset for each set.
> 
> I think that should work pretty well, for passthrough also.  The
> fault handler should take care of populating the VMA(s) with the
> appropriate pages.
> 
> We might support END notification one day, so we should have room for
> these pages.  And nested might require IRQ space extensions at L1.
> Something to keep in mind.

I had some more thoughts on this topic.  I think there's been some
confusion because there are more ways of tackling this than I
previously realized:

1) All in kernel

The offset always maps directly to the guest irq number, and the
kernel somehow binds it either to an IPI or a host irq as necessary.

Cédric's original code attempts this, but the mechanism of keeping a
pointer to the VMA can't work.  But.. remapping the irqs should be
sufficiently infrequent that it might be ok to consider simply
stepping through all the hosting process's VMAs to do this.

2) Remapped in qemu (using memory regions)

I _think_ (in hindsight) this is what Cédric has been discussing as
the alternative in more recent posts.  Qemu maps the IPI pages in one
place and the passthrough IRQ pages somewhere else.  The IPIs are
mapped into the guest as one memory region, then any passthrough IRQ
pages are mapped over that using overlapping memory regions.

I don't think this approach will work well, because it could require a
bunch of separate KVM memory slots, which are fairly scarce.

3) Remapped in qemu (using mmap())

This is the approach I (and I think Paul) have been suggesting in
contrast to (1).  Qemu maps the IPI pages and maps those into the
guest.  When we need to set up a passthrough IRQ, qemu mmap()s its
pages directly over the IPI pages, and the region remains mapped into
the guest with the same memory region / memslot as the IPIs are
already using.  If the passthrough device is removed, we have to remap
the IPI pages back into place.
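To make that concrete, here's a rough, untested sketch of the qemu
side.  The fd names and the offset layout (kvm_dev_fd, vfio_esb_fd,
two ESB pages per irq) are made up for illustration, not the actual
proposed interface; the point is just that MAP_SHARED | MAP_FIXED
replaces the backing pages in place, so the memslot covering the
region never has to change:

/*
 * Illustrative only: "kvm_dev_fd" / "vfio_esb_fd" and their offset
 * layout are invented stand-ins for whatever fds end up exposing the
 * KVM IPI ESB pages and the passthrough HW ESB pages.
 */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

#define ESB_PAGES_PER_IRQ   2   /* assumed: trigger page + EOI page */

/*
 * Map the hardware ESB pages of one passthrough irq over the IPI pages
 * already sitting at base + irq * stride.  MAP_FIXED replaces the old
 * pages atomically, and the guest-visible memslot covering 'base' is
 * untouched.
 */
static void *map_pthru_esb(void *base, unsigned long irq, size_t page_size,
                           int vfio_esb_fd, off_t vfio_esb_off)
{
    size_t stride = ESB_PAGES_PER_IRQ * page_size;

    return mmap((char *)base + irq * stride, stride,
                PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                vfio_esb_fd, vfio_esb_off);
}

/*
 * When the passthrough device goes away, put the KVM IPI pages back at
 * the same spot, again without touching the memslot.
 */
static void *restore_ipi_esb(void *base, unsigned long irq, size_t page_size,
                             int kvm_dev_fd, off_t ipi_esb_off)
{
    size_t stride = ESB_PAGES_PER_IRQ * page_size;

    return mmap((char *)base + irq * stride, stride,
                PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                kvm_dev_fd, ipi_esb_off + irq * stride);
}

Error handling (MAP_FAILED) and how the guest's view of the ESB pages
behaves during the switch are glossed over here; the thing I like is
that the qemu/kernel interface needed is nothing more than mmap() on
the right fd at the right offset.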
4) Dedicated irq numbers

We never re-use regular guest irq numbers for passthrough irqs;
instead we put them somewhere else and keep those mapped to the
passthrough irq pages.

I was favouring this approach, but it does mean there will be a
guest-visible difference between kernel_irqchip=on and off, which
isn't great.

(1) is the most elegant _interface_, but as we've seen it's
problematic to implement.  Looking at the for_all_vmas() approach
could be interesting, but otherwise option (3) might be the most
practical.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson