On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote:
> On Tue, 20 Dec 2016 11:44:41 +0800
> Peter Xu wrote:
> 
> > On Mon, Dec 19, 2016 at 09:56:50AM -0700, Alex Williamson wrote:
> > > On Mon, 19 Dec 2016 22:41:26 +0800
> > > Peter Xu wrote:
> > > 
> > > > This is preparation work to finally enable dynamic switching ON/OFF for
> > > > VT-d protection. The old VT-d code is using a static IOMMU region, and
> > > > that won't satisfy vfio-pci device listeners.
> > > > 
> > > > Let me explain.
> > > > 
> > > > vfio-pci devices depend on the memory region listener and IOMMU replay
> > > > mechanism to make sure the device mapping is coherent with the guest
> > > > even if there are domain switches. And there are two kinds of domain
> > > > switches:
> > > > 
> > > > (1) switch from domain A -> B
> > > > (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > > > 
> > > > Case (1) is handled by the context entry invalidation handling in the
> > > > VT-d replay logic. What the replay function should do here is to replay
> > > > the existing page mappings in domain B.
> > > > 
> > > > However for case (2), we don't want to replay any domain mappings - we
> > > > just need the default GPA->HPA mappings (the address_space_memory
> > > > mapping). And this patch helps with case (2) by building up the mapping
> > > > automatically by leveraging the vfio-pci memory listeners.
> > > > 
> > > > Another important thing that this patch does is to separate
> > > > IR (Interrupt Remapping) from DMAR (DMA Remapping). The IR region should
> > > > not depend on the DMAR region (as it did before this patch). It should be
> > > > a standalone region, and it should be able to be activated without
> > > > DMAR (which is a common behavior of the Linux kernel - by default it
> > > > enables IR while disabling DMAR).
> > > 
> > > This seems like an improvement, but I will note that there are existing
> > > locked memory accounting issues inherent with VT-d and vfio. With
> > > VT-d, each device has a unique AddressSpace. This requires that each
> > > is managed via a separate vfio container. Each container is accounted
> > > for separately for locked pages. libvirt currently only knows that if
> > > any vfio devices are attached, the locked memory limit for the
> > > process needs to be set sufficient for the VM memory. When VT-d is
> > > involved, we either need to figure out how to associate otherwise
> > > independent vfio containers to share locked page accounting, or teach
> > > libvirt that the locked memory requirement needs to be multiplied by
> > > the number of attached vfio devices. The latter seems far less
> > > complicated, but it reduces the containment of QEMU a bit since the
> > > process has the ability to lock potentially many multiples of the VM
> > > address size. Thanks,
> > 
> > Yes, this patch just tried to move VT-d forward a bit, rather than do
> > it once and for all. I think we can do better than this in the future,
> > for example, one address space per guest IOMMU domain (as you have
> > mentioned before). However I suppose that will need more work (and I
> > still can't estimate the amount of work). So I am considering enabling
> > the device assignments functionally first, then we can further
> > improve based on a workable version. The same thoughts apply to the
> > IOMMU replay RFC series.
> 
> I'm not arguing against it, I'm just trying to set expectations for
> where this gets us. An AddressSpace per guest IOMMU domain seems like
> the right model for QEMU, but it has some fundamental issues with
> vfio. We currently tie a QEMU AddressSpace to a vfio container, which
> represents the host IOMMU context. The AddressSpace of a device is
> currently assumed to be fixed in QEMU,

Actually, I think we can work around this: you could set up a separate
AddressSpace for each device which consists of nothing but a big alias
into an AddressSpace associated with the current IOMMU domain. As the
device is moved between domains you remove/replace the alias region -
or even replace it with an alias direct into system memory when the
IOMMU is disabled.
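Purely to illustrate that alias idea - a rough sketch against QEMU's
existing MemoryRegion API only; the DeviceDomainAS structure and
device_as_switch_root() below are made-up names for this sketch, not
anything in the current intel_iommu code:

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Hypothetical per-device state for this sketch; the real intel_iommu
 * structures are laid out differently.  das->root is a pure container
 * MemoryRegion; the device's AddressSpace is initialized once with
 * address_space_init(&das->as, &das->root, name). */
typedef struct DeviceDomainAS {
    Object *owner;          /* e.g. the IOMMU device */
    MemoryRegion root;      /* container MR, root of the device's AS */
    MemoryRegion *alias;    /* single big alias into the current root */
    AddressSpace as;
} DeviceDomainAS;

/* Re-point the device's AddressSpace at a new root: either the
 * MemoryRegion of the guest IOMMU domain it was just attached to, or
 * get_system_memory() when DMAR is disabled (plain GPA->HPA). */
static void device_as_switch_root(DeviceDomainAS *das, MemoryRegion *new_root)
{
    MemoryRegion *alias = g_new0(MemoryRegion, 1);

    memory_region_transaction_begin();
    if (das->alias) {
        /* Drop the old alias (object lifetime details elided here). */
        memory_region_del_subregion(&das->root, das->alias);
        object_unparent(OBJECT(das->alias));
    }
    memory_region_init_alias(alias, das->owner, "domain-alias",
                             new_root, 0, memory_region_size(new_root));
    memory_region_add_subregion(&das->root, 0, alias);
    das->alias = alias;
    memory_region_transaction_commit();
}

Wrapping the swap in a memory region transaction means the vfio memory
listener sees one region_del/region_add pair per switch, which is what
would drive the corresponding unmap/map on the host side.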
> guest IOMMU domains clearly
> are not. vfio only lets us have access to a device while it's
> protected within a container. Therefore, in order to move a device to a
> different AddressSpace based on the guest domain configuration, we'd
> need to tear down the vfio configuration, including releasing the
> device.
> 
> > Regarding the locked memory accounting issue: do we have an existing
> > way to do the accounting? If so, would you (or anyone) please
> > elaborate a bit? If not, is that ongoing/planned work?
> 
> As I describe above, there's a vfio container per AddressSpace; each
> container is an IOMMU domain in the host. In the guest, an IOMMU
> domain can include multiple AddressSpaces, one for each context entry
> that's part of the domain. When the guest programs a translation for
> an IOMMU domain, that maps a guest IOVA to a guest physical address,
> for each AddressSpace. Each AddressSpace is backed by a vfio
> container, which needs to pin the pages of that translation in order to
> get a host physical address, which then gets programmed into the host
> IOMMU domain with the guest IOVA and host physical address. The
> pinning process is where page accounting is done.

Ah.. and I take it the accounting isn't smart enough to tell that the
same page is already pinned elsewhere. I guess that would take rather
a lot of extra bookkeeping.
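To make that concrete, here is roughly what each container does from
userspace with the type1 interface (a minimal sketch; container_fd is
assumed to be an already-configured VFIO_TYPE1_IOMMU container, one per
device AddressSpace in the VT-d case):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map (and thereby pin) a chunk of guest RAM into one container.  Every
 * container that maps the same guest RAM pins those pages again, and the
 * kernel charges them again against the process's locked-memory limit. */
static int map_guest_ram(int container_fd, void *guest_ram,
                         uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)guest_ram,
        .iova  = iova,
        .size  = size,
    };

    /* Pins [vaddr, vaddr + size) and accounts it for this container only. */
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

So with, say, 8 GiB of guest RAM and three assigned devices behind a
disabled (or passthrough) guest VT-d, each of the three containers pins
all of guest RAM and roughly 24 GiB ends up charged against the
locked-memory limit, while libvirt sizes the limit for a single copy.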
> It's done per vfio
> context. The worst case scenario for accounting is thus when VT-d is
> present but disabled (or in passthrough mode), as each AddressSpace
> duplicates address_space_memory and every page of guest memory is
> pinned and accounted for each vfio container.

Hmm. I imagine you'll need a copy of the current translation tables
for a guest domain regardless of VFIO involvement. So, when a domain
is unused - i.e. has no devices in it - won't the container have all
the groups detached and so give up all the memory? Obviously when a
device is assigned to the domain you'll need to replay the current
mappings into VFIO.

> That's the existing way we do accounting. There is no current
> development that I'm aware of to change this. As above, the simplest
> stop-gap solution is that libvirt would need to be aware when VT-d is
> present for a VM and use a different algorithm to set the QEMU locked
> memory limit, but it's not without its downsides. Alternatively, a new
> IOMMU model would need to be developed for vfio. The type1 model was
> only ever intended to be used for relatively static user mappings, and I
> expect it to have horrendous performance when backing a dynamic guest
> IOMMU domain. Really the only guest IOMMU usage model that makes any
> sort of sense with type1 is to run the guest with passthrough (iommu=pt)
> and only pull devices out of passthrough for relatively static mapping
> cases within the guest userspace (nested assigned devices or dpdk). If
> the expectation is that we just need this one little bit more code to
> make vfio usable in the guest, that may be true, but it really is just
> barely usable. It's not going to be fast for any sort of dynamic
> mapping, and it's going to have accounting issues that are not
> compatible with how libvirt sets locked memory limits for QEMU as soon
> as you go beyond a single device. Thanks,

Maybe we should revisit the idea of a "type2" IOMMU which could handle
both guest VT-d and guest PAPR TCEs. I'm not excessively fond of the
pre-registration model that PAPR uses at the moment, but it might be
the best available way to deal with the accounting issue.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson