On Mon, Oct 11, 2021 at 03:49:14PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
>
> > > This means we cannot define an input that has a magic HW specific
> > > value.
> >
> > I'm not entirely sure what you mean by that.
>
> I mean if you make a general property 'foo' that userspace must
> specify correctly then your API isn't general anymore. Userspace must
> know if it is A or B HW to set foo=A or foo=B.

I absolutely agree.  Which is exactly why I'm advocating that
userspace should request from the kernel what it needs (providing a
*minimum* of information) and the kernel satisfies that (filling in
the missing information as suitable for the platform) or outright
fails.

I think that is more robust across multiple platforms and usecases
than advertising a bunch of capabilities and forcing userspace to
interpret those to work out what it can do.

> Supported IOVA ranges are easily like that as every IOMMU is
> different. So DPDK shouldn't provide such specific or binding
> information.

Absolutely, DPDK should not provide that.  qemu *should* provide that,
because the specific IOVAs matter to the guest.  That will inevitably
mean that the request is more likely to fail, but that's a fundamental
tradeoff.

> > No, I don't think that needs to be a condition. I think it's
> > perfectly reasonable for a constraint to be given, and for the host
> > IOMMU to just say "no, I can't do that". But that does mean that each
> > of these values has to have an explicit way of userspace specifying "I
> > don't care", so that the kernel will select a suitable value for those
> > instead - that's what DPDK or other userspace would use nearly all the
> > time.
>
> My feeling is that qemu should be dealing with the host != target
> case, not the kernel.
>
> The kernel's job should be to expose the IOMMU HW it has, with all
> features accessible, to userspace.

See... to me this is contrary to the point we agreed on above.

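As a rough illustration of the "request with constraints, or say I
don't care" model described above, here is a minimal sketch in plain
C.  Everything in it is hypothetical: `IOVA_DONT_CARE`, `struct
host_iommu`, `struct window_req` and `request_window()` are names
invented for this example and are not part of VFIO, /dev/iommu or any
proposed uAPI.  It also assumes a host with one contiguous translated
range and a single reserved IO hole.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel: userspace has no preference for this value. */
#define IOVA_DONT_CARE UINT64_MAX

/* Hypothetical model of what the host IOMMU can translate. */
struct host_iommu {
    uint64_t lo, hi;            /* usable IOVA range, inclusive */
    uint64_t hole_lo, hole_hi;  /* reserved IO hole, inclusive */
};

/* Hypothetical request: a required size plus an optional fixed base. */
struct window_req {
    uint64_t base;  /* IOVA_DONT_CARE, or the exact IOVA the guest needs */
    uint64_t size;  /* in bytes, must be non-zero */
};

/* True if [base, base + size - 1] is translatable and avoids the hole. */
static bool window_fits(const struct host_iommu *h,
                        uint64_t base, uint64_t size)
{
    if (size == 0 || base < h->lo || base > h->hi)
        return false;
    if (size - 1 > h->hi - base)    /* overflow-safe upper bound check */
        return false;
    uint64_t end = base + size - 1;
    return end < h->hole_lo || base > h->hole_hi;
}

/*
 * Satisfy the request or fail outright.  A fixed base is honoured
 * exactly or rejected; a "don't care" base is chosen by the host side.
 * Returns 0 and stores the chosen base in *out_base, or -1 on failure.
 */
static int request_window(const struct host_iommu *h,
                          const struct window_req *req,
                          uint64_t *out_base)
{
    if (req->base != IOVA_DONT_CARE) {
        if (!window_fits(h, req->base, req->size))
            return -1;              /* "no, I can't do that" */
        *out_base = req->base;
        return 0;
    }

    /* No preference: try the bottom of the range, then above the hole. */
    if (window_fits(h, h->lo, req->size)) {
        *out_base = h->lo;
        return 0;
    }
    if (h->hole_hi < h->hi && window_fits(h, h->hole_hi + 1, req->size)) {
        *out_base = h->hole_hi + 1;
        return 0;
    }
    return -1;
}
```

In this sketch a qemu-like caller would pass the guest-visible IOVA as
`base` and accept that the call may fail, while a DPDK-like caller
would pass `IOVA_DONT_CARE` and use whatever base comes back.
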
> Qemu's job should be to have a userspace driver for each kernel IOMMU
> and the internal infrastructure to make accelerated emulations for all
> supported target IOMMUs.

This seems the wrong way around to me.  I see qemu as providing logic
to emulate each target IOMMU.  Where that matches the host, there's
the potential for an accelerated implementation, but it makes life a
lot easier if we can at least have a fallback that will work on any
sufficiently capable host IOMMU.

> In other words, it is not the kernel's job to provide target IOMMU
> emulation.

Absolutely not.  But it *is* the kernel's job to let qemu do as much
as it can with the *host* IOMMU.

> The kernel should provide truly generic "works everywhere" interface
> that qemu/etc can rely on to implement the least accelerated emulation
> path.

Right... seems like we're agreeing again.

> So when I see proposals to have "generic" interfaces that actually
> require very HW specific setup, and cannot be used by a generic qemu
> userspace driver, I think it breaks this model. If qemu needs to know
> it is on PPC (as it does today with VFIO's PPC specific API) then it
> may as well speak PPC specific language and forget about pretending to
> be generic.

Absolutely, the current situation is a mess.

> This approach is grounded in 15 years of trying to build these
> user/kernel split HW subsystems (particularly RDMA) where it has
> become painfully obvious that the kernel is the worst place to try and
> wrangle really divergent HW into a "common" uAPI.
>
> This is because the kernel/user boundary is fixed. Introducing
> anything generic here requires a lot of time, thought, arguing and
> risk. Usually it ends up being done wrong (like the PPC specific
> ioctls, for instance)

Those are certainly wrong, but they came about explicitly by *not*
being generic rather than by being too generic.  So I'm really
confused as to what you're arguing for / against.

> and when this happens we can't learn and adapt,
> we are stuck with stable uABI forever.
>
> Exposing a device's native programming interface is much simpler. Each
> device is fixed, defined and someone can sit down and figure out how
> to expose it. Then that is it, it doesn't need revisiting, it doesn't
> need harmonizing with a future slightly different device, it just
> stays as is.

I can certainly see the case for that approach.  That seems utterly at
odds with what /dev/iommu is trying to do, though.

> The cost is that there must be a userspace driver component for each
> HW piece - which we are already paying here!
>
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole. When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
>
> For instance, we are going to see HW with nested page tables, user
> space owned page tables and even kernel-bypass fast IOTLB
> invalidation.
> In that world does it even make sense for qemu to use slow DMA_MAP
> ioctls for emulation?

Probably not what you want ideally, but it's a really useful fallback
case to have.

> A userspace framework in qemu can make these optimizations and is
> also necessarily HW specific as the host page table is HW specific.
>
> Jason
>

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson