On Mon, Oct 11, 2021 at 03:49:14PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> 
> > > This means we cannot define an input that has a magic HW specific
> > > value.
> > 
> > I'm not entirely sure what you mean by that.
> 
> I mean if you make a general property 'foo' that userspace must
> specify correctly then your API isn't general anymore. Userspace must
> know if it is A or B HW to set foo=A or foo=B.

I absolutely agree.  Which is exactly why I'm advocating that
userspace should request from the kernel what it needs (providing a
*minimum* of information) and the kernel satisfies that (filling in
the missing information as suitable for the platform) or outright
fails.

I think that is more robust across multiple platforms and usecases
than advertising a bunch of capabilities and forcing userspace to
interpret those to work out what it can do.

> Supported IOVA ranges are easially like that as every IOMMU is
> different. So DPDK shouldn't provide such specific or binding
> information.

Absolutely, DPDK should not provide that.  qemu *should* provide that,
because the specific IOVAs matter to the guest.  That will inevitably
mean that the request is more likely to fail, but that's a fundamental
tradeoff.

> > No, I don't think that needs to be a condition.  I think it's
> > perfectly reasonable for a constraint to be given, and for the host
> > IOMMU to just say "no, I can't do that".  But that does mean that each
> > of these values has to have an explicit way of userspace specifying "I
> > don't care", so that the kernel will select a suitable value for those
> > instead - that's what DPDK or other userspace would use nearly all the
> > time.
> 
> My feeling is that qemu should be dealing with the host != target
> case, not the kernel.
> 
> The kernel's job should be to expose the IOMMU HW it has, with all
> features accessible, to userspace.

See... to me this is contrary to the point we agreed on above.

> Qemu's job should be to have a userspace driver for each kernel IOMMU
> and the internal infrastructure to make accelerated emulations for all
> supported target IOMMUs.

This seems the wrong way around to me.  I see qemu as providing logic
to emulate each target IOMMU.  Where that matches the host, there's
the potential for an accelerated implementation, but it makes life a
lot easier if we can at least have a fallback that will work on any
sufficiently capable host IOMMU.

> In other words, it is not the kernel's job to provide target IOMMU
> emulation.

Absolutely not.  But it *is* the kernel's job to let qemu do as mach
as it can with the *host* IOMMU.

> The kernel should provide truely generic "works everywhere" interface
> that qemu/etc can rely on to implement the least accelerated emulation
> path.

Right... seems like we're agreeing again.

> So when I see proposals to have "generic" interfaces that actually
> require very HW specific setup, and cannot be used by a generic qemu
> userpace driver, I think it breaks this model. If qemu needs to know
> it is on PPC (as it does today with VFIO's PPC specific API) then it
> may as well speak PPC specific language and forget about pretending to
> be generic.

Absolutely, the current situation is a mess.

> This approach is grounded in 15 years of trying to build these
> user/kernel split HW subsystems (particularly RDMA) where it has
> become painfully obvious that the kernel is the worst place to try and
> wrangle really divergent HW into a "common" uAPI.
> 
> This is because the kernel/user boundary is fixed. Introducing
> anything generic here requires a lot of time, thought, arguing and
> risk. Usually it ends up being done wrong (like the PPC specific
> ioctls, for instance)

Those are certainly wrong, but they came about explicitly by *not*
being generic rather than by being too generic.  So I'm really
confused aso to what you're arguing for / against.

> and when this happens we can't learn and adapt,
> we are stuck with stable uABI forever.
> 
> Exposing a device's native programming interface is much simpler. Each
> device is fixed, defined and someone can sit down and figure out how
> to expose it. Then that is it, it doesn't need revisiting, it doesn't
> need harmonizing with a future slightly different device, it just
> stays as is.

I can certainly see the case for that approach.  That seems utterly at
odds with what /dev/iommu is trying to do, though.

> The cost, is that there must be a userspace driver component for each
> HW piece - which we are already paying here!
> 
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole.  When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
> 
> For instance, we are going to see HW with nested page tables, user
> space owned page tables and even kernel-bypass fast IOTLB
> invalidation.

> In that world does it even make sense for qmeu to use slow DMA_MAP
> ioctls for emulation?

Probably not what you want ideally, but it's a really useful fallback
case to have.

> A userspace framework in qemu can make these optimizations and is
> also necessarily HW specific as the host page table is HW specific..
> 
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson