On Wed, Apr 28, 2021 at 11:56:22AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 28, 2021 at 10:58:29AM +1000, David Gibson wrote:
> > On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > > > > Starting from a BDF the general pseudo code is:
> > > > >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> > > > >  device_fd = open("/dev/vfio/"+device_name)
> > > > >  ioasidfd = open("/dev/ioasid")
> > > > >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> > > > 
> > > > This line is the problem.
> > > > 
> > > > [Historical aside: Alex's early drafts for the VFIO interface looked
> > > > quite similar to this.  Ben Herrenschmidt and myself persuaded him it
> > > > was a bad idea, and groups were developed instead.  I still think it's
> > > > a bad idea, and not just for POWER]
> > > 
> > > Spawning the VFIO device FD from the group FD is incredibly gross from
> > > a kernel design perspective. Since that was done the struct
> > > vfio_device missed out on a sysfs presence and doesn't have the
> > > typical 'struct device' member or dedicated char device you'd expect a
> > > FD based subsystem to have.
> > > 
> > > This basically traded normal usage of the driver core for something
> > > that doesn't serve a technical usage. Given we are now nearly 10 years
> > > later and see that real widely deployed applications are not doing
> > > anything special with the group FD it makes me question the wisdom of
> > > this choice.
> > 
> > I'm really not sure what "anything special" would constitute here.
> 
> Well, really anything actually. All I see in, say, dpdk, is open the
> group fd, get a device fd, do the container dance and never touch the
> group fd again or care about groups in any way. It seems typical of
> this class of application.

Well, sure, the only operation you do on the group itself is attach it
to the container (and then every container operation can be thought of
as applying to all its attached groups).  But that attach operation
really is fundamentally about the group.  It always, unavoidably,
fundamentally affects every device in the group - including devices
you may not typically think about, like bridges and switches.

That is *not* true of the other device operations, like poking IO.

> If dpdk is exposing other devices to a risk it certainly hasn't done
> anything to make that obvious.

And in practice I suspect it will just break if you give it a >1
device group.

> > > Okay, that is fair, but let's solve that problem directly. For
> > > instance netlink has been going in the direction of adding a "extack"
> > > from the kernel which is a descriptive error string. If the failing
> > > ioctl returned the string:
> > > 
> > >   "cannot join this device to the IOASID because device XXX in the
> > >    same group #10 is in use"
> > 
> > Um.. is there a sane way to return strings from an ioctl()?
> 
> Yes, it can be done, a string buffer pointer and length in the input
> for instance.

I suppose.  Rare enough that I expect everyone will ignore it, alas :/.

> > Getting the device fds from the group fd kind of follows, because it's
> > unsafe to do basically anything on the device unless you already
> > control the group (which in this case means attaching it to a
> > container/ioasid).  I'm entirely open to ways of doing that that are
> > less inelegant from a sysfs integration point of view, but the point
> > is you must manage the group before you can do anything at all with
> > individual devices.
> 
> I think most likely VFIO is going to be the only thing to manage a
> multi-device group.

You don't get to choose that.  You could explicitly limit other things
to only one-device groups, but that would be an explicit limitation.
Essentially any device can end up in a multi-device group, if you put
it behind a PCIe to PCI bridge, or a PCIe switch which doesn't support
access controls.

The groups are still there, whether or not other things want to deal
with them.

> I see things like VDPA being primarily about PASID, and an IOASID that
> is matched to a PASID is inherently a single device IOMMU group.

I don't know enough about PASID to make sense of that.

> > I don't see why.  I mean, sure, you don't want explicitly the *vfio*
> > group as such.  But IOMMU group is already a cross-subsystem concept
> > and you can explicitly expose that in a different way.
> 
> Yes, and no, the kernel drivers in something like VDPA have decided
> what device and group they are in before we get to IOASID. It is
> illogical to try to retro-actively bolt in a group concept to their
> APIs.

Again, I don't know enough about VDPA to make sense of that.  Are we
essentially talking non-PCI virtual devices here?  In which case you
could define the VDPA "bus" to always have one-device groups.

> > Again, I realy think this is necessary complexity.  You're right that
> > far too little of the userspace properly understands group
> > restrictions.. but these come from real hardware limitations, and I
> > don't feel like making it *less* obvious in the interface is going to
> > help that.
> 
> The appeal of making it less obvious is we can have a single
> simplified API flow so that an application that doesn't understand or
> care about groups can have uniformity.

I don't think simplified-but-wrong is a good goal.  The thing about
groups is that if they're there, you can't just "not care" about them,
they affect you whether you like it or not.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson