From: "Tian, Kevin" <kevin.tian@intel.com>
To: "Liu, Yi L" <yi.l.liu@intel.com>, Jason Gunthorpe <jgg@nvidia.com>
Cc: "Peng, Chao P" <chao.p.peng@intel.com>,
	"Sun, Yi Y" <yi.y.sun@intel.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"david@gibson.dropbear.id.au" <david@gibson.dropbear.id.au>,
	"thuth@redhat.com" <thuth@redhat.com>,
	"farman@linux.ibm.com" <farman@linux.ibm.com>,
	"mjrosato@linux.ibm.com" <mjrosato@linux.ibm.com>,
	"akrowiak@linux.ibm.com" <akrowiak@linux.ibm.com>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
	"jjherne@linux.ibm.com" <jjherne@linux.ibm.com>,
	"jasowang@redhat.com" <jasowang@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"nicolinc@nvidia.com" <nicolinc@nvidia.com>,
	"eric.auger@redhat.com" <eric.auger@redhat.com>,
	"eric.auger.pro@gmail.com" <eric.auger.pro@gmail.com>,
	"Peng, Chao P" <chao.p.peng@intel.com>,
	"Sun, Yi Y" <yi.y.sun@intel.com>,
	"peterx@redhat.com" <peterx@redhat.com>,
	"Alex Williamson (alex.williamson@redhat.com)" 
	<alex.williamson@redhat.com>
Subject: RE: [RFC 15/18] vfio/iommufd: Implement iommufd backend
Date: Tue, 26 Apr 2022 10:41:01 +0000	[thread overview]
Message-ID: <BN9PR11MB5276AD0B0DAA59A44ED705618CFB9@BN9PR11MB5276.namprd11.prod.outlook.com> (raw)
In-Reply-To: <3576770b-e4c2-cf11-da0c-821c55ab9902@intel.com>

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, April 26, 2022 5:55 PM
> On 2022/4/22 22:58, Jason Gunthorpe wrote:
> > On Thu, Apr 14, 2022 at 03:47:07AM -0700, Yi Liu wrote:
> >
> >> +
> >> +    /* try to attach to an existing container in this space */
> >> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
> >> +        if (!object_dynamic_cast(OBJECT(bcontainer),
> >> +                                 TYPE_VFIO_IOMMUFD_CONTAINER)) {
> >> +            continue;
> >> +        }
> >> +        container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
> >> +        if (vfio_device_attach_container(vbasedev, container, &err)) {
> >> +            const char *msg = error_get_pretty(err);
> >> +
> >> +            trace_vfio_iommufd_fail_attach_existing_container(msg);
> >> +            error_free(err);
> >> +            err = NULL;
> >> +        } else {
> >> +            ret = vfio_ram_block_discard_disable(true);
> >> +            if (ret) {
> >> +                vfio_device_detach_container(vbasedev, container, &err);
> >> +                error_propagate(errp, err);
> >> +                vfio_put_address_space(space);
> >> +                close(vbasedev->fd);
> >> +                error_prepend(errp,
> >> +                              "Cannot set discarding of RAM broken (%d)", ret);
> >> +                return ret;
> >> +            }
> >> +            goto out;
> >> +        }
> >> +    }
> >
> > ?? this logic shouldn't be necessary, a single ioas always supports
> > all devices, userspace should never need to juggle multiple ioas's
> > unless it wants to have different address maps.
> 
> Legacy vfio needs to allocate multiple containers in some cases.
> Say a device is attached to a container and some IOVAs have been
> mapped in it. Attaching another device to that container can then
> fail if the already-mapped DMA ranges conflict with the new device's
> reserved IOVA ranges. In that case legacy vfio creates a new
> container and attaches the group to it instead. Hotplug is a typical
> scenario where this happens.
> 

Alex provided a clear rationale when we chatted with him about the same
topic. I've simply copied it here rather than trying to paraphrase.
(Alex, please chime in if you want to add anything. 😊)

Q:
Why does the existing VFIOAddressSpace have a list of VFIOContainers?
Is it because one device uses type1 and another uses no_iommu?

A:
That's one case of incompatibility, but the IOMMU attach-group callback
can fail in a variety of ways.  One that we've seen, and that is not
uncommon, is an mdev container with various mappings to other devices.
None of those mappings are validated until the mdev driver tries to pin
something, and it's generally unlikely that it would pin those
particular mappings.  Then QEMU hot-adds a regular IOMMU-backed device,
we allocate a domain for the device and replay the mappings from the
container, but now they do get validated and potentially fail.  The
kernel returns a failure for the SET_IOMMU ioctl, QEMU creates a new
container and fills it from the same AddressSpace, and now QEMU can
determine which mappings can be safely skipped.
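
To make that flow concrete, here is a minimal C sketch of the fallback
Alex describes.  Helper and field names are approximations, not the
actual QEMU code; the real logic is vfio_connect_container() in QEMU's
hw/vfio/common.c.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Sketch only: QEMU types (VFIOAddressSpace, VFIOContainer) and glib
 * helpers (g_new0, QLIST_*) are assumed; fields are simplified.
 */
static int sketch_connect_group(int group_fd, VFIOAddressSpace *space)
{
    VFIOContainer *container;

    /*
     * Try every container already in this AddressSpace.  A failing
     * VFIO_GROUP_SET_CONTAINER here (e.g. the kernel-side replay hit
     * a reserved-region conflict) is non-fatal: keep looking.
     */
    QLIST_FOREACH(container, &space->containers, next) {
        if (!ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
            return 0;
        }
    }

    /*
     * No compatible container: open a fresh one.  Replaying the same
     * AddressSpace into it lets QEMU decide which mappings can be
     * safely skipped this time around.
     */
    container = g_new0(VFIOContainer, 1);
    container->fd = open("/dev/vfio/vfio", O_RDWR);
    if (container->fd < 0) {
        g_free(container);
        return -errno;
    }
    if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd) ||
        ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU)) {
        int ret = -errno;
        close(container->fd);
        g_free(container);
        return ret;
    }
    QLIST_INSERT_HEAD(&space->containers, container, next);
    /* ...then register the MemoryListener, which replays the mappings. */
    return 0;
}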

Q:
I didn't get why some mappings are valid for one device while they can
be skipped for another device under the same address space.  Can you
elaborate a bit?  If the skipped mappings are redundant and won't be
used for DMA, why does userspace request them in the first place?  I'm
a bit lost here...

A:
QEMU sets up a MemoryListener for the device AddressSpace and attempts
to map anything that triggers that listener, which includes not only VM
RAM (our primary mapping goal) but also miscellaneous devices,
unaligned regions, and other device regions, e.g. BARs.  Some of these
we filter out in QEMU with broad generalizations, such as that
unaligned ranges aren't anything we can deal with, but "other device
regions" covers anything that's mmap'd in QEMU, i.e. anything with an
associated KVM memory slot.  IIRC, in the case I'm thinking of, the
mapping that triggered the replay failure was the BAR of an mdev
device.  While only the mdev device was present, no attempt was made to
use gup or PFNMAP to resolve the mapping, since the mdev host driver
didn't try to pin pages within its own BAR; at replay time, neither of
these methods worked (I don't recall further specifics).

QEMU always attempts to create p2p mappings for devices, but this is a
case where we don't halt the VM if such a mapping cannot be created, so
a new container would replay the AddressSpace, see the fault, and skip
the region.
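
A rough sketch of that listener policy, in C.  Names are approximate
and 'container', vfio_dma_map() and hw_error() are assumed from the
surrounding QEMU code; the real function is vfio_listener_region_add()
in hw/vfio/common.c.

/*
 * Sketch of the region_add policy described above: filter out
 * unaligned ranges, and treat failures on device (p2p) regions as
 * non-fatal while failures on VM RAM remain fatal.
 */
static void sketch_listener_region_add(MemoryListener *listener,
                                       MemoryRegionSection *section)
{
    hwaddr iova = section->offset_within_address_space;
    uint64_t size = int128_get64(section->size);
    void *vaddr;

    /* Broad generalization: skip ranges we can't align to host pages. */
    if (iova & ~qemu_real_host_page_mask) {
        return;
    }

    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region;

    if (vfio_dma_map(container, iova, size, vaddr, false)) {
        /*
         * A p2p mapping of another device's BAR may legitimately fail
         * (no gup/PFNMAP path to resolve it): don't halt the VM, just
         * skip the region.
         */
        if (memory_region_is_ram_device(section->mr)) {
            return;
        }
        hw_error("vfio: failed to map VM RAM, cannot continue");
    }
}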

Q:
If there is a conflict between the reserved regions of a newly plugged
device and the existing mappings of a VFIOAddressSpace, shouldn't the
device simply be rejected from attaching to the address space, instead
of creating another container under that address space?

A:
From a kernel perspective, yes, and that's what we do.  That doesn't
preclude the user from instantiating a new container and determining
for themselves whether the reserved region conflict is critical.  Note
that just because containers are in the same AddressSpace doesn't mean
that there aren't rules that allow certain mapping failures to be
non-fatal.
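
On the kernel side, the rejection amounts to roughly the following
sketch; the actual check is vfio_iommu_resv_conflict() in
drivers/vfio/vfio_iommu_type1.c, and details differ.

/*
 * Sketch: attaching a group to an existing container fails when any
 * of the group's reserved regions overlaps an existing DMA mapping.
 */
static bool resv_conflicts_with_mappings(struct vfio_iommu *iommu,
                                         struct list_head *resv_regions)
{
    struct iommu_resv_region *region;

    list_for_each_entry(region, resv_regions, list) {
        if (vfio_find_dma(iommu, region->start, region->length)) {
            /*
             * The attach ioctl fails; userspace may retry with a
             * brand-new container and decide for itself whether the
             * conflict matters.
             */
            return true;
        }
    }
    return false;
}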

Thanks
Kevin
