From: "Tian, Kevin" <kevin.tian@intel.com>
To: "Liu, Yi L" <yi.l.liu@intel.com>, Jason Gunthorpe <jgg@nvidia.com>
Cc: "Peng, Chao P" <chao.p.peng@intel.com>,
	"Sun, Yi Y" <yi.y.sun@intel.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"david@gibson.dropbear.id.au" <david@gibson.dropbear.id.au>,
	"thuth@redhat.com" <thuth@redhat.com>,
	"farman@linux.ibm.com" <farman@linux.ibm.com>,
	"mjrosato@linux.ibm.com" <mjrosato@linux.ibm.com>,
	"akrowiak@linux.ibm.com" <akrowiak@linux.ibm.com>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
	"jjherne@linux.ibm.com" <jjherne@linux.ibm.com>,
	"jasowang@redhat.com" <jasowang@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"nicolinc@nvidia.com" <nicolinc@nvidia.com>,
	"eric.auger@redhat.com" <eric.auger@redhat.com>,
	"eric.auger.pro@gmail.com" <eric.auger.pro@gmail.com>,
	"Peng, Chao P" <chao.p.peng@intel.com>,
	"Sun, Yi Y" <yi.y.sun@intel.com>,
	"peterx@redhat.com" <peterx@redhat.com>,
	"Alex Williamson (alex.williamson@redhat.com)" 
	<alex.williamson@redhat.com>
Subject: RE: [RFC 15/18] vfio/iommufd: Implement iommufd backend
Date: Tue, 26 Apr 2022 10:41:01 +0000	[thread overview]
Message-ID: <BN9PR11MB5276AD0B0DAA59A44ED705618CFB9@BN9PR11MB5276.namprd11.prod.outlook.com> (raw)
In-Reply-To: <3576770b-e4c2-cf11-da0c-821c55ab9902@intel.com>

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, April 26, 2022 5:55 PM
> On 2022/4/22 22:58, Jason Gunthorpe wrote:
> > On Thu, Apr 14, 2022 at 03:47:07AM -0700, Yi Liu wrote:
> >
> >> +
> >> +    /* try to attach to an existing container in this space */
> >> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
> >> +        if (!object_dynamic_cast(OBJECT(bcontainer),
> >> +                                 TYPE_VFIO_IOMMUFD_CONTAINER)) {
> >> +            continue;
> >> +        }
> >> +        container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
> >> +        if (vfio_device_attach_container(vbasedev, container, &err)) {
> >> +            const char *msg = error_get_pretty(err);
> >> +
> >> +            trace_vfio_iommufd_fail_attach_existing_container(msg);
> >> +            error_free(err);
> >> +            err = NULL;
> >> +        } else {
> >> +            ret = vfio_ram_block_discard_disable(true);
> >> +            if (ret) {
> >> +                vfio_device_detach_container(vbasedev, container, &err);
> >> +                error_propagate(errp, err);
> >> +                vfio_put_address_space(space);
> >> +                close(vbasedev->fd);
> >> +                error_prepend(errp,
> >> +                              "Cannot set discarding of RAM broken (%d)", ret);
> >> +                return ret;
> >> +            }
> >> +            goto out;
> >> +        }
> >> +    }
> >
> > ?? this logic shouldn't be necessary, a single ioas always supports
> > all devices, userspace should never need to juggle multiple ioas's
> > unless it wants to have different address maps.
> 
> Legacy vfio needs to allocate multiple containers in some cases.
> Say a device is attached to a container and some IOVAs have been
> mapped in it. Attaching another device to that container can then
> fail if the already-mapped DMA ranges conflict with the new device's
> reserved IOVA ranges. In that case legacy vfio creates a new
> container and attaches the group to it instead. Hotplug is a typical
> scenario where this happens.
> 

Alex provided a clear rationale when we chatted with him about the same
topic. I've simply copied it here rather than trying to paraphrase.
(Alex, please chime in if you want to add anything. 😊)

Q:
Why does the existing VFIOAddressSpace have a list of VFIOContainers?
Is it because one device uses type1 and another uses no_iommu?

A:
That's one case of incompatibility, but the IOMMU attach-group callback
can fail in a variety of ways.  One that we've seen, and that is not
uncommon, is an mdev container with various mappings to other devices.
None of those mappings are validated until the mdev driver tries to pin
something, and it's generally unlikely that it would pin those
particular mappings.  Then QEMU hot-adds a regular IOMMU-backed device,
we allocate a domain for the device and replay the mappings from the
container, but now they do get validated and potentially fail.  The
kernel returns a failure for the SET_IOMMU ioctl, QEMU creates a new
container and fills it from the same AddressSpace, and now QEMU can
determine which mappings can be safely skipped.
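
To make that flow concrete, here is a minimal C sketch of the fallback
Alex describes.  Helper and field names are approximations, not the
actual QEMU code; the real logic is vfio_connect_container() in QEMU's
hw/vfio/common.c.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Sketch only: QEMU types (VFIOAddressSpace, VFIOContainer) and glib
 * helpers (g_new0, QLIST_*) are assumed; fields are simplified.
 */
static int sketch_connect_group(int group_fd, VFIOAddressSpace *space)
{
    VFIOContainer *container;

    /*
     * Try every container already in this AddressSpace.  A failing
     * VFIO_GROUP_SET_CONTAINER here (e.g. the kernel-side replay hit
     * a reserved-region conflict) is non-fatal: keep looking.
     */
    QLIST_FOREACH(container, &space->containers, next) {
        if (!ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
            return 0;
        }
    }

    /*
     * No compatible container: open a fresh one.  Replaying the same
     * AddressSpace into it lets QEMU decide which mappings can be
     * safely skipped this time around.
     */
    container = g_new0(VFIOContainer, 1);
    container->fd = open("/dev/vfio/vfio", O_RDWR);
    if (container->fd < 0) {
        g_free(container);
        return -errno;
    }
    if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd) ||
        ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU)) {
        int ret = -errno;
        close(container->fd);
        g_free(container);
        return ret;
    }
    QLIST_INSERT_HEAD(&space->containers, container, next);
    /* ...then register the MemoryListener, which replays the mappings. */
    return 0;
}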

Q:
I didn't get why some mappings are valid for one device while they can
be skipped for another device under the same address space.  Can you
elaborate a bit?  If the skipped mappings are redundant and won't be
used for DMA, why does userspace request them in the first place?  I'm
a bit lost here...

A:
QEMU sets up a MemoryListener for the device AddressSpace and attempts
to map anything that triggers that listener, which includes not only VM
RAM (our primary mapping goal) but also miscellaneous devices,
unaligned regions, and other device regions, e.g. BARs.  Some of these
we filter out in QEMU with broad generalizations, such as that
unaligned ranges aren't anything we can deal with, but "other device
regions" covers anything that's mmap'd in QEMU, i.e. anything with an
associated KVM memory slot.  IIRC, in the case I'm thinking of, the
mapping that triggered the replay failure was the BAR of an mdev
device.  While only the mdev device was present, no attempt was made to
use gup or PFNMAP to resolve the mapping, since the mdev host driver
didn't try to pin pages within its own BAR; at replay time, neither of
these methods worked (I don't recall further specifics).

QEMU always attempts to create p2p mappings for devices, but this is a
case where we don't halt the VM if such a mapping cannot be created, so
a new container would replay the AddressSpace, see the fault, and skip
the region.
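
A rough sketch of that listener policy, in C.  Names are approximate
and 'container', vfio_dma_map() and hw_error() are assumed from the
surrounding QEMU code; the real function is vfio_listener_region_add()
in hw/vfio/common.c.

/*
 * Sketch of the region_add policy described above: filter out
 * unaligned ranges, and treat failures on device (p2p) regions as
 * non-fatal while failures on VM RAM remain fatal.
 */
static void sketch_listener_region_add(MemoryListener *listener,
                                       MemoryRegionSection *section)
{
    hwaddr iova = section->offset_within_address_space;
    uint64_t size = int128_get64(section->size);
    void *vaddr;

    /* Broad generalization: skip ranges we can't align to host pages. */
    if (iova & ~qemu_real_host_page_mask) {
        return;
    }

    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region;

    if (vfio_dma_map(container, iova, size, vaddr, false)) {
        /*
         * A p2p mapping of another device's BAR may legitimately fail
         * (no gup/PFNMAP path to resolve it): don't halt the VM, just
         * skip the region.
         */
        if (memory_region_is_ram_device(section->mr)) {
            return;
        }
        hw_error("vfio: failed to map VM RAM, cannot continue");
    }
}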

Q:
If there is a conflict between the reserved regions of a newly plugged
device and the existing mappings of a VFIOAddressSpace, shouldn't the
device simply be rejected from attaching to the address space, instead
of creating another container under that address space?

A:
From a kernel perspective, yes, and that's what we do.  That doesn't
preclude the user from instantiating a new container and determining
for themselves whether the reserved region conflict is critical.  Note
that just because containers are in the same AddressSpace doesn't mean
that there aren't rules that allow certain mapping failures to be
non-fatal.
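
On the kernel side, the rejection amounts to roughly the following
sketch; the actual check is vfio_iommu_resv_conflict() in
drivers/vfio/vfio_iommu_type1.c, and details differ.

/*
 * Sketch: attaching a group to an existing container fails when any
 * of the group's reserved regions overlaps an existing DMA mapping.
 */
static bool resv_conflicts_with_mappings(struct vfio_iommu *iommu,
                                         struct list_head *resv_regions)
{
    struct iommu_resv_region *region;

    list_for_each_entry(region, resv_regions, list) {
        if (vfio_find_dma(iommu, region->start, region->length)) {
            /*
             * The attach ioctl fails; userspace may retry with a
             * brand-new container and decide for itself whether the
             * conflict matters.
             */
            return true;
        }
    }
    return false;
}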

Thanks
Kevin
