* [PATCH RFC v2] vfio: Documentation for the migration region @ 2021-11-29 14:45 Jason Gunthorpe 2021-11-30 17:26 ` Alex Williamson 0 siblings, 1 reply; 29+ messages in thread From: Jason Gunthorpe @ 2021-11-29 14:45 UTC (permalink / raw) To: Alex Williamson, Jonathan Corbet, linux-doc Cc: Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas Provide some more complete documentation for the migration region's behavior, specifically focusing on the device_state bits and the whole system view from a VMM. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> --- Documentation/driver-api/vfio.rst | 277 +++++++++++++++++++++++++++++- 1 file changed, 276 insertions(+), 1 deletion(-) Alex/Cornelia, here is the second draft of the requested documentation I promised. We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism. Our thinking is that NDMA would be implemented like this: +#define VFIO_DEVICE_STATE_NDMA (1 << 3) And a .add_capability ops will be used to signal driver support to userspace: +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA 6 I've described DIRTY TRACKING as a separate concept here. With the current uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START; with our change in direction this would be per-tracker control, but no semantic change. Upon some agreement we'll include this patch in the next iteration of the mlx5 driver along with the NDMA bits. v2: - RST fixups for sphinx rendering - Include the priority order for multi-bit changes - Add a small discussion on devices like hns with migration control inside the same function as is being migrated. 
- Language cleanups from v1, the diff says almost every line was touched in some way v1: https://lore.kernel.org/r/0-v1-0ec87874bede+123-vfio_mig_doc_jgg@nvidia.com diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst index c663b6f978255b..d9be47570f878c 100644 --- a/Documentation/driver-api/vfio.rst +++ b/Documentation/driver-api/vfio.rst @@ -242,7 +242,282 @@ group and can access them as follows:: VFIO User API ------------------------------------------------------------------------------- -Please see include/linux/vfio.h for complete API documentation. +Please see include/uapi/linux/vfio.h for complete API documentation. + +------------------------------------------------------------------------------- + +VFIO migration driver API +------------------------------------------------------------------------------- + +VFIO drivers that support migration implement a migration control register +called device_state in the struct vfio_device_migration_info, which is in the +device's VFIO_REGION_TYPE_MIGRATION region. + +The device_state controls both device actions and continuous behaviors. +Setting/clearing bit groups triggers device actions, and each bit controls a +continuous device behavior. + +Along with the device_state the migration driver provides a data window which +allows streaming migration data into or out of the device. + +A lot of flexibility is provided to user-space in how it operates these +bits. What follows is a reference flow for saving device state in a live +migration, with all features, and an illustration of how the external non-VFIO +entities the VMM controls (VCPU_RUNNING and DIRTY_TRACKING) fit in. 
+ + RUNNING, VCPU_RUNNING + Normal operating state + RUNNING, DIRTY_TRACKING, VCPU_RUNNING + Log DMAs + + Stream all memory + SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING + Log internal device changes (pre-copy) + + Stream device state through the migration window + + While in this state repeat as desired: + + Atomic Read and Clear DMA Dirty log + + Stream dirty memory + SAVING | NDMA | RUNNING, VCPU_RUNNING + vIOMMU grace state + + Complete all in progress IO page faults, idle the vIOMMU + SAVING | NDMA | RUNNING + Peer to Peer DMA grace state + + Final snapshot of DMA dirty log (atomic not required) + SAVING + Stream final device state through the migration window + + Copy final dirty data + 0 + Device is halted + +and the reference flow for resuming: + + RUNNING + Issue VFIO_DEVICE_RESET to clear the internal device state + 0 + Device is halted + RESUMING + Push in migration data. Data captured during pre-copy should be + prepended to data captured during SAVING. + NDMA | RUNNING + Peer to Peer DMA grace state + RUNNING, VCPU_RUNNING + Normal operating state + +If the VMM has multiple VFIO devices undergoing migration then the grace +states act as cross device synchronization points. The VMM must bring all +devices to the grace state before advancing past it. + +The above reference flows are built around specific requirements on the +migration driver for its implementation of the migration_state input. + +The migration_state cannot change asynchronously, upon writing the +migration_state the driver will either keep the current state and return +failure, return failure and go to ERROR, or succeed and go to the new state. + +Event triggered actions happen when user-space requests a new migration_state +that differs from the current migration_state. Actions happen on a bit group +basis: + + SAVING | RUNNING + The device clears the data window and begins streaming 'pre copy' migration + data through the window. 
Devices that cannot log internal state changes + return a 0 length migration stream. + + SAVING | !RUNNING + The device captures its internal state that is not covered by internal + logging, as well as any logged changes. + + The device clears the data window and begins streaming the captured + migration data through the window. Devices that cannot log internal state + changes stream all their device state here. + + RESUMING + The data window is cleared, opened, and can receive the migration data + stream. + + !RESUMING + All the data transferred into the data window is loaded into the device's + internal state. The migration driver can rely on user-space issuing a + VFIO_DEVICE_RESET prior to starting RESUMING. + + To abort a RESUMING issue a VFIO_DEVICE_RESET. + + If the migration data is invalid then the ERROR state must be set. + +Continuous actions are in effect when migration_state bit groups are active: + + RUNNING | NDMA + The device is not allowed to issue new DMA operations. + + Whenever the kernel returns with a migration_state of NDMA there can be no + in progress DMAs. + + !RUNNING + The device should not change its internal state. Further implies the NDMA + behavior above. + + SAVING | !RUNNING + RESUMING | !RUNNING + The device may assume there are no incoming MMIO operations. + + Internal state logging can stop. + + RUNNING + The device can alter its internal state and must respond to incoming MMIO. + + SAVING | RUNNING + The device is logging changes to the internal state. + + ERROR + The behavior of the device is largely undefined. The device must be + recovered by issuing VFIO_DEVICE_RESET or closing the device file + descriptor. + + However, devices supporting NDMA must behave as though NDMA is asserted + during ERROR to avoid corrupting other devices or a VM during a failed + migration. + +When multiple bits change in the migration_state they may describe multiple +event triggered actions, and multiple changes to continuous actions. 
The +migration driver must process them in a priority order: + + - SAVING | RUNNING + - NDMA + - !RUNNING + - SAVING | !RUNNING + - RESUMING + - !RESUMING + - RUNNING + - !NDMA + +In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the +device back to device_state RUNNING. When a migration driver executes this +ioctl it should discard the data window and set migration_state to RUNNING as +part of resetting the device to a clean state. This must happen even if the +migration_state has errored. A freshly opened device FD should always be in +the RUNNING state. + +The migration driver has limitations on what device state it can affect. Any +device state controlled by general kernel subsystems must not be changed +during RESUME, and SAVING must tolerate mutation of this state. Change to +externally controlled device state can happen at any time, asynchronously, to +the migration (ie interrupt rebalancing). + +Some examples of externally controlled state: + - MSI-X interrupt page + - MSI/legacy interrupt configuration + - Large parts of the PCI configuration space, ie common control bits + - PCI power management + - Changes via VFIO_DEVICE_SET_IRQS + +During !RUNNING, especially during SAVING and RESUMING, the device may have +limitations on what it can tolerate. An ideal device will discard/return all +ones to all incoming MMIO, PIO, or equivalent operations (exclusive of the +external state above) in !RUNNING. However, devices are free to have undefined +behavior if they receive incoming operations. This includes +corrupting/aborting the migration, dirtying pages, and segfaulting user-space. + +However, a device may not compromise system integrity if it is subjected to a +MMIO. It cannot trigger an error TLP, it cannot trigger an x86 Machine Check +or similar, and it cannot compromise device isolation. + +There are several edge cases that user-space should keep in mind when +implementing migration: + +- Device Peer to Peer DMA. 
In this case devices are able to issue DMAs to each + other's MMIO regions. The VMM can permit this if it maps the MMIO memory into + the IOMMU. + + As Peer to Peer DMA is a MMIO touch like any other, it is important that + userspace suspend these accesses before entering any device_state where MMIO + is not permitted, such as !RUNNING. This can be accomplished with the NDMA + state. Userspace may also choose to remove MMIO mappings from the IOMMU if the + device does not support NDMA and rely on that to guarantee quiet MMIO. + + The Peer to Peer Grace States exist so that all devices may reach RUNNING + before any device is subjected to a MMIO access. + + Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate + the no-MMIO restriction during SAVING or RESUMING and corrupt the migration + on devices that cannot protect themselves. + +- IOMMU Page faults handled in user-space can occur at any time. A migration + driver is not required to serialize in-progress page faults. It can assume + that all page faults are completed before entering SAVING | !RUNNING. Since + the guest VCPU is required to complete page faults the VMM can accomplish + this by asserting NDMA | VCPU_RUNNING and clearing all pending page faults + before clearing VCPU_RUNNING. + + Devices that do not support NDMA cannot be configured to generate page faults + that require the VCPU to complete. + +- pre-copy allows the device to implement a dirty log for its internal state. + During the SAVING | RUNNING state the data window should present the device + state being logged and during SAVING | !RUNNING the data window should present + the unlogged device state as well as the changes from the internal dirty log. + + On RESUME these two data streams are concatenated together. + + pre-copy is only concerned with internal device state. External DMAs are + covered by the separate DIRTY_TRACKING function. + +- Atomic Read and Clear of the DMA log is a HW feature. 
If the tracker + cannot support this, then NDMA could be used to synthesize it less + efficiently. + +- NDMA is optional. If the device does not support this then the NDMA States + are pushed down to the next step in the sequence and various behaviors that + rely on NDMA cannot be used. + +- Migration control registers may reside inside the same iommu_group as the VFIO device. + This immediately raises a security concern as user-space can use Peer to Peer + DMA to manipulate these migration control registers concurrently with + any kernel actions. + + A device driver operating such a device must ensure that kernel integrity + cannot be broken by hostile user space operating the migration MMIO + registers via peer to peer, at any point in the sequence. Notably the kernel + cannot use DMA to transfer any migration data. + + However, as discussed above in the "Device Peer to Peer DMA" section, it can + assume quiet MMIO as a condition to have a successful and uncorrupted + migration. + +To elaborate on the reference flows, they assume the following details +about the external behaviors: + + !VCPU_RUNNING + User-space must not generate dirty pages or issue MMIO, PIO or equivalent + operations to devices. For a VMM this would typically be controlled by + KVM. + + DIRTY_TRACKING + Clear the DMA log and start DMA logging + + DMA logs should be readable with an "atomic test and clear" to allow + continuous non-disruptive sampling of the log. + + This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container + fd. + + !DIRTY_TRACKING + Freeze the DMA log, stop tracking and allow user-space to read it. + + If user-space is going to have any use of the dirty log it must ensure that + all DMA is suspended before clearing DIRTY_TRACKING, for instance by using + NDMA or !RUNNING on all VFIO devices. 
+ + +TBD - discoverable feature flag for NDMA +TBD IMS xlation +TBD PASID xlation VFIO bus driver API ------------------------------------------------------------------------------- base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049 -- 2.34.0 ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-11-29 14:45 [PATCH RFC v2] vfio: Documentation for the migration region Jason Gunthorpe @ 2021-11-30 17:26 ` Alex Williamson 2021-11-30 18:59 ` Jason Gunthorpe 0 siblings, 1 reply; 29+ messages in thread From: Alex Williamson @ 2021-11-30 17:26 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Mon, 29 Nov 2021 10:45:52 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > Provide some more complete documentation for the migration regions > behavior, specifically focusing on the device_state bits and the whole > system view from a VMM. > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > --- > Documentation/driver-api/vfio.rst | 277 +++++++++++++++++++++++++++++- > 1 file changed, 276 insertions(+), 1 deletion(-) > > Alex/Cornelia, here is the second draft of the requested documentation I promised > > We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism. > > Our thinking is that NDMA would be implemented like this: > > +#define VFIO_DEVICE_STATE_NDMA (1 << 3) > > And a .add_capability ops will be used to signal to userspace driver support: > > +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA 6 So based on this and the discussion later in the doc, NDMA is an optional device feature, is this specifically to support HNS? IIRC, this is a simple queue based device, but is it the fact that the queue lives in non-device memory that makes it such that the driver cannot mediate queue entries and simply add them to the migration stream? Some discussion of this requirement would be useful in the doc, otherwise it seems easier to deprecate the v1 migration region sub-type, and increment to a v2 where NDMA is a required feature. > I've described DIRTY TRACKING as a seperate concept here. 
With the current > uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our > change in direction this would be per-tracker control, but no semantic change. > > Upon some agreement we'll include this patch in the next iteration of the mlx5 > driver along with the NDMA bits. > > v2: > - RST fixups for sphinx rendering > - Inclue the priority order for multi-bit-changes > - Add a small discussion on devices like hns with migration control inside > the same function as is being migrated. > - Language cleanups from v1, the diff says almost every line was touched in some way > v1: https://lore.kernel.org/r/0-v1-0ec87874bede+123-vfio_mig_doc_jgg@nvidia.com > > diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst > index c663b6f978255b..d9be47570f878c 100644 > --- a/Documentation/driver-api/vfio.rst > +++ b/Documentation/driver-api/vfio.rst > @@ -242,7 +242,282 @@ group and can access them as follows:: > VFIO User API > ------------------------------------------------------------------------------- > > -Please see include/linux/vfio.h for complete API documentation. > +Please see include/uapi/linux/vfio.h for complete API documentation. > + > +------------------------------------------------------------------------------- > + > +VFIO migration driver API > +------------------------------------------------------------------------------- > + > +VFIO drivers that support migration implement a migration control register > +called device_state in the struct vfio_device_migration_info which is in its > +VFIO_REGION_TYPE_MIGRATION region. > + > +The device_state controls both device action and continuous behavior. > +Setting/clearing bit groups triggers device action, and each bit controls a > +continuous device behavior. > + > +Along with the device_state the migration driver provides a data window which > +allows streaming migration data into or out of the device. 
> + > +A lot of flexibility is provided to user-space in how it operates these > +bits. What follows is a reference flow for saving device state in a live > +migration, with all features, and an illustration how other external non-VFIO > +entities (VCPU_RUNNING and DIRTY_TRACKING) the VMM controls fit in. > + > + RUNNING, VCPU_RUNNING > + Normal operating state > + RUNNING, DIRTY_TRACKING, VCPU_RUNNING > + Log DMAs > + > + Stream all memory > + SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING > + Log internal device changes (pre-copy) > + > + Stream device state through the migration window > + > + While in this state repeat as desired: > + > + Atomic Read and Clear DMA Dirty log > + > + Stream dirty memory > + SAVING | NDMA | RUNNING, VCPU_RUNNING > + vIOMMU grace state > + > + Complete all in progress IO page faults, idle the vIOMMU > + SAVING | NDMA | RUNNING > + Peer to Peer DMA grace state > + > + Final snapshot of DMA dirty log (atomic not required) > + SAVING > + Stream final device state through the migration window > + > + Copy final dirty data > + 0 > + Device is halted > + > +and the reference flow for resuming: > + > + RUNNING > + Issue VFIO_DEVICE_RESET to clear the internal device state > + 0 > + Device is halted > + RESUMING > + Push in migration data. Data captured during pre-copy should be > + prepended to data captured during SAVING. > + NDMA | RUNNING > + Peer to Peer DMA grace state > + RUNNING, VCPU_RUNNING > + Normal operating state > + There are so many implicit concepts here, I'm not sure I'm making a correct interpretation, let alone someone coming to this document for understanding. > +If the VMM has multiple VFIO devices undergoing migration then the grace > +states act as cross device synchronization points. The VMM must bring all > +devices to the grace state before advancing past it. > + > +The above reference flows are built around specific requirements on the > +migration driver for its implementation of the migration_state input. 
I can't glean any meaning from this sentence. "device_state" here and throughout? We don't have a "migration_state". > +The migration_state cannot change asynchronously, upon writing the > +migration_state the driver will either keep the current state and return > +failure, return failure and go to ERROR, or succeed and go to the new state. > + > +Event triggered actions happen when user-space requests a new migration_state > +that differs from the current migration_state. Actions happen on a bit group > +basis: > + > + SAVING | RUNNING > + The device clears the data window and begins streaming 'pre copy' migration > + data through the window. Devices that cannot log internal state changes > + return a 0 length migration stream. > + > + SAVING | !RUNNING > + The device captures its internal state that is not covered by internal > + logging, as well as any logged changes. > + > + The device clears the data window and begins streaming the captured > + migration data through the window. Devices that cannot log internal state > + changes stream all their device state here. > + > + RESUMING > + The data window is cleared, opened, and can receive the migration data > + stream. > + > + !RESUMING > + All the data transferred into the data window is loaded into the device's > + internal state. The migration driver can rely on user-space issuing a > + VFIO_DEVICE_RESET prior to starting RESUMING. We can't really rely on userspace to do anything, nor has this sequence been part of the specified protocol. As with the previous flows, it seems like there's a ton of implicit knowledge here. Why are we documenting these here rather than in the uAPI header? I'm having a difficult time trying to understand what are proposals to modify the uAPI and what are interpretations of the existing protocol. > + > + To abort a RESUMING issue a VFIO_DEVICE_RESET. Again, this is not explicitly part of the current protocol. 
Any use of VFIO_DEVICE_RESET should return the device to the default state, but a user is free to try to use device_state to transition from RESUMING to any other state. The driver can choose to fail that transition and even make use of the error device_state, but there's no expectation that a VFIO_DEVICE_RESET is the first choice for userspace to abort a resume session. > + > + If the migration data is invalid then the ERROR state must be set. This is the only place the uAPI defines intervention via the RESET ioctl. > + > +Continuous actions are in effect when migration_state bit groups are active: > + > + RUNNING | NDMA > + The device is not allowed to issue new DMA operations. > + > + Whenever the kernel returns with a migration_state of NDMA there can be no > + in progress DMAs. > + There are certainly event triggered actions based on setting NDMA as well, ex. completion of outstanding DMA. > + !RUNNING > + The device should not change its internal state. Further implies the NDMA > + behavior above. Does this also imply other device regions cannot be accessed as has previously been suggested? Which? > + > + SAVING | !RUNNING > + RESUMING | !RUNNING > + The device may assume there are no incoming MMIO operations. > + > + Internal state logging can stop. > + > + RUNNING > + The device can alter its internal state and must respond to incoming MMIO. > + > + SAVING | RUNNING > + The device is logging changes to the internal state. > + > + ERROR > + The behavior of the device is largely undefined. The device must be > + recovered by issuing VFIO_DEVICE_RESET or closing the device file > + descriptor. > + > + However, devices supporting NDMA must behave as though NDMA is asserted > + during ERROR to avoid corrupting other devices or a VM during a failed > + migration. > + > +When multiple bits change in the migration_state they may describe multiple > +event triggered actions, and multiple changes to continuous actions. 
The > +migration driver must process them in a priority order: > + > + - SAVING | RUNNING > + - NDMA > + - !RUNNING > + - SAVING | !RUNNING > + - RESUMING > + - !RESUMING > + - RUNNING > + - !NDMA Lots of deduction left to the reader... > + > +In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the > +device back to device_state RUNNING. When a migration driver executes this > +ioctl it should discard the data window and set migration_state to RUNNING as > +part of resetting the device to a clean state. This must happen even if the > +migration_state has errored. A freshly opened device FD should always be in > +the RUNNING state. > + > +The migration driver has limitations on what device state it can affect. Any > +device state controlled by general kernel subsystems must not be changed > +during RESUME, and SAVING must tolerate mutation of this state. Change to > +externally controlled device state can happen at any time, asynchronously, to > +the migration (ie interrupt rebalancing). > + > +Some examples of externally controlled state: > + - MSI-X interrupt page More specifically, vector table and pba, but probably also config capability. > + - MSI/legacy interrupt configuration > + - Large parts of the PCI configuration space, ie common control bits > + - PCI power management > + - Changes via VFIO_DEVICE_SET_IRQS > + > +During !RUNNING, especially during SAVING and RESUMING, the device may have > +limitations on what it can tolerate. An ideal device will discard/return all > +ones to all incoming MMIO, PIO, or equivalent operations (exclusive of the > +external state above) in !RUNNING. However, devices are free to have undefined > +behavior if they receive incoming operations. This includes > +corrupting/aborting the migration, dirtying pages, and segfaulting user-space. > + > +However, a device may not compromise system integrity if it is subjected to a > +MMIO. 
It cannot trigger an error TLP, it cannot trigger an x86 Machine Check > +or similar, and it cannot compromise device isolation. > + > +There are several edge cases that user-space should keep in mind when > +implementing migration: > + > +- Device Peer to Peer DMA. In this case devices are able issue DMAs to each > + other's MMIO regions. The VMM can permit this if it maps the MMIO memory into > + the IOMMU. > + > + As Peer to Peer DMA is a MMIO touch like any other, it is important that > + userspace suspend these accesses before entering any device_state where MMIO > + is not permitted, such as !RUNNING. This can be accomplished with the NDMA > + state. Userspace may also choose to remove MMIO mappings from the IOMMU if the > + device does not support NDMA and rely on that to guarantee quiet MMIO. Seems that would have its own set of consequences. > + The Peer to Peer Grace States exist so that all devices may reach RUNNING > + before any device is subjected to a MMIO access. > + > + Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate > + the no-MMIO restriction during SAVING or RESUMING and corrupt the migration > + on devices that cannot protect themselves. > + > +- IOMMU Page faults handled in user-space can occur at any time. A migration > + driver is not required to serialize in-progress page faults. It can assume > + that all page faults are completed before entering SAVING | !RUNNING. Since > + the guest VCPU is required to complete page faults the VMM can accomplish > + this by asserting NDMA | VCPU_RUNNING and clearing all pending page faults > + before clearing VCPU_RUNNING. > + > + Device that do not support NDMA cannot be configured to generate page faults > + that require the VCPU to complete. So the VMM is required to hide features like PRI based on NDMA support? > + > +- pre-copy allows the device to implement a dirty log for its internal state. 
> + During the SAVING | RUNNING state the data window should present the device > + state being logged and during SAVING | !RUNNING the data window should present > + the unlogged device state as well as the changes from the internal dirty log. > + This is getting a bit close to specifying an implementation. The goal of the pre-copy is to minimize downtime during the stop-and-copy, therefore the specific data exposed in each phase depends on the usage model and data size of the device. > + On RESUME these two data streams are concatenated together. > + > + pre-copy is only concerned with internal device state. External DMAs are > + covered by the separate DIRTY_TRACKING function. > + > +- Atomic Read and Clear of the DMA log is a HW feature. If the tracker > + cannot support this, then NDMA could be used to synthesize it less > + efficiently. > + > +- NDMA is optional. If the device does not support this then the NDMA States > + are pushed down to the next step in the sequence and various behaviors that > + rely on NDMA cannot be used. > + > +- Migration control registers inside the same iommu_group as the VFIO device. Not a complete sentence, is this meant as a topic header? > + This immediately raises a security concern as user-space can use Peer to Peer > + DMA to manipulate these migration control registers concurrently with > + any kernel actions. > + We haven't defined "migration control registers" beyond device_state, which is software defined "register". What physical registers that are subject to p2p DMA is this actually referring to? > + A device driver operating such a device must ensure that kernel integrity > + cannot be broken by hostile user space operating the migration MMIO > + registers via peer to peer, at any point in the sequence. Notably the kernel > + cannot use DMA to transfer any migration data. 
> + > + However, as discussed above in the "Device Peer to Peer DMA" section, it can > + assume quiet MMIO as a condition to have a successful and uncorrupted > + migration. > + > +To elaborate details on the reference flows, they assume the following details > +about the external behaviors: > + > + !VCPU_RUNNING > + User-space must not generate dirty pages or issue MMIO, PIO or equivalent > + operations to devices. For a VMM this would typically be controlled by > + KVM. > + > + DIRTY_TRACKING > + Clear the DMA log and start DMA logging > + > + DMA logs should be readable with an "atomic test and clear" to allow > + continuous non-disruptive sampling of the log. > + > + This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container > + fd. > + > + !DIRTY_TRACKING > + Freeze the DMA log, stop tracking and allow user-space to read it. > + > + If user-space is going to have any use of the dirty log it must ensure that > + all DMA is suspended before clearing DIRTY_TRACKING, for instance by using > + NDMA or !RUNNING on all VFIO devices. > + > + > +TDB - discoverable feature flag for NDMA Is the goal to release mlx5 support without the NDMA feature and add it later? Thanks, Alex > +TDB IMS xlation > +TBD PASID xlation > > VFIO bus driver API > ------------------------------------------------------------------------------- > > base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-11-30 17:26 ` Alex Williamson @ 2021-11-30 18:59 ` Jason Gunthorpe 2021-11-30 22:35 ` Alex Williamson 0 siblings, 1 reply; 29+ messages in thread From: Jason Gunthorpe @ 2021-11-30 18:59 UTC (permalink / raw) To: Alex Williamson Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Tue, Nov 30, 2021 at 10:26:11AM -0700, Alex Williamson wrote: > On Mon, 29 Nov 2021 10:45:52 -0400 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > Provide some more complete documentation for the migration regions > > behavior, specifically focusing on the device_state bits and the whole > > system view from a VMM. > > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > > Documentation/driver-api/vfio.rst | 277 +++++++++++++++++++++++++++++- > > 1 file changed, 276 insertions(+), 1 deletion(-) > > > > Alex/Cornelia, here is the second draft of the requested documentation I promised > > > > We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism. > > > > Our thinking is that NDMA would be implemented like this: > > > > +#define VFIO_DEVICE_STATE_NDMA (1 << 3) > > > > And a .add_capability ops will be used to signal to userspace driver support: > > > > +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA 6 > > So based on this and the discussion later in the doc, NDMA is an > optional device feature, is this specifically to support HNS? Yes. It is not trivial to implement NDMA in a device, we already have HNS as a public existing example that cannot do it, so it is a permanent optional feature. > IIRC, this is a simple queue based device, but is it the fact that > the queue lives in non-device memory that makes it such that the > driver cannot mediate queue entries and simply add them to the to > migration stream? 
From what HNS said the device driver would have to trap every MMIO to implement NDMA as it must prevent touches to the physical HW MMIO to maintain the NDMA state. The issue is that the HW migration registers can stop processing the queue and thus enter NDMA but an MMIO touch can resume queue processing, so NDMA cannot be sustained. Trapping every MMIO would have a huge negative performance impact. So it doesn't make sense to do so for a device that is not intended to be used in any situation where NDMA is required. > Some discussion of this requirement would be useful in the doc, > otherwise it seems easier to deprecate the v1 migration region > sub-type, and increment to a v2 where NDMA is a required feature. I can add some words at the bottom, but since NDMA is a completely transparent optional feature I don't see any reason to have v2. > There are so many implicit concepts here, I'm not sure I'm making a > correct interpretation, let alone someone coming to this document for > understanding. That isn't really helpful feedback.. The above is something like a summary to give context to the below which provides a lot of detail to each step. If you read the above/below together and find stuff is lacking, then please point to it and let's add it > > +If the VMM has multiple VFIO devices undergoing migration then the grace > > +states act as cross device synchronization points. The VMM must bring all > > +devices to the grace state before advancing past it. > > + > > +The above reference flows are built around specific requirements on the > > +migration driver for its implementation of the migration_state input. > > I can't glean any meaning from this sentence. "device_state" here and > throughout? We don't have a "migration_state". Yes, migration_state is a spello for device_state, I'll fix them all > > + !RESUMING > > + All the data transferred into the data window is loaded into the device's > > + internal state.
The migration driver can rely on user-space issuing a > > + VFIO_DEVICE_RESET prior to starting RESUMING. > > We can't really rely on userspace to do anything, nor has this sequence > been part of the specified protocol. It is existing behavior of qemu - which is why we documented it. Either qemu shouldn't do it as devices must fully self-reset, or we should have it part of the canonical flow and devices may as well expect it. It is useful because post VFIO_DEVICE_RESET all DMA is quiet, no outstanding PRIs exist, etc etc. > As with the previous flows, it seems like there's a ton of implicit > knowledge here. Why are we documenting these here rather than in the > uAPI header? Because this is 300 lines already and is too complicated/long to properly live in a uapi header. > I'm having a difficult time trying to understand what are > proposals to modify the uAPI and what are interpretations of the > existing protocol. As far as we know this describes what current qemu does in the success path with a single VFIO device. ie we think mlx5 conforms to this spec and we see it works as-is with qemu, up to qemu's limitations: - qemu has no support for errors or error recovery, it just calls abort() - qemu does not stress the device_state and only does a few transition patterns - qemu doesn't support P2P cases due to the NDMA topic - simple devices like HNS will work, but not robustly in the face of a hostile VM and multiple VFIO devices.
As you say: > The driver can choose to fail that transition and even make use of > the error device_state, but there's no expectation that a Thus, the above advice. To reliably abort RESUMING use DEVICE_RESET, do not use ->RUNNING. > > +Continuous actions are in effect when migration_state bit groups are active: > > + > > + RUNNING | NDMA > > + The device is not allowed to issue new DMA operations. > > + > > + Whenever the kernel returns with a migration_state of NDMA there can be no > > + in progress DMAs. > > + > > There are certainly event triggered actions based on setting NDMA as > well, ex. completion of outstanding DMA. I'm leaving it implied that there is always some work required to begin/end these continuous behaviors > > + !RUNNING > > + The device should not change its internal state. Further implies the NDMA > > + behavior above. > > Does this also imply other device regions cannot be accessed as has > previously been suggested? Which? This question is discussed/answered below > > + - SAVING | RUNNING > > + - NDMA > > + - !RUNNING > > + - SAVING | !RUNNING > > + - RESUMING > > + - !RESUMING > > + - RUNNING > > + - !NDMA > > Lots of deduction left to the reader... Do you have an alternative language? This is quite complicated, I advise people to refer to mlx5's implementation. > > + As Peer to Peer DMA is a MMIO touch like any other, it is important that > > + userspace suspend these accesses before entering any device_state where MMIO > > + is not permitted, such as !RUNNING. This can be accomplished with the NDMA > > + state. Userspace may also choose to remove MMIO mappings from the IOMMU if the > > + device does not support NDMA and rely on that to guarantee quiet MMIO. > > Seems that would have its own set of consequences. Sure, userspace has to make choices here. > > + Device that do not support NDMA cannot be configured to generate page faults > > + that require the VCPU to complete. 
> > So the VMM is required to hide features like PRI based on NDMA support? Yep, looks like. > > +- pre-copy allows the device to implement a dirty log for its internal state. > > + During the SAVING | RUNNING state the data window should present the device > > + state being logged and during SAVING | !RUNNING the data window should present > > + the unlogged device state as well as the changes from the internal dirty log. > > This is getting a bit close to specifying an implementation. Bit close, but not too close :) I think it is clarifying to describe the general working of pre-copy - at least people here raised questions about what it is doing. Overall it must work in this basic way, and devices have freedom about what internal state they can/will log. There is just a clear division that every internal state in the first step is either immutable or logged, and that the second step is a delta over the first. > > +- Migration control registers inside the same iommu_group as the VFIO device. > > Not a complete sentence, is this meant as a topic header? Sure > > + This immediately raises a security concern as user-space can use Peer to Peer > > + DMA to manipulate these migration control registers concurrently with > > + any kernel actions. > > + > > We haven't defined "migration control registers" beyond device_state, > which is software defined "register". What physical registers that are > subject to p2p DMA is this actually referring to? Here this is talking about the device's physical HW control registers. > > +TDB - discoverable feature flag for NDMA > > Is the goal to release mlx5 support without the NDMA feature and add it > later? Yishai has a patch already to add NDMA to mlx5, it will come in the next iteration once we can agree on this document. qemu will follow sometime later. 
Below are the changes I made Thanks, Jason diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst index d9be47570f878c..69d77fb7e5a321 100644 --- a/Documentation/driver-api/vfio.rst +++ b/Documentation/driver-api/vfio.rst @@ -315,14 +315,14 @@ states act as cross device synchronization points. The VMM must bring all devices to the grace state before advancing past it. The above reference flows are built around specific requirements on the -migration driver for its implementation of the migration_state input. +migration driver for its implementation of the device_state input. -The migration_state cannot change asynchronously, upon writing the -migration_state the driver will either keep the current state and return +The device_state cannot change asynchronously, upon writing the +device_state the driver will either keep the current state and return failure, return failure and go to ERROR, or succeed and go to the new state. -Event triggered actions happen when user-space requests a new migration_state -that differs from the current migration_state. Actions happen on a bit group +Event triggered actions happen when user-space requests a new device_state +that differs from the current device_state. Actions happen on a bit group basis: SAVING | RUNNING @@ -351,12 +351,12 @@ basis: If the migration data is invalid then the ERROR state must be set. -Continuous actions are in effect when migration_state bit groups are active: +Continuous actions are in effect when device_state bit groups are active: RUNNING | NDMA The device is not allowed to issue new DMA operations. - Whenever the kernel returns with a migration_state of NDMA there can be no + Whenever the kernel returns with a device_state of NDMA there can be no in progress DMAs. !RUNNING @@ -384,7 +384,7 @@ Continuous actions are in effect when migration_state bit groups are active: during ERROR to avoid corrupting other devices or a VM during a failed migration. 
-When multiple bits change in the migration_state they may describe multiple +When multiple bits change in the device_state they may describe multiple event triggered actions, and multiple changes to continuous actions. The migration driver must process them in a priority order: @@ -399,9 +399,9 @@ migration driver must process them in a priority order: In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the device back to device_state RUNNING. When a migration driver executes this -ioctl it should discard the data window and set migration_state to RUNNING as +ioctl it should discard the data window and set device_state to RUNNING as part of resetting the device to a clean state. This must happen even if the -migration_state has errored. A freshly opened device FD should always be in +device_state has errored. A freshly opened device FD should always be in the RUNNING state. The migration driver has limitations on what device state it can affect. Any @@ -476,14 +476,23 @@ implementing migration: are pushed down to the next step in the sequence and various behaviors that rely on NDMA cannot be used. -- Migration control registers inside the same iommu_group as the VFIO device. - This immediately raises a security concern as user-space can use Peer to Peer - DMA to manipulate these migration control registers concurrently with + NDMA is made optional to support simple HW implementations that either just + cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only + necessary for special features like P2P and PRI, so devices can omit it in + exchange for limitations on the guest. + +- Devices that have their HW migration control MMIO registers inside the same + iommu_group as the VFIO device have some special considerations. In this + case a driver will be operating HW registers from kernel space that are also + subjected to userspace controlled DMA due to the iommu_group. 
+ + This immediately raises a security concern as user-space can use Peer to + Peer DMA to manipulate these migration control registers concurrently with any kernel actions. A device driver operating such a device must ensure that kernel integrity cannot be broken by hostile user space operating the migration MMIO - registers via peer to peer, at any point in the sequence. Notably the kernel + registers via peer to peer, at any point in the sequence. Further the kernel cannot use DMA to transfer any migration data. However, as discussed above in the "Device Peer to Peer DMA" section, it can ^ permalink raw reply related [flat|nested] 29+ messages in thread
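The "multiple bits change in the device_state" rule that the diff above clarifies can be expressed as a small decoder. This is a sketch under assumptions, not the mlx5 implementation: the action and function names are invented for illustration, the RUNNING/SAVING/RESUMING values match the v1 uAPI, and NDMA is only the bit value proposed in this thread. Each priority slot fires when its bits changed and the new state matches the listed group.

```c
#include <stddef.h>

/* RUNNING/SAVING/RESUMING match the v1 uAPI; NDMA is the proposed bit. */
#define VFIO_DEVICE_STATE_RUNNING  (1 << 0)
#define VFIO_DEVICE_STATE_SAVING   (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING (1 << 2)
#define VFIO_DEVICE_STATE_NDMA     (1 << 3)

/* Hypothetical action names, one per slot of the priority list. */
enum mig_action {
	ACT_START_PRECOPY,	/* SAVING | RUNNING */
	ACT_ENTER_NDMA,		/* NDMA */
	ACT_STOP_DEVICE,	/* !RUNNING */
	ACT_STOP_COPY,		/* SAVING | !RUNNING */
	ACT_START_RESUMING,	/* RESUMING */
	ACT_END_RESUMING,	/* !RESUMING */
	ACT_START_DEVICE,	/* RUNNING */
	ACT_EXIT_NDMA,		/* !NDMA */
};

/* Decompose one device_state write into the ordered list of event
 * triggered actions.  Returns the number of actions emitted to out[],
 * which must have room for 8 entries. */
static size_t mig_decode(unsigned int cur, unsigned int next,
			 enum mig_action *out)
{
	unsigned int chg = cur ^ next;
	size_t n = 0;

	if ((chg & (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING)) &&
	    (next & VFIO_DEVICE_STATE_SAVING) &&
	    (next & VFIO_DEVICE_STATE_RUNNING))
		out[n++] = ACT_START_PRECOPY;
	if ((chg & VFIO_DEVICE_STATE_NDMA) && (next & VFIO_DEVICE_STATE_NDMA))
		out[n++] = ACT_ENTER_NDMA;
	if ((chg & VFIO_DEVICE_STATE_RUNNING) &&
	    !(next & VFIO_DEVICE_STATE_RUNNING))
		out[n++] = ACT_STOP_DEVICE;
	if ((chg & (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING)) &&
	    (next & VFIO_DEVICE_STATE_SAVING) &&
	    !(next & VFIO_DEVICE_STATE_RUNNING))
		out[n++] = ACT_STOP_COPY;
	if ((chg & VFIO_DEVICE_STATE_RESUMING) &&
	    (next & VFIO_DEVICE_STATE_RESUMING))
		out[n++] = ACT_START_RESUMING;
	if ((chg & VFIO_DEVICE_STATE_RESUMING) &&
	    !(next & VFIO_DEVICE_STATE_RESUMING))
		out[n++] = ACT_END_RESUMING;
	if ((chg & VFIO_DEVICE_STATE_RUNNING) &&
	    (next & VFIO_DEVICE_STATE_RUNNING))
		out[n++] = ACT_START_DEVICE;
	if ((chg & VFIO_DEVICE_STATE_NDMA) && !(next & VFIO_DEVICE_STATE_NDMA))
		out[n++] = ACT_EXIT_NDMA;
	return n;
}
```

For example, writing SAVING while in SAVING | RUNNING (the stop-and-copy step) decodes to the ordered pair !RUNNING then SAVING | !RUNNING, mirroring the priority list in the patch.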
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-11-30 18:59 ` Jason Gunthorpe @ 2021-11-30 22:35 ` Alex Williamson 2021-12-01 3:14 ` Jason Gunthorpe 0 siblings, 1 reply; 29+ messages in thread From: Alex Williamson @ 2021-11-30 22:35 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Tue, 30 Nov 2021 14:59:10 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Nov 30, 2021 at 10:26:11AM -0700, Alex Williamson wrote: > > On Mon, 29 Nov 2021 10:45:52 -0400 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > Provide some more complete documentation for the migration regions > > > behavior, specifically focusing on the device_state bits and the whole > > > system view from a VMM. > > > > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > > > Documentation/driver-api/vfio.rst | 277 +++++++++++++++++++++++++++++- > > > 1 file changed, 276 insertions(+), 1 deletion(-) > > > > > > Alex/Cornelia, here is the second draft of the requested documentation I promised > > > > > > We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism. > > > > > > Our thinking is that NDMA would be implemented like this: > > > > > > +#define VFIO_DEVICE_STATE_NDMA (1 << 3) > > > > > > And a .add_capability ops will be used to signal to userspace driver support: > > > > > > +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA 6 > > > > So based on this and the discussion later in the doc, NDMA is an > > optional device feature, is this specifically to support HNS? > > Yes. It is not trivial to implement NDMA in a device, we already have > HNS as a public existing example that cannot do it, so it is a > permanent optional feature. 
> > > IIRC, this is a simple queue based device, but is it the fact that > > the queue lives in non-device memory that makes it such that the > > driver cannot mediate queue entries and simply add them to the > > migration stream? > > From what HNS said the device driver would have to trap every MMIO to > implement NDMA as it must prevent touches to the physical HW MMIO to > maintain the NDMA state. > > The issue is that the HW migration registers can stop processing the > queue and thus enter NDMA but a MMIO touch can resume queue > processing, so NDMA cannot be sustained. > > Trapping every MMIO would have a huge negative performance impact. So > it doesn't make sense to do so for a device that is not intended to be > used in any situation where NDMA is required. But migration is a cooperative activity with userspace. If necessary we can impose a requirement that mmap access to regions (other than the migration region itself) are dropped when we're in the NDMA or !RUNNING device_state. We can define additional sparse mmap capabilities for regions for various device states if we need more fine grained control. There's no reason that mediation while in the NDMA state needs to impose any performance penalty against the default RUNNING state. In fact, it seems rather the migration driver's job to provide such mediation.
> > There are so many implicit concepts here, I'm not sure I'm making a > > correct interpretation, let alone someone coming to this document for > > understanding. > > That isn't really helpful feedback.. > > The above is something like a summary to give context to the below > which provides a lot of detail to each step. > > If you read the above/below together and find stuff is lacking, then > please point to it and let's add it As I think Connie also had trouble with, combining device_state with IOMMU migration features and VMM state, without any preceding context and visual cues makes the section confusing. I did gain context as I read further through the doc, but I also had the advantage of being rather familiar with the topic. Maybe a table format would help to segment the responsibilities? > > > +If the VMM has multiple VFIO devices undergoing migration then the grace > > > +states act as cross device synchronization points. The VMM must bring all > > > +devices to the grace state before advancing past it. > > > + > > > +The above reference flows are built around specific requirements on the > > > +migration driver for its implementation of the migration_state input. > > > > I can't glean any meaning from this sentence. "device_state" here and > > throughout? We don't have a "migration_state". > > Yes, migration_state is a spello for device_state, I'll fix them all > > > > + !RESUMING > > > + All the data transferred into the data window is loaded into the device's > > > + internal state. The migration driver can rely on user-space issuing a > > > + VFIO_DEVICE_RESET prior to starting RESUMING. > > > > We can't really rely on userspace to do anything, nor has this sequence > > been part of the specified protocol. > > It is existing behavior of qemu - which is why we documented it. QEMU resets devices as part of initializing the VM, but I don't see that QEMU specifically resets a device in order to transition it to the RESUMING device_state.
> Either qemu shouldn't do it as devices must fully self-reset, or we > should have it part of the canonical flow and devices may as well > expect it. It is useful because post VFIO_DEVICE_RESET all DMA is > quiet, no outstanding PRIs exist, etc etc. It's valid for QEMU to reset the device any time it wants, saying that it cannot perform a reset before transitioning to the RESUMING state is absurd. Userspace can do redundant things for their own convenience. We don't currently specify any precondition for a device to enter the RESUMING state. The driver can of course nak the state change with an errno, or hard nak it with an errno and ERROR device_state, which would require userspace to make use of VFIO_DEVICE_RESET. I think I would also consider it valid if the driver internally performed a reset on a RUNNING -> RESUMING state change, but we've never previously expressed any linkage of a userspace requirement to reset the device here. > > As with the previous flows, it seems like there's a ton of implicit > > knowledge here. Why are we documenting these here rather than in the > > uAPI header? > > Because this is 300 lines already and is too complicated/long to > properly live in a uapi header. Minimally we need to resolve that this document must be consistent with the uAPI. I'm not sure that's entirely the case in this draft. > > I'm having a difficult time trying to understand what are > > proposals to modify the uAPI and what are interpretations of the > > existing protocol. > > As far as we know this describes what current qemu does in the success > path with a single VFIO device.
ie we think mlx5 conforms to this spec > and we see it works as-is with qemu, up to qemu's limitations: > > - qemu has no support for errors or error recovery, it just calls > abort() > - qemu does not stress the device_state and only does a few > transition patterns > - qemu doesn't support P2P cases due to the NDMA topic Or rather QEMU does support p2p cases regardless of the NDMA topic. > - simple devices like HNS will work, but not robustly in the face of > a hostile VM and multiple VFIO devices. So what's the goal here, are we trying to make the one currently implemented and unsupported userspace be the gold standard to which drivers should base their implementation? We've tried to define a specification that's more flexible than a single implementation and by these standards we seem to be flipping that implementation back into the specification. > > > + To abort a RESUMING issue a VFIO_DEVICE_RESET. > > > > Any use of VFIO_DEVICE_RESET should return the device to the default > > state, but a user is free to try to use device_state to transition > > from RESUMING to any other state. > > Userspace can attempt all transitions, of course. > > However, notice this spec doesn't specify what happens on non-success > RESUMING->RUNNING. So, I'm calling that undefined behavior. > > As you say: > > > The driver can choose to fail that transition and even make use of > > the error device_state, but there's no expectation that a > > Thus, the above advice. To reliably abort RESUMING use DEVICE_RESET, > do not use ->RUNNING. Userspace can attempt RESUMING -> RUNNING regardless of what we specify, so a driver needs to be prepared for such an attempted state change either way. So what's the advantage to telling a driver author that they can expect a given behavior? It doesn't make much sense to me to glue two separate userspace operations together to say these must be done in this sequence, back to back. 
If we want the device to be reset in order to enter RESUMING, the driver should simply reset the device as necessary during the state transition. The outward effect to the user is to specify that device internal state may not be retained on transition from RUNNING -> RESUMING. > > > +Continuous actions are in effect when migration_state bit groups are active: > > > + > > > + RUNNING | NDMA > > > + The device is not allowed to issue new DMA operations. > > > + > > > + Whenever the kernel returns with a migration_state of NDMA there can be no > > > + in progress DMAs. > > > + > > > > There are certainly event triggered actions based on setting NDMA as > > well, ex. completion of outstanding DMA. > > I'm leaving it implied that there is always some work required to > begin/end these continuous behaviors > > > > + !RUNNING > > > + The device should not change its internal state. Further implies the NDMA > > > + behavior above. > > > > Does this also imply other device regions cannot be accessed as has > > previously been suggested? Which? > > This question is discussed/answered below > > > > + - SAVING | RUNNING > > > + - NDMA > > > + - !RUNNING > > > + - SAVING | !RUNNING > > > + - RESUMING > > > + - !RESUMING > > > + - RUNNING > > > + - !NDMA > > > > Lots of deduction left to the reader... > > Do you have an alternative language? This is quite complicated, I > advise people to refer to mlx5's implementation. I agree with Connie on this, if the reader of the documentation needs to look at a specific driver implementation to understand the reasoning, the documentation has failed. If it can be worked out by looking at the device_state write function of the mlx5 driver, then surely a sentence or two for each priority item can be added here. Part of the problem is that the nomenclature is unclear, we're listing bit combinations, but not the changed bit(s) and we need to infer the state. 
The mlx5 driver specifically looks for individual state bit flips in the presence of an existing state. I'm not able to obviously map the listing above to the latest posted version of the mlx5 driver. > > > + As Peer to Peer DMA is a MMIO touch like any other, it is important that > > > + userspace suspend these accesses before entering any device_state where MMIO > > > + is not permitted, such as !RUNNING. This can be accomplished with the NDMA > > > + state. Userspace may also choose to remove MMIO mappings from the IOMMU if the > > > + device does not support NDMA and rely on that to guarantee quiet MMIO. > > > > Seems that would have its own set of consequences. > > Sure, userspace has to make choices here. It seems a bit loaded to suggest an alternative choice if it's not practical or equivalent. Maybe it's largely the phrasing, I read "remove MMIO mappings" as to drop them dynamically, when I think we've discussed that userspace might actually preclude these mappings for non-NDMA devices such that p2p DMA cannot exist, ever. > > > + Device that do not support NDMA cannot be configured to generate page faults > > > + that require the VCPU to complete. > > > > So the VMM is required to hide features like PRI based on NDMA support? > > Yep, looks like. > > > > +- pre-copy allows the device to implement a dirty log for its internal state. > > > + During the SAVING | RUNNING state the data window should present the device > > > + state being logged and during SAVING | !RUNNING the data window should present > > > + the unlogged device state as well as the changes from the internal dirty log. > > > > This is getting a bit close to specifying an implementation. > > Bit close, but not too close :) I think it is clarifying to > describe the general working of pre-copy - at least people here raised > questions about what it is doing. > > Overall it must work in this basic way, and devices have freedom about > what internal state they can/will log. 
There is just a clear division > that every internal state in the first step is either immutable or > logged, and that the second step is a delta over the first. I agree that it's a reasonable approach, though as I read the proposed text, there's no mention of immutable state and no reason a driver would implement a dirty log for immutable state, therefore we seem to be suggesting such data for the stop-and-copy phase when it would actually be preferable to include it in pre-copy. I think the fact that a user is not required to run the pre-copy phase until completion is also noteworthy. > > > +- Migration control registers inside the same iommu_group as the VFIO device. > > > > Not a complete sentence, is this meant as a topic header? > > Sure > > > > + This immediately raises a security concern as user-space can use Peer to Peer > > > + DMA to manipulate these migration control registers concurrently with > > > + any kernel actions. > > > + > > > > We haven't defined "migration control registers" beyond device_state, > > which is software defined "register". What physical registers that are > > subject to p2p DMA is this actually referring to? > > Here this is talking about the device's physical HW control registers. > > > > +TDB - discoverable feature flag for NDMA > > > > Is the goal to release mlx5 support without the NDMA feature and add it > > later? > > Yishai has a patch already to add NDMA to mlx5, it will come in the > next iteration once we can agree on this document. qemu will follow > sometime later. So it's not really a TBD, it's resolved in a uAPI update that will be included with the next revision? Thanks, Alex ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-11-30 22:35 ` Alex Williamson @ 2021-12-01 3:14 ` Jason Gunthorpe 2021-12-01 9:54 ` Shameerali Kolothum Thodi 2021-12-01 20:03 ` Alex Williamson 0 siblings, 2 replies; 29+ messages in thread From: Jason Gunthorpe @ 2021-12-01 3:14 UTC (permalink / raw) To: Alex Williamson Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote: > > From what HNS said the device driver would have to trap every MMIO to > > implement NDMA as it must prevent touches to the physical HW MMIO to > > maintain the NDMA state. > > > > The issue is that the HW migration registers can stop processing the > > queue and thus enter NDMA but a MMIO touch can resume queue > > processing, so NDMA cannot be sustained. > > > > Trapping every MMIO would have a huge negative performance impact. So > > it doesn't make sense to do so for a device that is not intended to be > > used in any situation where NDMA is required. > > But migration is a cooperative activity with userspace. If necessary > we can impose a requirement that mmap access to regions (other than the > migration region itself) are dropped when we're in the NDMA or !RUNNING > device_state. It is always NDMA|RUNNING, so we can't fully drop access to MMIO. Userspace would have to transfer from direct MMIO to trapping. With enough new kernel infrastructure and qemu support it could be done. Even so, we can't trap accesses through the IOMMU so such a scheme would still require removing IOMMU access to the device. Given that the basic qemu mitigation for no NDMA support is to eliminate P2P cases by removing the IOMMU mappings this doesn't seem to advance anything and only creates complexity. At least I'm not going to insist that hns do all kinds of work like this for an edge case they don't care about as a precondition to get a migration driver.
> There's no reason that mediation while in the NDMA state needs to > impose any performance penalty against the default RUNNING state. Eh? Mitigation of no NDMA support would have to mediate the MMIO on a performance doorbell path, there is no escaping a performance hit. I'm not sure what you mean > > > Some discussion of this requirement would be useful in the doc, > > > otherwise it seems easier to deprecate the v1 migration region > > > sub-type, and increment to a v2 where NDMA is a required feature. > > > > I can add some words at the bottom, but since NDMA is a completely > > transparent optional feature I don't see any reason to have v2. > > It's hardly transparent, aiui userspace is going to need to impose a > variety of loosely defined restrictions for devices without NDMA > support. It would be far easier if we could declare NDMA support to be > a requirement. It would make userspace a bit simpler at the cost of excluding or complicating devices like hns for a use case they don't care about. On the other hand, the simple solution in qemu is when there is no universal NDMA it simply doesn't include any MMIO ranges in the IOMMU. > As I think Connie also had trouble with, combining device_state with > IOMMU migration features and VMM state, without any preceding context > and visual cues makes the section confusing. I did gain context as I > read further through the doc, but I also had the advantage of being > rather familiar with the topic. Maybe a table format would help to > segment the responsibilities? I moved the context to the bottom exactly because Connie said it was confusing at the start. :) This is an RST document so I'm not keen to make huge formatting adventures for minimal readability gain. I view this as something that probably needs to be read a few times, along with the code and header files for someone brand new to understand. I'm Ok with that, it is about consistent with kernel docs of this level.
What I would like is if userspace focused readers can get their important bits of information with less work. > > It is existing behavior of qemu - which is why we documented it. > > QEMU resets devices as part of initializing the VM, but I don't see > that QEMU specifically resets a device in order to transition it to > the RESUMING device_state. We instrumented the kernel and monitored qemu, it showed up on the resume traces. > > Either qemu shouldn't do it as devices must fully self-reset, or we > > should have it part of the canonical flow and devices may as well > > expect it. It is useful because post VFIO_DEVICE_RESET all DMA is > > quiet, no outstanding PRIs exist, etc etc. > > It's valid for QEMU to reset the device any time it wants, saying that > it cannot perform a reset before transitioning to the RESUMING state is > absurd. Userspace can do redundant things for their own convenience. I didn't say cannot, I said it shouldn't. Since qemu is the only implementation it would be easy for drivers to rely on the implicit reset it seems to do, it seems an important point that should be written either way. I don't have a particular requirement to have the reset, but it does seem like a good idea. If you feel strongly, then let's say the opposite that the driver must enter RESUME with no preconditions, doing an internal reset if required. > We don't currently specify any precondition for a device to enter the > RESUMING state. The driver can of course nak the state change with an > errno, or hard nak it with an errno and ERROR device_state, which would > require userspace to make use of VFIO_DEVICE_RESET. I don't think we should be relying on every driver doing something totally different on the standard path. That is only going to hurt interoperability. > > > As with the previous flows, it seems like there's a ton of implicit > > > knowledge here. Why are we documenting these here rather than in the > > > uAPI header?
> > > > Because this is 300 lines already and is too complicated/long to > > properly live in a uapi header. > > Minimally we need to resolve that this document must be consistent with > the uAPI. I'm not sure that's entirely the case in this draft. Can you point to something, please? I can't work with "I'm not sure" IMO the header file doesn't really say much and can be read in a way that is consistent with this more specific document. > > - qemu doesn't support P2P cases due to the NDMA topic > > Or rather QEMU does support p2p cases regardless of the NDMA topic. I mean support in a way that is actually usable, as without NDMA it corrupts the VM when it migrates it. > > - simple devices like HNS will work, but not robustly in the face of > > a hostile VM and multiple VFIO devices. > > So what's the goal here, are we trying to make the one currently > implemented and unsupported userspace be the gold standard to which > drivers should base their implementation? I have no idea anymore. You asked for docs and a complete picture as a precondition for merging a driver. Here it is. What do you want? > We've tried to define a specification that's more flexible than a > single implementation and by these standards we seem to be flipping > that implementation back into the specification. What specification!?! All we have is a couple lines in a header file that is nowhere near detailed enough for multi-driver interoperability with userspace. You have no idea how much effort has been expended to get this far based on the few breadcrumbs that were left, and we have access to the team that made the only other implementation! *flexible* is not a specification. > Userspace can attempt RESUMING -> RUNNING regardless of what we specify, > so a driver needs to be prepared for such an attempted state change > either way. So what's the advantage to telling a driver author that > they can expect a given behavior? 
The above didn't tell a driver author to expect a certain behavior; it tells userspace what to do. > It doesn't make much sense to me to glue two separate userspace > operations together to say these must be done in this sequence, back to > back. If we want the device to be reset in order to enter RESUMING, the > driver should simply reset the device as necessary during the state > transition. The outward effect to the user is to specify that device > internal state may not be retained on transition from RUNNING -> > RESUMING. Maybe, and I'm happy if you want to specify this instead. It just doesn't match what we observe qemu to be doing. > > Do you have an alternative language? This is quite complicated, I > > advise people to refer to mlx5's implementation. > > I agree with Connie on this, if the reader of the documentation needs > to look at a specific driver implementation to understand the > reasoning, the documentation has failed. Let's agree on some objective here: this is not trying to be fully comprehensive, or fully standalone. It is intended to drive agreement, be informative to userspace, and be supplemental to the actual code. > If it can be worked out by looking at the device_state write > function of the mlx5 driver, then surely a sentence or two for each > priority item can be added here. Please give me a suggestion then, because I don't know what will help here. > Part of the problem is that the nomenclature is unclear, we're listing > bit combinations, but not the changed bit(s) and we need to infer the > state. Each line lists the new state; the changed bits are thus any bits that make up the new state. If you look at how mlx5 is constructed, each if has a 'did it change' test followed by 'what state is it in now'. So the document is read as listing the order the driver enters the new states. I clarified it as ""must process the new device_state bits in a priority order"" > flips in the presence of an existing state. 
I'm not able to obviously > map the listing above to the latest posted version of the mlx5 driver. One of the things we've done is align mlx5 more clearly to this. For instance it no longer has a mixture of state and old state in the if statements, it always tests the new state so the tests logically follow what is written here Stripping away the excess the expressions now look like this: !(state & VFIO_DEVICE_STATE_RUNNING) ((state & (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING)) == VFIO_DEVICE_STATE_SAVING)) (state & VFIO_DEVICE_STATE_RESUMING) Which mirror what is written here. > > > > + As Peer to Peer DMA is a MMIO touch like any other, it is important that > > > > + userspace suspend these accesses before entering any device_state where MMIO > > > > + is not permitted, such as !RUNNING. This can be accomplished with the NDMA > > > > + state. Userspace may also choose to remove MMIO mappings from the IOMMU if the > > > > + device does not support NDMA and rely on that to guarantee quiet MMIO. > > > > > > Seems that would have its own set of consequences. > > > > Sure, userspace has to make choices here. > > It seems a bit loaded to suggest an alternative choice if it's not > practical or equivalent. Maybe it's largely the phrasing, I read > "remove MMIO mappings" as to drop them dynamically, when I think we've > discussed that userspace might actually preclude these mappings for > non-NDMA devices such that p2p DMA cannot exist, ever. I mean the latter. How about "never install MMIO mappings" ? > > Overall it must work in this basic way, and devices have freedom about > > what internal state they can/will log. There is just a clear division > > that every internal state in the first step is either immutable or > > logged, and that the second step is a delta over the first. 
> > I agree that it's a reasonable approach, though as I read the proposed > text, there's no mention of immutable state and no reason a driver > would implement a dirty log for immutable state, therefore we seem to > be suggesting such data for the stop-and-copy phase when it would > actually be preferable to include it in pre-copy. I'd say that is a detail we don't need to discuss/define, it has no user space visible consequence. > I think the fact that a user is not required to run the pre-copy > phase until completion is also noteworthy. This text doesn't try to detail how the migration window works, that is a different large task. The intention is that the migration window must be fully drained to be successful. I added this for some clarity ""The entire migration data, up to each end of stream must be transported from the saving to resuming side."" > > Yishai has a patch already to add NDMA to mlx5, it will come in the > > next iteration once we can agree on this document. qemu will follow > > sometime later. > > So it's not really a TBD, it's resolved in a uAPI update that will be > included with the next revision? Thanks, There is a patch yes, the TBD here is to include a few words about how to detect NDMA Jason ^ permalink raw reply [flat|nested] 29+ messages in thread
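As a supplement to the NDMA detection question above: with the capability ID proposed in the cover letter (VFIO_REGION_INFO_CAP_MIGRATION_NDMA), userspace would discover NDMA support by walking the capability chain of the VFIO_DEVICE_GET_REGION_INFO reply. A minimal, self-contained sketch follows; the helper name is invented here, the vfio_info_cap_header layout matches the existing uAPI, and the NDMA capability ID is only the proposal, not merged uAPI:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Capability chain header, as laid out in include/uapi/linux/vfio.h */
struct vfio_info_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of next capability from start of the info buffer */
};

/* Proposed in the cover letter; NOT yet in the mainline header */
#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA 6

/*
 * Walk the capability chain of a VFIO_DEVICE_GET_REGION_INFO reply.
 * 'buf' is the raw reply buffer, 'cap_offset' comes from
 * vfio_region_info.cap_offset (valid when VFIO_REGION_INFO_FLAG_CAPS
 * is set).  Returns non-zero if the proposed NDMA capability is found.
 */
static int region_supports_ndma(const uint8_t *buf, uint32_t cap_offset)
{
	while (cap_offset) {
		struct vfio_info_cap_header hdr;

		memcpy(&hdr, buf + cap_offset, sizeof(hdr));
		if (hdr.id == VFIO_REGION_INFO_CAP_MIGRATION_NDMA)
			return 1;
		cap_offset = hdr.next;
	}
	return 0;
}
```

The same loop pattern already serves for the existing region capabilities (sparse mmap, region type), so a migration-aware VMM likely has this walk in place and only needs the new ID.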
* RE: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-01 3:14 ` Jason Gunthorpe @ 2021-12-01 9:54 ` Shameerali Kolothum Thodi 2021-12-01 13:49 ` Jason Gunthorpe 2021-12-01 20:03 ` Alex Williamson 1 sibling, 1 reply; 29+ messages in thread From: Shameerali Kolothum Thodi @ 2021-12-01 9:54 UTC (permalink / raw) To: Jason Gunthorpe, Alex Williamson Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Yishai Hadas, Zengtao (B), liulongfang > -----Original Message----- > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > Sent: 01 December 2021 03:14 > To: Alex Williamson <alex.williamson@redhat.com> > Cc: Jonathan Corbet <corbet@lwn.net>; linux-doc@vger.kernel.org; Cornelia > Huck <cohuck@redhat.com>; kvm@vger.kernel.org; Kirti Wankhede > <kwankhede@nvidia.com>; Max Gurtovoy <mgurtovoy@nvidia.com>; > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yishai > Hadas <yishaih@nvidia.com> > Subject: Re: [PATCH RFC v2] vfio: Documentation for the migration region > > On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote: > > > > From what HNS said the device driver would have to trap every MMIO to > > > implement NDMA as it must prevent touches to the physical HW MMIO to > > > maintain the NDMA state. > > > > > > The issue is that the HW migration registers can stop processing the > > > queue and thus enter NDMA but a MMIO touch can resume queue > > > processing, so NDMA cannot be sustained. > > > > > > Trapping every MMIO would have a huge negative performance impact. > So > > > it doesn't make sense to do so for a device that is not intended to be > > > used in any situation where NDMA is required. > > > > But migration is a cooperative activity with userspace. If necessary > > we can impose a requirement that mmap access to regions (other than the > > migration region itself) are dropped when we're in the NDMA or !RUNNING > > device_state. 
> > It is always NDMA|RUNNING, so we can't fully drop access to > MMIO. Userspace would have to transfer from direct MMIO to > trapping. With enough new kernel infrastructure and qemu support it > could be done. As far as our devices are concerned we put the dev queue into a PAUSE state in the !RUNNING state. And since we don't have any P2P support, is it OK to put the onus on userspace here that it won't try to access the MMIO during the !RUNNING state? So just to make it clear, if a device declares that it doesn't support NDMA and P2P, is the v1 version of the spec good enough, or do we still need to take care of the case that a malicious user might try MMIO access in !RUNNING state and should have kernel infrastructure in place to safeguard that? > > Even so, we can't trap accesses through the IOMMU so such a scheme > would still require removing IOMMU access to the device. Given that the > basic qemu mitigation for no NDMA support is to eliminate P2P cases by > removing the IOMMU mappings this doesn't seem to advance anything and > only creates complexity. > > At least I'm not going to insist that hns do all kinds of work like > this for an edge case they don't care about as a precondition to get a > migration driver. Yes. That's our concern too. (Just a note to clarify that these are not HNS devices per se. HNS actually stands for HiSilicon Network Subsystem and doesn't currently have live migration capability. The devices capable of live migration are HiSilicon Accelerator devices). Thanks, Shameer
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-01 9:54 ` Shameerali Kolothum Thodi @ 2021-12-01 13:49 ` Jason Gunthorpe 0 siblings, 0 replies; 29+ messages in thread From: Jason Gunthorpe @ 2021-12-01 13:49 UTC (permalink / raw) To: Shameerali Kolothum Thodi Cc: Alex Williamson, Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Yishai Hadas, Zengtao (B), liulongfang On Wed, Dec 01, 2021 at 09:54:27AM +0000, Shameerali Kolothum Thodi wrote: > So just to make it clear , if a device declares that it doesn't support NDMA > and P2P, is the v1 version of the spec good enough or we still need to take > care the case that a malicious user might try MMIO access in !RUNNING > state and should have kernel infrastructure in place to safe guard that? My thinking is so long as the hostile user space cannot compromise the kernel it is OK. A corrupted migration is acceptable if userspace is not following the rules. From a qemu perspective it should prevent a hostile VM from corrupting the migration, as that is allowing the VM to attack the infrastructure even if it hopefully only harms itself. > (Just a note to clarify that these are not HNS devices per se. HNS actually > stands for HiSilicon Network Subsystem and doesn't currently have live > migration capability. The devices capable of live migration are HiSilicon > Accelerator devices). Sorry, I mostly talk to the hns team ;) Jason ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-01 3:14 ` Jason Gunthorpe 2021-12-01 9:54 ` Shameerali Kolothum Thodi @ 2021-12-01 20:03 ` Alex Williamson 2021-12-01 23:25 ` Jason Gunthorpe 1 sibling, 1 reply; 29+ messages in thread From: Alex Williamson @ 2021-12-01 20:03 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Tue, 30 Nov 2021 23:14:07 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote: > > > > From what HNS said the device driver would have to trap every MMIO to > > > implement NDMA as it must prevent touches to the physical HW MMIO to > > > maintain the NDMA state. > > > > > > The issue is that the HW migration registers can stop processing the > > > queue and thus enter NDMA but a MMIO touch can resume queue > > > processing, so NDMA cannot be sustained. > > > > > > Trapping every MMIO would have a huge negative performance impact. So > > > it doesn't make sense to do so for a device that is not intended to be > > > used in any situation where NDMA is required. > > > > But migration is a cooperative activity with userspace. If necessary > > we can impose a requirement that mmap access to regions (other than the > > migration region itself) are dropped when we're in the NDMA or !RUNNING > > device_state. > > It is always NDMA|RUNNING, so we can't fully drop access to > MMIO. Userspace would have to transfer from direct MMIO to > trapping. With enough new kernel infrastructure and qemu support it > could be done. This is simply toggling whether the mmap MemoryRegion in QEMU is enabled, when not enabled access falls through to the read/write access functions. We already have this functionality for unmasking INTx interrupts when we don't have EOI support. 
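The mmap-vs-trap fallback described above (in QEMU this is done by toggling the mmap MemoryRegion so accesses fall through to the read/write handlers) can be modeled in miniature. This is only an illustrative toy, not the real QEMU MemoryRegion API; every name below is invented:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model: when mmap_enabled is set, the access goes straight to the
 * (model of the) hardware register; when cleared, e.g. while quiescing
 * a non-NDMA device, the access is counted as trapped so mediation
 * could inspect or defer it.
 */
struct mmio_region {
	int mmap_enabled;	/* 1: direct mapping live, 0: trap-and-mediate */
	uint32_t hw_reg;	/* stand-in for a doorbell register */
	unsigned int trapped;	/* number of mediated accesses */
};

static void region_write(struct mmio_region *r, uint32_t val)
{
	if (!r->mmap_enabled)
		r->trapped++;	/* slow path: mediation sees the access */
	r->hw_reg = val;	/* either way the write still lands */
}
```

The point of the thread: the slow path only exists while the device is being quiesced, so its cost is irrelevant to the normal RUNNING state.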
> Even so, we can't trap accesses through the IOMMU so such a scheme > would still require removing IOMMU access to the device. Given that the > basic qemu mitigation for no NDMA support is to eliminate P2P cases by > removing the IOMMU mappings this doesn't seem to advance anything and > only creates complexity. NDMA only requires that the device cease initiating DMA, so I suppose you're suggesting that the MMIO of a device that doesn't expect to make use of p2p DMA could be poked through DMA, which might cause the device to initiate a DMA and potentially lose sync with mediation. That would be bad, but seems like a non-issue for hns. > At least I'm not going to insist that hns do all kinds of work like > this for an edge case they don't care about as a precondition to get a > migration driver. I wonder if we need a region flag to opt out of IOMMU mappings for devices that do not support p2p use cases. If there were no IOMMU mappings and mediation handled CPU-driven MMIO accesses, then we'd have a viable NDMA mode for hns. > > There's no reason that mediation while in the NDMA state needs to > > impose any performance penalty against the default RUNNING state. > > Eh? Mitigation of no NDMA support would have to mediate the MMIO on a > performance doorbell path, there is no escaping a performance > hit. 
> > > > > > I can add some words at the bottom, but since NDMA is a completely > > > transparent optional feature I don't see any reason to have v2. > > > > It's hardly transparent, aiui userspace is going to need to impose a > > variety of loosely defined restrictions for devices without NDMA > > support. It would be far easier if we could declare NDMA support to be > > a requirement. > > It would make userspace a bit simpler at the cost of excluding or > complicating devices like hns for a use case they don't care about. > > On the other hand, the simple solution in qemu is when there is no > universal NDMA it simply doesn't include any MMIO ranges in the > IOMMU. That leads to mysterious performance penalties when a VM was previously able to make use of (ex.) GPUDirect, but adds an hns device and suddenly can no longer use p2p. Or rejected hotplugs if a VM has existing NDMA capable devices, p2p might be active, but the new device does not support NDMA. This all makes it really complicated to get deterministic behavior for devices. I don't know how to make QEMU behavior predictable and supportable in such an environment. > > As I think Connie also had trouble with, combining device_state with > > IOMMU migration features and VMM state, without any preceding context > > and visual cues makes the section confusing. I did gain context as I > > read further through the doc, but I also had the advantage of being > > rather familiar with the topic. Maybe a table format would help to > > segment the responsibilities? > > I moved the context to the bottom exactly because Connie said it was > confusing at the start. :) > > This is a RST document so I'm not keen to make huge formatting > adventures for minimal readability gain. > > I view this as something that probably needs to be read a few times, > along with the code and header files for someone brand new to > understand. I'm OK with that, it is about consistent with kernel docs > of this level. 
> > What I would like is if userspace-focused readers can get their > important bits of information with less work. > > > > It is existing behavior of qemu - which is why we documented it. > > > > QEMU resets devices as part of initializing the VM, but I don't see > > that QEMU specifically resets a device in order to transition it to > > the RESUMING device_state. > > We instrumented the kernel and monitored qemu, it showed up on the > resume traces. Exactly, it's part of the standard VM initialization sequence, not specific to the device entering the RESUMING state. > > > Either qemu shouldn't do it as devices must fully self-reset, or we > > > should have it part of the canonical flow and devices may as well > > > expect it. It is useful because post VFIO_DEVICE_RESET all DMA is > > > quiet, no outstanding PRIs exist, etc etc. > > > > It's valid for QEMU to reset the device any time it wants, saying that > > it cannot perform a reset before transitioning to the RESUMING state is > > absurd. Userspace can do redundant things for their own convenience. > > I didn't say cannot, I said it shouldn't. Redundant perhaps, but not dis-recommended other than for overhead. > Since qemu is the only implementation it would be easy for drivers to > rely on the implicit reset it seems to do, it seems an important point > that should be written either way. > > I don't have a particular requirement to have the reset, but it does > seem like a good idea. If you feel strongly, then let's say the > opposite that the driver must enter RESUME with no preconditions, > doing an internal reset if required. It seems cleaner to me than unnecessarily requiring userspace to pass through an ioctl in order to get to the next device_state. > > We don't currently specify any precondition for a device to enter the > RESUMING state. 
The driver can of course nak the state change with an > > errno, or hard nak it with an errno and ERROR device_state, which would > > require userspace to make use of VFIO_DEVICE_RESET. > > I don't think we should be relying on every driver doing something > totally different on the standard path. That is only going to hurt > interoperability. I agree, the above is the exception path; devices should typically allow a transition to RESUMING with no precondition, which is why I suggest an internal reset as necessary. > > > > As with the previous flows, it seems like there's a ton of implicit > > > > knowledge here. Why are we documenting these here rather than in the > > > > uAPI header? > > > > > > Because this is 300 lines already and is too complicated/long to > > > properly live in a uapi header. > > > > Minimally we need to resolve that this document must be consistent with > > the uAPI. I'm not sure that's entirely the case in this draft. > > Can you point to something please? I can't work with "I'm not sure" The reset behavior that I'm trying to clarify above is the primary offender, but "I'm not sure" I understand the bit prioritization enough to know that there isn't something there as well. I'm also not sure if the "end of stream" phrasing below matches the uAPI. > IMO the header file doesn't really say much and can be read in a way > that is consistent with this more specific document. But if this document is suggesting the mlx5/QEMU interpretation is the only valid interpretation for driver authors, those clarifications should be pushed back into the uAPI header. > > > - qemu doesn't support P2P cases due to the NDMA topic > > > > Or rather QEMU does support p2p cases regardless of the NDMA topic. > > I mean support in a way that is actually usable as without NDMA it > corrupts the VM when it migrates it. I think we're in agreement, I'm just pointing out that QEMU currently has no NDMA support nor does anything to prevent p2p with non-NDMA devices. 
> > > - simple devices like HNS will work, but not robustly in the face of > > > a hostile VM and multiple VFIO devices. > > > > So what's the goal here, are we trying to make the one currently > > implemented and unsupported userspace be the gold standard to which > > drivers should base their implementation? > > I have no idea anymore. You asked for docs and a complete picture as a > precondition for merging a driver. Here it is. > > What do you want? Looking through past conversations, I definitely asked that we figure out how NDMA is going to work. Are you thinking of a different request from me? The restrictions implied by lack of NDMA support are pretty significant. Maybe a given device like hns doesn't care because they don't intend to support p2p, but they should care if it means we can't hot-add their device to a VM in the presence of devices that can support p2p and if cold plugging their device into an existing configuration implies loss of functionality or performance elsewhere. I'd tend to expect though that we'd incorporate NDMA documentation into the uAPI with a formal proposal for discovery and outline a number of those usage implications. > > We've tried to define a specification that's more flexible than a > > single implementation and by these standards we seem to be flipping > > that implementation back into the specification. > > What specification!?! All we have is a couple lines in a header file > that is nowhere near detailed enough for multi-driver > interoperability with userspace. You have no idea how much effort has > been expended to get this far based on the few breadcrumbs that were > left, and we have access to the team that made the only other > implementation! > > *flexible* is not a specification. There are approximately 220 lines of the uAPI header file dealing specifically with the migration region. A bit more than a couple. 
We've tried to define it by looking at the device requirements to support migration rather than tailor it specifically to the current QEMU implementation of migration. Attempting to undo that generality by suggesting only current usage patterns are relevant is therefore going to generate friction. > > Userspace can attempt RESUMING -> RUNNING regardless of what we specify, > > so a driver needs to be prepared for such an attempted state change > > either way. So what's the advantage to telling a driver author that > > they can expect a given behavior? > > The above didn't tell a driver author to expect a certain behavior, it > tells userspace what to do. "The migration driver can rely on user-space issuing a VFIO_DEVICE_RESET prior to starting RESUMING." > > It doesn't make much sense to me to glue two separate userspace > > operations together to say these must be done in this sequence, back to > > back. If we want the device to be reset in order to enter RESUMING, the > > driver should simply reset the device as necessary during the state > > transition. The outward effect to the user is to specify that device > > internal state may not be retained on transition from RUNNING -> > > RESUMING. > > Maybe, and I'm happy if you want to specify this instead. It just > doesn't match what we observe qemu to be doing. Tracing that shows a reset preceding entering RESUMING doesn't suggest to me that QEMU is performing a reset for the specific purpose of entering RESUMING. Correlation != causation. > > > Do you have an alternative language? This is quite complicated, I > > > advise people to refer to mlx5's implementation. > > > > I agree with Connie on this, if the reader of the documentation needs > > to look at a specific driver implementation to understand the > > reasoning, the documentation has failed. > > Lets agree on some objective here, this is not trying to be fully > comprehensive, or fully standalone. 
It is intended to drive agreement, > be informative to userspace, and be supplemental to the actual code. > > > If it can be worked out by looking at the device_state write > > function of the mlx5 driver, then surely a sentence or two for each > > priority item can be added here. > > Please give me a suggestion then, because I don't know what will help > here? I'm trying... thus the suggestion just below > > Part of the problem is that the nomenclature is unclear, we're listing > > bit combinations, but not the changed bit(s) and we need to infer the > > state. > > Each line lists the new state, the changed bits are thus any bits that > make up the new state. > > If you look at how mlx5 is constructed each if has a 'did it change' > test followed by 'what state is it in now' > > So the document is read as listing the order the driver enters the new > states. I clarified it as ""must process the new device_state bits in > a priority order"" The order I see in the v5 mlx5 post is:

 if RUNNING 1->0: quiesce + freeze
 if RUNNING or SAVING change && state == !RUNNING | SAVING: save device state
 if RESUMING 0->1: reset device state
 if RESUMING 1->0: load device state
 if RUNNING 0->1: unfreeze + unquiesce

So maybe part of my confusion stems from the fact that the mlx5 driver doesn't support pre-copy, which by the provided list is the highest priority. But what actually makes that the highest priority? Any combination of SAVING and RESUMING is invalid, so we can eliminate those. We obviously can't have RUNNING and !RUNNING, so we can eliminate all cases of !RUNNING, so we can shorten the list relative to prioritizing SAVING|RUNNING to:

 - SAVING | RUNNING
 - NDMA
 - !RESUMING
 - RUNNING
 - !NDMA

It seems like a transition from RESUMING -> SAVING | RUNNING is valid and RESUMING(1->0) triggers the device state to be loaded from the migration area, but SAVING(0->1) clears the data window, so I think this priority scheme means a non-intuitive thing just happened. 
SAVING | RUNNING would need to be processed after !RESUMING, but maybe before RUNNING itself. NDMA only requires that the device cease initiating DMA before the call returns and it only has meaning if RUNNING, so I'm not sure why setting NDMA is given any priority. I guess maybe we're trying (unnecessarily?) to preempt any DMA that might occur as a result of setting RUNNING (so why is it above cases including !RUNNING?)? Obviously presenting a priority scheme without a discussion of the associativity of the states and a barely sketched out nomenclature is really not inviting an actual understanding how this is reasoned (imo). > > flips in the presence of an existing state. I'm not able to obviously > > map the listing above to the latest posted version of the mlx5 driver. > > One of the things we've done is align mlx5 more clearly to this. For > instance it no longer has a mixture of state and old state in the if > statements, it always tests the new state so the tests logically > follow what is written here > > Stripping away the excess the expressions now look like this: > > !(state & VFIO_DEVICE_STATE_RUNNING) > ((state & (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING)) == VFIO_DEVICE_STATE_SAVING)) > (state & VFIO_DEVICE_STATE_RESUMING) > > Which mirror what is written here. I'm feeling like there's a bit of a chicken and egg problem here to decide if this is sufficiently clear documentation before a new posting of the mlx5 driver where the mlx5 driver is the authoritative source for the reasoning of the priority scheme documented here (and doesn't make use of pre-copy). > > > > > + As Peer to Peer DMA is a MMIO touch like any other, it is important that > > > > > + userspace suspend these accesses before entering any device_state where MMIO > > > > > + is not permitted, such as !RUNNING. This can be accomplished with the NDMA > > > > > + state. 
Userspace may also choose to remove MMIO mappings from the IOMMU if the > > > > > + device does not support NDMA and rely on that to guarantee quiet MMIO. > > > > > > > > Seems that would have its own set of consequences. > > > > > > Sure, userspace has to make choices here. > > > > It seems a bit loaded to suggest an alternative choice if it's not > > practical or equivalent. Maybe it's largely the phrasing, I read > > "remove MMIO mappings" as to drop them dynamically, when I think we've > > discussed that userspace might actually preclude these mappings for > > non-NDMA devices such that p2p DMA cannot exist, ever. > > I mean the latter. How about "never install MMIO mappings" ? Yup. > > > Overall it must work in this basic way, and devices have freedom about > > > what internal state they can/will log. There is just a clear division > > > that every internal state in the first step is either immutable or > > > logged, and that the second step is a delta over the first. > > > > I agree that it's a reasonable approach, though as I read the proposed > > text, there's no mention of immutable state and no reason a driver > > would implement a dirty log for immutable state, therefore we seem to > > be suggesting such data for the stop-and-copy phase when it would > > actually be preferable to include it in pre-copy. > > I'd say that is a detail we don't need to discuss/define, it has no > user space visible consequence. The migration data stream is entirely opaque to userspace, so what's the benefit to userspace to suggest anything about the content in each phase? This is presented in a userspace edge concerns section, but the description is much more relevant to a driver author. > > I think the fact that a user is not required to run the pre-copy > > phase until completion is also noteworthy. > > This text doesn't try to detail how the migration window works, that > is a different large task. 
The intention is that the migration window > must be fully drained to be successful. Clarification, the *!RUNNING* migration window must be fully drained. > I added this for some clarity ""The entire migration data, up to each > end of stream must be transported from the saving to resuming side."" Per the uAPI regarding pre-copy: "The user must not be required to consume all migration data before the device transitions to a new state, including the stop-and-copy state." If "end of stream" suggests the driver defined end of the data stream for pre-copy rather than simply the end of the user accumulated data stream, that conflicts with the uAPI. Thanks, Alex ^ permalink raw reply [flat|nested] 29+ messages in thread
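For readers trying to follow the expression list debated in this exchange, the new-state tests can be written out as a compilable sketch. The bit values are the v1 migration uAPI values from include/uapi/linux/vfio.h; the action labels are invented here purely to show the ordering, they are not mlx5 function names:

```c
#include <assert.h>
#include <string.h>

/* v1 migration region device_state bits (include/uapi/linux/vfio.h) */
#define VFIO_DEVICE_STATE_RUNNING  (1 << 0)
#define VFIO_DEVICE_STATE_SAVING   (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING (1 << 2)

/*
 * Model of the priority-ordered tests quoted above: each test looks
 * only at the NEW state value; the caller has already validated the
 * transition.  Returns the action sequence as a string, for
 * illustration only.
 */
static const char *actions_for(unsigned int state)
{
	static char buf[64];

	buf[0] = '\0';
	if (!(state & VFIO_DEVICE_STATE_RUNNING))
		strcat(buf, "quiesce+freeze ");
	if ((state & (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING)) ==
	    VFIO_DEVICE_STATE_SAVING)
		strcat(buf, "save-state ");
	if (state & VFIO_DEVICE_STATE_RESUMING)
		strcat(buf, "load-state ");
	return buf;
}
```

For example, writing device_state = SAVING (stop-and-copy) matches both the first and second tests, so the device quiesces before its state is saved, which is exactly the ordering property the priority list is meant to capture.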
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-01 20:03 ` Alex Williamson @ 2021-12-01 23:25 ` Jason Gunthorpe 2021-12-02 17:05 ` Cornelia Huck 2021-12-03 18:06 ` Alex Williamson 0 siblings, 2 replies; 29+ messages in thread From: Jason Gunthorpe @ 2021-12-01 23:25 UTC (permalink / raw) To: Alex Williamson Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote: > On Tue, 30 Nov 2021 23:14:07 -0400 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote: > > > > > > From what HNS said the device driver would have to trap every MMIO to > > > > implement NDMA as it must prevent touches to the physical HW MMIO to > > > > maintain the NDMA state. > > > > > > > > The issue is that the HW migration registers can stop processing the > > > > queue and thus enter NDMA but a MMIO touch can resume queue > > > > processing, so NDMA cannot be sustained. > > > > > > > > Trapping every MMIO would have a huge negative performance impact. So > > > > it doesn't make sense to do so for a device that is not intended to be > > > > used in any situation where NDMA is required. > > > > > > But migration is a cooperative activity with userspace. If necessary > > > we can impose a requirement that mmap access to regions (other than the > > > migration region itself) are dropped when we're in the NDMA or !RUNNING > > > device_state. > > > > It is always NDMA|RUNNING, so we can't fully drop access to > > MMIO. Userspace would have to transfer from direct MMIO to > > trapping. With enough new kernel infrastructure and qemu support it > > could be done. > > This is simply toggling whether the mmap MemoryRegion in QEMU is > enabled, when not enabled access falls through to the read/write access > functions. 
We already have this functionality for unmasking INTx > interrupts when we don't have EOI support. > > > Even so, we can't trap accesses through the IOMMU so such a scheme > > would still require removing IOMMU access to the device. Given that the > > basic qemu mitigation for no NDMA support is to eliminate P2P cases by > > removing the IOMMU mappings this doesn't seem to advance anything and > > only creates complexity. > > NDMA only requires that the device cease initiating DMA, so I suppose > you're suggesting that the MMIO of a device that doesn't expect to make > use of p2p DMA could be poked through DMA, which might cause the device > to initiate a DMA and potentially lose sync with mediation. That would > be bad, but seems like a non-issue for hns. Yes, not sure how you get to a non-issue though? If the IOMMU map is present the huawei device can be attacked by a hostile VM and forced to exit NDMA. All this takes is any two DMA capable devices to be plugged in? > > At least I'm not going to insist that hns do all kinds of work like > > this for an edge case they don't care about as a precondition to get a > > migration driver. > > I wonder if we need a region flag to opt-out of IOMMU mappings for > devices that do not support p2p use cases. If there were no IOMMU > mappings and mediation handled CPU driven MMIO accesses, then we'd have > a viable NDMA mode for hns. This is a good idea: if we want to make huawei support NDMA then flagging it to never be in the iommu map in the first place is a great solution. Then they can use CPU mmio trapping to get the rest of the way. > > > There's no reason that mediation while in the NDMA state needs to > > > impose any performance penalty against the default RUNNING state. > > > > Eh? Mitigation of no NDMA support would have to mediate the MMIO on a > > performance doorbell path, there is no escaping a performance > > hit.
I'm not sure what you mean > > Read it again, I'm suggesting that mediation during NDMA doesn't need > to carry over any performance penalty to the default run state. We > don't care if mediation imposes a performance penalty while NDMA is set, > we're specifically attempting to quiesce the device. The slower it > runs the better ;) OK, I don't read it like that. It seems OK to have a performance hit in NDMA since it is only a short grace state. > > It would make userspace a bit simpler at the cost of excluding or > > complicating devices like hns for a use case they don't care about. > > > > On the other hand, the simple solution in qemu is when there is no > > universal NDMA it simply doesn't include any MMIO ranges in the > > IOMMU. > > That leads to mysterious performance penalties when a VM was previously > able to make use of (ex.) GPUDirect, Not a penalty, it will just explode somehow. There is no way right now for a VM to know P2P doesn't work. It is one of these annoying things that leaks to the VMM like the no-snoop mess. A VMM installing a device combination that is commonly used with P2P, like a GPU and a NIC, had better make sure P2P works :) > but adds an hns device and suddenly can no longer use p2p. Or > rejected hotplugs if a device has existing NDMA capable devices, p2p > might be active, but the new device does not support NDMA. I don't think qemu can go back on what it already did, so rejected hotplug seems the only choice. > This all makes it really complicated to get deterministic behavior > for devices. I don't know how to make QEMU behavior predictable and > supportable in such an environment. And this is the thing that gives me pause to think maybe the huawei device should do the extra work? On the other hand I suspect their use case is fine with qemu set to disable P2P completely. OTOH "supportable" qemu could certainly make the default choice to require NDMA-capable devices for simplicity.
> > Since qemu is the only implementation it would be easy for drivers to > > rely on the implicit reset it seems to do, it seems an important point > > that should be written either way. > > > > I don't have a particular requirement to have the reset, but it does > > seem like a good idea. If you feel strongly, then let's say the > > opposite that the driver must enter RESUME with no preconditions, > > doing an internal reset if required. > > It seems cleaner to me than unnecessarily requiring userspace to pass > through an ioctl in order to get to the next device_state. Ok, I'll add something like this. > > Can you point to something please? I can't work with "I'm not sure" > > The reset behavior that I'm trying to clarify above is the primary > offender, but "I'm not sure" I understand the bit prioritization enough > to know that there isn't something there as well. I'm also not sure if > the "end of stream" phrasing below matches the uAPI. Realistically I think userspace should not make use of the bit prioritization. It is more something driver implementors should follow for consistent behavior. > > IMO the header file doesn't really say much and can be read in a way > > that is consistent with this more specific document. > > But if this document is suggesting the mlx5/QEMU interpretation is the > only valid interpretations for driver authors, those clarifications > should be pushed back into the uAPI header. Can we go the other way and move more of the uAPI header text here? > > I have no idea anymore. You asked for docs and complete picture as a > > percondition for merging a driver. Here it is. > > > > What do you want? > > Looking through past conversations, I definitely asked that we figure > out how NDMA is going to work. Are you thinking of a different request > from me? It was part of the whole etherpad thing. This was what we said we'd do to resolve the discussion.
I expect to come to some agreement with you and Connie on this text and we will go ahead. > The restriction implied by lack of NDMA support are pretty significant. > Maybe a given device like hns doesn't care because they don't intend to > support p2p, but they should care if it means we can't hot-add their > device to a VM in the presences of devices that can support p2p and if > cold plugging their device into an existing configuration implies loss > of functionality or performance elsewhere. > > I'd tend to expect though that we'd incorporate NDMA documentation into > the uAPI with a formal proposal for discovery and outline a number of > those usage implications. Yishai made a patch, but we have put the discussion of NDMA here, not hidden in a commit message. > > > We've tried to define a specification that's more flexible than a > > > single implementation and by these standards we seem to be flipping > > > that implementation back into the specification. > > > > What specification!?! All we have is a couple lines in a header file > > that is no where near detailed enough for multi-driver > > interoperability with userspace. You have no idea how much effort has > > been expended to get this far based on the few breadcrumbs that were > > left, and we have access to the team that made the only other > > implementation! > > > > *flexible* is not a specification. > > There are approximately 220 lines of the uAPI header file dealing > specifically with the migration region. A bit more than a couple. Unfortunately more than half of that describes how the data window works, and half of the rest consists of fairly obvious statements. > We've tried to define it by looking at the device requirements to > support migration rather than tailor it specifically to the current QEMU > implementation of migration. Attempting to undo that generality by > suggesting only current usage patterns are relevant is therefore going > to generate friction.
In your mind you see generality, in our mind we want to know how to write an interoperable driver and there is no documentation saying how to do that. > > > Userspace can attempt RESUMING -> RUNNING regardless of what we specify, > > > so a driver needs to be prepared for such an attempted state change > > > either way. So what's the advantage to telling a driver author that > > > they can expect a given behavior? > > > > The above didn't tell a driver author to expect a certain behavior, it > > tells userspace what to do. > > "The migration driver can rely on user-space issuing a > VFIO_DEVICE_RESET prior to starting RESUMING." I trimmed too much, the original text you quoted was "To abort a RESUMING issue a VFIO_DEVICE_RESET." Which I still think is fine. > Tracing that shows a reset preceding entering RESUMING doesn't suggest > to me that QEMU is performing a reset for the specific purpose of > entering RESUMING. Correlation != causation. Kernel doesn't care why qemu did it - it was done. Intent doesn't matter :) > The order I see in the v5 mlx5 post is: > > if RUNNING 1->0 > quiesce + freeze > if RUNNING or SAVING change && state == !RUNNING | SAVING > save device state > if RESUMING 0->1 > reset device state > if RESUMING 1->0 > load device state > if RUNNING 0->1 > unfreeze + unquiesce Right, which matches the text: - !RUNNING - SAVING | !RUNNING - RESUMING - !RESUMING - RUNNING > So maybe part of my confusion stems from the fact that the mlx5 driver > doesn't support pre-copy, which by the provided list is the highest > priority. Right. > But what actually makes that the highest priority? Any > combination of SAVING and RESUMING is invalid, so we can eliminate > those. We obviously can't have RUNNING and !RUNNING, so we can > eliminate all cases of !RUNNING, so we can shorten the list relative to > prioritizing SAVING|RUNNING to: There are several orders that can make sense.
What we've found is that following the reference flow order gives something workable for precedence. > SAVING | RUNNING would need to be processed after !RESUMING, but > maybe before RUNNING itself. It is a good point, it does make more sense after RUNNING as a device should be already RUNNING before entering pre-copy. I moved it to before !NDMA. > NDMA only requires that the device cease initiating DMA before the call > returns and it only has meaning if RUNNING, so I'm not sure why setting > NDMA is given any priority. I guess maybe we're trying > (unnecessarily?) to preempt any DMA that might occur as a result of > setting RUNNING (so why is it above cases including !RUNNING?)? Everything was given priority so there is no confusing omission. For the order, as NDMA has no meaning outside RUNNING, it makes sense you'd do a NDMA before making it meaningless / after making it meaningful. This is made concrete by mlx5's scheme that always requires quiesce (NDMA) to happen before freeze (!RUNNING) and vice versa, so we get to this order. mlx5 implicitly does NDMA on !RUNNING. > Obviously presenting a priority scheme without a discussion of the > associativity of the states and a barely sketched out nomenclature is > really not inviting an actual understanding how this is reasoned (imo). Sure, but how we got here isn't really important to the intent of the document to guide implementors. Well, you wrote a lot, and found a correction, but I haven't been left with a way to write this more clearly? Now that you obviously understand what it is saying, what do you think? > I'm feeling like there's a bit of a chicken and egg problem here to > decide if this is sufficiently clear documentation before a new posting > of the mlx5 driver where the mlx5 driver is the authoritative source > for the reasoning of the priority scheme documented here (and doesn't > make use of pre-copy). I wouldn't fixate on the ordering, it is a small part of the document.
> The migration data stream is entirely opaque to userspace, so what's > the benefit to userspace to suggest anything about the content in each > phase? This is presented in a userspace edge concerns section, but the > description is much more relevant to a driver author. It is informative for the device driver author to understand what device functionality to map to this. > > > I think the fact that a user is not required to run the pre-copy > > > phase until completion is also noteworthy. > > > > This text doesn't try to detail how the migration window works, that > > is a different large task. The intention is that the migration window > > must be fully drained to be successful. > > Clarification, the *!RUNNING* migration window must be fully drained. > > > I added this for some clarity ""The entire migration data, up to each > > end of stream must be transported from the saving to resuming side."" > > Per the uAPI regarding pre-copy: > > "The user must not be required to consume all migration data before > the device transitions to a new state, including the stop-and-copy > state." > > If "end of stream" suggests the driver defined end of the data stream > for pre-copy rather than simply the end of the user accumulated data > stream, that conflicts with the uAPI. Thanks, Hmm, yes. I can try to clarify how this all works better. We don't implement pre-copy but it should still be described better than it has been. I'm still not sure how this works. We are in SAVING|RUNNING and we dump all the dirty data and return end of stream. We stay in SAVING|RUNNING and some more device state became dirty. How does userspace know? Should it poll and see if the stream got longer? 
Below is what I collected from your feedback so far. Thanks, Jason diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst index d9be47570f878c..2ff47823a889b4 100644 --- a/Documentation/driver-api/vfio.rst +++ b/Documentation/driver-api/vfio.rst @@ -258,7 +258,9 @@ Setting/clearing bit groups triggers device action, and each bit controls a continuous device behavior. Along with the device_state the migration driver provides a data window which -allows streaming migration data into or out of the device. +allows streaming migration data into or out of the device. The entire +migration data, up to the end of stream, must be transported from the saving to +resuming side. A lot of flexibility is provided to user-space in how it operates these bits. What follows is a reference flow for saving device state in a live @@ -299,12 +301,9 @@ entities (VCPU_RUNNING and DIRTY_TRACKING) the VMM controls fit in. and the reference flow for resuming: RUNNING - Issue VFIO_DEVICE_RESET to clear the internal device state - 0 - Device is halted + Use ioctl(VFIO_GROUP_GET_DEVICE_FD) to obtain a fresh device RESUMING - Push in migration data. Data captured during pre-copy should be - prepended to data captured during SAVING. + Push in migration data. NDMA | RUNNING Peer to Peer DMA grace state RUNNING, VCPU_RUNNING @@ -315,48 +314,73 @@ states act as cross device synchronization points. The VMM must bring all devices to the grace state before advancing past it. The above reference flows are built around specific requirements on the -migration driver for its implementation of the migration_state input. +migration driver for its implementation of the device_state input.
-The migration_state cannot change asynchronously, upon writing the -migration_state the driver will either keep the current state and return +The device_state cannot change asynchronously, upon writing the +device_state the driver will either keep the current state and return failure, return failure and go to ERROR, or succeed and go to the new state. -Event triggered actions happen when user-space requests a new migration_state -that differs from the current migration_state. Actions happen on a bit group +Event triggered actions happen when user-space requests a new device_state +that differs from the current device_state. Actions happen on a bit group basis: + SAVING + The device clears the data window and prepares to stream migration data. + The entire data from the start of SAVING to the end of stream is transferred + to the other side to execute a resume. + SAVING | RUNNING - The device clears the data window and begins streaming 'pre copy' migration - data through the window. Devices that cannot log internal state changes - return a 0 length migration stream. + The device begins streaming 'pre copy' migration data through the window. + + A device that does not support internal state logging should return a 0 + length stream. + + The migration window may reach an end of stream, this can be a permanent or + temporary condition. + + User space can do SAVING | !RUNNING at any time, any in progress transfer + through the migration window is carried forward. + + This allows the device to implement a dirty log for its internal state. + During this state the data window should present the device state being + logged and during SAVING | !RUNNING the data window should transfer the + dirtied state and conclude the migration data. + + The state is only concerned with internal device state. External DMAs are + covered by the separate DIRTY_TRACKING function.
SAVING | !RUNNING - The device captures its internal state that is not covered by internal - logging, as well as any logged changes. + The device captures its internal state and streams it through the + migration window. - The device clears the data window and begins streaming the captured - migration data through the window. Devices that cannot log internal state - changes stream all their device state here. + When the migration window reaches an end of stream the saving is concluded + and there is no further data. All of the migration data streamed from the + time SAVING starts to this final end of stream is concatenated together + and provided to RESUMING. + + Devices that cannot log internal state changes stream all their device + state here. RESUMING The data window is cleared, opened, and can receive the migration data - stream. + stream. The device must always be able to enter resuming and it may reset + the device to do so. !RESUMING All the data transferred into the data window is loaded into the device's - internal state. The migration driver can rely on user-space issuing a - VFIO_DEVICE_RESET prior to starting RESUMING. + internal state. - To abort a RESUMING issue a VFIO_DEVICE_RESET. + The internal state of a device is undefined while in RESUMING. To abort a + RESUMING and return to a known state issue a VFIO_DEVICE_RESET. If the migration data is invalid then the ERROR state must be set. -Continuous actions are in effect when migration_state bit groups are active: +Continuous actions are in effect when device_state bit groups are active: RUNNING | NDMA The device is not allowed to issue new DMA operations. - Whenever the kernel returns with a migration_state of NDMA there can be no + Whenever the kernel returns with a device_state of NDMA there can be no in progress DMAs. 
!RUNNING @@ -384,24 +408,24 @@ Continuous actions are in effect when migration_state bit groups are active: during ERROR to avoid corrupting other devices or a VM during a failed migration. -When multiple bits change in the migration_state they may describe multiple -event triggered actions, and multiple changes to continuous actions. The -migration driver must process them in a priority order: +When multiple bits change in the device_state they may describe multiple event +triggered actions, and multiple changes to continuous actions. The migration +driver must process the new device_state bits in a priority order: - - SAVING | RUNNING - NDMA - !RUNNING - SAVING | !RUNNING - RESUMING - !RESUMING - RUNNING + - SAVING | RUNNING - !NDMA In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the device back to device_state RUNNING. When a migration driver executes this -ioctl it should discard the data window and set migration_state to RUNNING as +ioctl it should discard the data window and set device_state to RUNNING as part of resetting the device to a clean state. This must happen even if the -migration_state has errored. A freshly opened device FD should always be in +device_state has errored. A freshly opened device FD should always be in the RUNNING state. The migration driver has limitations on what device state it can affect. Any @@ -438,8 +462,9 @@ implementing migration: As Peer to Peer DMA is a MMIO touch like any other, it is important that userspace suspend these accesses before entering any device_state where MMIO is not permitted, such as !RUNNING. This can be accomplished with the NDMA - state. Userspace may also choose to remove MMIO mappings from the IOMMU if the - device does not support NDMA and rely on that to guarantee quiet MMIO. + state. Userspace may also choose to never install MMIO mappings into the + IOMMU if devices do not support NDMA and rely on that to guarantee quiet + MMIO. 
The Peer to Peer Grace States exist so that all devices may reach RUNNING before any device is subjected to a MMIO access. @@ -458,16 +483,6 @@ implementing migration: Device that do not support NDMA cannot be configured to generate page faults that require the VCPU to complete. -- pre-copy allows the device to implement a dirty log for its internal state. - During the SAVING | RUNNING state the data window should present the device - state being logged and during SAVING | !RUNNING the data window should present - the unlogged device state as well as the changes from the internal dirty log. - - On RESUME these two data streams are concatenated together. - - pre-copy is only concerned with internal device state. External DMAs are - covered by the separate DIRTY_TRACKING function. - - Atomic Read and Clear of the DMA log is a HW feature. If the tracker cannot support this, then NDMA could be used to synthesize it less efficiently. @@ -476,14 +491,23 @@ implementing migration: are pushed down to the next step in the sequence and various behaviors that rely on NDMA cannot be used. -- Migration control registers inside the same iommu_group as the VFIO device. - This immediately raises a security concern as user-space can use Peer to Peer - DMA to manipulate these migration control registers concurrently with + NDMA is made optional to support simple HW implementations that either just + cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only + necessary for special features like P2P and PRI, so devices can omit it in + exchange for limitations on the guest. + +- Devices that have their HW migration control MMIO registers inside the same + iommu_group as the VFIO device have some special considerations. In this + case a driver will be operating HW registers from kernel space that are also + subjected to userspace controlled DMA due to the iommu_group. 
+ + This immediately raises a security concern as user-space can use Peer to + Peer DMA to manipulate these migration control registers concurrently with any kernel actions. A device driver operating such a device must ensure that kernel integrity cannot be broken by hostile user space operating the migration MMIO - registers via peer to peer, at any point in the sequence. Notably the kernel + registers via peer to peer, at any point in the sequence. Further the kernel cannot use DMA to transfer any migration data. However, as discussed above in the "Device Peer to Peer DMA" section, it can ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-01 23:25 ` Jason Gunthorpe @ 2021-12-02 17:05 ` Cornelia Huck 2021-12-02 17:41 ` Jason Gunthorpe 2021-12-03 18:06 ` Alex Williamson 1 sibling, 1 reply; 29+ messages in thread From: Cornelia Huck @ 2021-12-02 17:05 UTC (permalink / raw) To: Jason Gunthorpe, Alex Williamson Cc: Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Wed, Dec 01 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote: >> But if this document is suggesting the mlx5/QEMU interpretation is the >> only valid interpretations for driver authors, those clarifications >> should be pushed back into the uAPI header. > > Can we go the other way and move more of the uAPI header text here? Where should a userspace author look when they try to implement support for vfio migration? I think we need to answer that question first. Maybe we should separate "these are the rules that an implementation must obey" from "here's a more verbose description of how things work, and how you can arrive at a working implementation". The former would go into the header, while the latter can go into this document. (The generated documentation can be linked from the header file.) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-02 17:05 ` Cornelia Huck @ 2021-12-02 17:41 ` Jason Gunthorpe 2021-12-02 17:46 ` Cornelia Huck 0 siblings, 1 reply; 29+ messages in thread From: Jason Gunthorpe @ 2021-12-02 17:41 UTC (permalink / raw) To: Cornelia Huck, Jonathan Corbet, linux-doc Cc: Alex Williamson, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Thu, Dec 02, 2021 at 06:05:36PM +0100, Cornelia Huck wrote: > On Wed, Dec 01 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote: > >> But if this document is suggesting the mlx5/QEMU interpretation is the > >> only valid interpretations for driver authors, those clarifications > >> should be pushed back into the uAPI header. > > > > Can we go the other way and move more of the uAPI header text here? > > Where should a userspace author look when they try to implement support > for vfio migration? I think we need to answer that question first. > > Maybe we should separate "these are the rules that an implementation > must obey" from "here's a more verbose description of how things work, > and how you can arrive at a working implementation". The former would go > into the header, while the latter can go into this document. (The > generated documentation can be linked from the header file.) I think the usual kernel expectation now is to find userspace information either in man pages or in the Documentation/ html pages? The uapi header is fine to be a terse summary of what the ioctl does and some important points, but I wouldn't try to write a spec for anything complicated in a header file. Maybe Jonathan has some advice? Thanks, Jason ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-02 17:41 ` Jason Gunthorpe @ 2021-12-02 17:46 ` Cornelia Huck 0 siblings, 0 replies; 29+ messages in thread From: Cornelia Huck @ 2021-12-02 17:46 UTC (permalink / raw) To: Jason Gunthorpe, Jonathan Corbet, linux-doc Cc: Alex Williamson, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Thu, Dec 02 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Thu, Dec 02, 2021 at 06:05:36PM +0100, Cornelia Huck wrote: >> On Wed, Dec 01 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: >> >> > On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote: >> >> But if this document is suggesting the mlx5/QEMU interpretation is the >> >> only valid interpretations for driver authors, those clarifications >> >> should be pushed back into the uAPI header. >> > >> > Can we go the other way and move more of the uAPI header text here? >> >> Where should a userspace author look when they try to implement support >> for vfio migration? I think we need to answer that question first. >> >> Maybe we should separate "these are the rules that an implementation >> must obey" from "here's a more verbose description of how things work, >> and how you can arrive at a working implementation". The former would go >> into the header, while the latter can go into this document. (The >> generated documentation can be linked from the header file.) > > I think the usual kernel expectation now is to find userspace > information either in man pages or in the Documentation/ html pages? > > The uapi header is fine to be a terse summary of what the ioctl does > and some important points, but I wouldn't try to write a spec for > anything complicated in a header file. I was thinking less of a complete spec, more of "these are the fields with some basic rules, consult $LINK for more information". ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-01 23:25 ` Jason Gunthorpe 2021-12-02 17:05 ` Cornelia Huck @ 2021-12-03 18:06 ` Alex Williamson 2021-12-06 16:03 ` Cornelia Huck 2021-12-06 19:15 ` Jason Gunthorpe 1 sibling, 2 replies; 29+ messages in thread From: Alex Williamson @ 2021-12-03 18:06 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Wed, 1 Dec 2021 19:25:02 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote: > > On Tue, 30 Nov 2021 23:14:07 -0400 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote: > > > > > > > > From what HNS said the device driver would have to trap every MMIO to > > > > > implement NDMA as it must prevent touches to the physical HW MMIO to > > > > > maintain the NDMA state. > > > > > > > > > > The issue is that the HW migration registers can stop processing the > > > > > queue and thus enter NDMA but a MMIO touch can resume queue > > > > > processing, so NDMA cannot be sustained. > > > > > > > > > > Trapping every MMIO would have a huge negative performance impact. So > > > > > it doesn't make sense to do so for a device that is not intended to be > > > > > used in any situation where NDMA is required. > > > > > > > > But migration is a cooperative activity with userspace. If necessary > > > > we can impose a requirement that mmap access to regions (other than the > > > > migration region itself) are dropped when we're in the NDMA or !RUNNING > > > > device_state. > > > > > > It is always NDMA|RUNNING, so we can't fully drop access to > > > MMIO. Userspace would have to transfer from direct MMIO to > > > trapping. With enough new kernel infrastructure and qemu support it > > > could be done. 
> > > > This is simply toggling whether the mmap MemoryRegion in QEMU is > > enabled, when not enabled access falls through to the read/write access > > functions. We already have this functionality for unmasking INTx > > interrupts when we don't have EOI support. > > > > > Even so, we can't trap accesses through the IOMMU so such a scheme > > > would still require removing IOMMU acess to the device. Given that the > > > basic qemu mitigation for no NDMA support is to eliminate P2P cases by > > > removing the IOMMU mappings this doesn't seem to advance anything and > > > only creates complexity. > > > > NDMA only requires that the device cease initiating DMA, so I suppose > > you're suggesting that the MMIO of a device that doesn't expect to make > > use of p2p DMA could be poked through DMA, which might cause the device > > to initiate a DMA and potentially lose sync with mediation. That would > > be bad, but seems like a non-issue for hns. > > Yes, not sure how you get to a non-issue though? If the IOMMU map is > present the huawei device can be attacked by a hostile VM and forced > to exit NDMA. All this takes is any two DMA capable devices to be > plugged in? I'm confused by your idea of a "hostile VM". Devices within a VM can only DMA to memory owned by the VM. So the only hostile VM that can attack this device is the VM that owns the device in the first place. Meanwhile the VMM is transitioning all devices to the NDMA state, so the opportunity to poke one device from another device within the same VM is pretty limited. And for what gain? Is this a DoS opportunity that the guest might trigger the migration to abort or fail? I suggest this might be a non-issue for hns because I understand it to not support any sort of p2p, therefore this DMA write to the device while it's under mediation would never happen in practice. Skipping the IOMMU mappings for the device would prevent it, but I'm still not sure if this goes beyond a VM shooting itself scenario. 
> > > At least I'm not going to insist that hns do all kinds of work like > > > this for an edge case they don't care about as a precondition to get a > > > migration driver. > > > > I wonder if we need a region flag to opt-out of IOMMU mappings for > > devices that do not support p2p use cases. If there were no IOMMU > > mappings and mediation handled CPU driven MMIO accesses, then we'd have > > a viable NDMA mode for hns. > > This is a good idea, if we want to make huawei support NDMA then > flagging it to never be in the iommu map in the first place is a great > solution. Then they can use CPU mmio trapping to get the rest of the way. NB, this flag would only be a hint to the VMM, the vfio IOMMU backend would need to be involved if we intended to prevent such mappings, but that seems like overkill if we're expecting userspace to follow the rules to get working migration and migration drivers have sufficient robustness to maintain device isolation. > > > > There's no reason that mediation while in the NDMA state needs to > > > > impose any performance penalty against the default RUNNING state. > > > > > > Eh? Mitigation of no NDMA support would have to mediate the MMIO on a > > > performance doorbell path, there is no escaping a performance > > > hit. I'm not sure what you mean. > > > > Read it again, I'm suggesting that mediation during NDMA doesn't need > > to carry over any performance penalty to the default run state. We > > don't care if mediation imposes a performance penalty while NDMA is set, > > we're specifically attempting to quiesce the device. The slower it > > runs the better ;) > > OK, I don't read it like that. It seems OK to have a performance hit > in NDMA since it is only a short grace state. Is it the length of time we're in the NDMA state or the utility of the NDMA state itself?
If we're telling the device "no more outbound DMA, no more interrupts", then it doesn't seem super relevant for the majority of use cases what happens to the MMIO latency. > > > It would make userspace a bit simpler at the cost of excluding or > > > complicating devices like hns for a use case they don't care about. > > > > > > On the other hand, the simple solution in qemu is when there is no > > > universal NDMA it simply doesn't include any MMIO ranges in the > > > IOMMU. > > > > That leads to mysterious performance penalties when a VM was previously > > able to make use of (ex.) GPUDirect, > > Not a penalty, it will just explode somehow. There is no way right now > for a VM to know P2P doesn't work. It is one of these annoying things > that leaks to the VMM like the no-snoop mess. A VMM installing a > device combination that is commonly used with P2P, like a GPU and a > NIC, had better make sure P2P works :) > > > but adds an hns device and suddenly can no longer use p2p. Or > > rejected hotplugs if a device has existing NDMA capable devices, p2p > > might be active, but the new device does not support NDMA. > > I don't think qemu can go back on what it already did, so rejected > hotplug seems the only choice. Agreed > > This all makes it really complicated to get deterministic behavior > > for devices. I don't know how to make QEMU behavior predictable and > > supportable in such an environment. > > And this is the thing that gives me pause to think maybe the huawei > device should do the extra work? > > On the other hand I suspect their use case is fine with qemu set to > disable P2P completely. I don't know how we look at that use case in isolation though. > OTOH "supportable" qemu could certainly make the default choice to > require devices for simplicity. I get a bit lost every time I try to sketch out how QEMU would implement it. Forgive the stream of consciousness and rhetorical discussion below...
- Does it make sense that a device itself can opt out of p2p mappings? This is simply an MMIO access from another device rather than from the CPU. Vendors cannot preclude this use case on bare metal, should they be able to in a VM? We create heterogeneous p2p maps, device A can access device B, but device B maybe can't access device A. Seems troublesome.

- If we can't have a per-device policy, then we'd need a per VM policy, likely some way to opt out of all p2p mappings for vfio devices. We need to support hotplug of devices though, so is it a per VM policy or is it a per device policy which needs to be consistent among all attached devices? Perhaps an "enable_p2p=on/off" option on the vfio-pci device, default [on] to match current behavior. For any case of this option being set to non-default, all devices would need to set it to the same value, non-compliant devices rejected.

- We could possibly allow migration=on,enable_p2p=on for a non-NDMA device, but the rules change if additional devices are added, they need to be rejected or migration support implicitly disappears. That seems hard to document, non-deterministic as far as a user is concerned. So maybe for a non-NDMA device we'd require enable_p2p=off in order to set migration=on. That means we could never enable migration on non-NDMA devices by default, which probably means we also cannot enable it by default on NDMA devices or we get user confusion/inconsistency.

- Can a user know which devices will require enable_p2p=off in order to set migration=on? "Read the error log" is a poor user experience and difficult hurdle for libvirt.
So in order to create a predictable QEMU experience in the face of optional NDMA per device, I think we preclude being able to enable migration support for any vfio device by default and we have an exercise to determine how a user or management tool could easily determine NDMA compatibility :-\ > > > Since qemu is the only implementation it would be easy for drivers to > > > rely on the implicit reset it seems to do, it seems an important point > > > that should be written either way. > > > > > > I don't have a particular requirement to have the reset, but it does > > > seem like a good idea. If you feel strongly, then let's say the > > > opposite that the driver must enter RESUMING with no preconditions, > > > doing an internal reset if required. > > > > It seems cleaner to me than unnecessarily requiring userspace to pass > > through an ioctl in order to get to the next device_state. > > Ok, I'll add something like this. > > > > Can you point to something please? I can't work with "I'm not sure" > > > > The reset behavior that I'm trying to clarify above is the primary > > offender, but "I'm not sure" I understand the bit prioritization enough > > to know that there isn't something there as well. I'm also not sure if > > the "end of stream" phrasing below matches the uAPI. > > Realistically I think userspace should not make use of the bit > prioritization. It is more something driver implementors should > follow for consistent behavior. > > > IMO the header file doesn't really say much and can be read in a way > > that is consistent with this more specific document. > > But if this document is suggesting the mlx5/QEMU interpretation is the > only valid interpretation for driver authors, those clarifications > should be pushed back into the uAPI header. > > Can we go the other way and move more of the uAPI header text here?
Currently our Documentation/ file describes an overview of vfio, the high level theory of groups, devices, and iommus, provides a detailed userspace usage example, an overview of device registration and ops, and a bunch of random spapr implementation notes largely around userspace considerations on power platforms. So I would generalize that our Documentation/ is currently focused on userspace access and interaction with the uAPI. The uAPI itself therefore generally provides the detail regarding how a given interface is intended to work from an implementation perspective. I'd say the proposed document here is mixing both of those in Documentation/, some aspects relevant to implementation of the uAPI, others to userspace. Currently I would clearly say that the uAPI header is the canonical source of truth and I think it should continue to be so for describing the implementation of the uAPI. > > > I have no idea anymore. You asked for docs and complete picture as a > > > precondition for merging a driver. Here it is. > > > > > > What do you want? > > > > Looking through past conversations, I definitely asked that we figure > > out how NDMA is going to work. Are you thinking of a different request > > from me? > > It was part of the whole etherpad thing. This was what we said we'd do > to resolve the discussion. > > I expect to come to some agreement with you and Connie on this text > and we will go ahead. > > > The restrictions implied by lack of NDMA support are pretty significant. > > Maybe a given device like hns doesn't care because they don't intend to > > support p2p, but they should care if it means we can't hot-add their > > device to a VM in the presence of devices that can support p2p and if > > cold plugging their device into an existing configuration implies loss > > of functionality or performance elsewhere.
> > > > I'd tend to expect though that we'd incorporate NDMA documentation into > > the uAPI with a formal proposal for discovery and outline a number of > > those usage implications. > > Yishai made a patch, but we have put the discussion of NDMA here, not > hidden in a commit message > > > > > We've tried to define a specification that's more flexible than a > > > > single implementation and by these standards we seem to be flipping > > > > that implementation back into the specification. > > > > > > What specification!?! All we have is a couple lines in a header file > > > that is nowhere near detailed enough for multi-driver > > > interoperability with userspace. You have no idea how much effort has > > > been expended to get this far based on the few breadcrumbs that were > > > left, and we have access to the team that made the only other > > > implementation! > > > > > > *flexible* is not a specification. > > > > There are approximately 220 lines of the uAPI header file dealing > > specifically with the migration region. A bit more than a couple. > > Unfortunately more than half of that describes how the data window > works, and half of the rest is kind of obvious statements. How does that rationalize forking implementation details to userspace Documentation/ rather than resolving the issues you see with the uAPI description? > > We've tried to define it by looking at the device requirements to > > support migration rather than tailor it specifically to the current QEMU > > implementation of migration. Attempting to undo that generality by > > suggesting only current usage patterns are relevant is therefore going > > to generate friction. > > In your mind you see generality, in our mind we want to know how to > write an interoperable driver and there is no documentation saying how > to do that.
I didn't know there was a hive mind over there, maybe that explains your quick replies ;) There's a fine line between writing an interoperable driver and writing a driver for the current QEMU implementation. Obviously we want to support the current QEMU implementation, but we want an interface that can accommodate how that implementation might evolve. Once we start telling driver authors to expect specific flows rather than looking at the operation of each bit, then our implementations become more fragile, less versatile relative to the user. > > > > Userspace can attempt RESUMING -> RUNNING regardless of what we specify, > > > > so a driver needs to be prepared for such an attempted state change > > > > either way. So what's the advantage to telling a driver author that > > > > they can expect a given behavior? > > > > > > The above didn't tell a driver author to expect a certain behavior, it > > > tells userspace what to do. > > > > "The migration driver can rely on user-space issuing a > > VFIO_DEVICE_RESET prior to starting RESUMING." > > I trimmed too much, the original text you quoted was > > "To abort a RESUMING issue a VFIO_DEVICE_RESET." > > Which I still think is fine. If we're writing a specification, that's really a MAY statement, userspace MAY issue a reset to abort the RESUMING process and return the device to RUNNING. They MAY also write the device_state directly, which MAY return an error depending on various factors such as whether data has been written to the migration state and whether that data is complete. If a failed transition results in an ERROR device_state, the user MUST issue a reset in order to return it to a RUNNING state without closing the interface. A recommendation to use reset to skip over potential error conditions when the goal is simply a new, clean RUNNING state irrespective of data written to the migration region, is fine.
But drivers shouldn't be written with only that expectation, just like they shouldn't expect a reset precondition to entering RESUMING.

> > Tracing that shows a reset preceding entering RESUMING doesn't suggest
> > to me that QEMU is performing a reset for the specific purpose of
> > entering RESUMING. Correlation != causation.
>
> Kernel doesn't care why qemu did it - it was done. Intent doesn't
> matter :)

This is exactly the sort of "designed for QEMU implementation" interoperability that I want to avoid. It doesn't take much of a crystal ball to guess that gratuitous and redundant device resets slow VM instantiation and are a likely target for optimization.

> > The order I see in the v5 mlx5 post is:
> >
> >   if RUNNING 1->0
> >     quiesce + freeze
> >   if RUNNING or SAVING change && state == !RUNNING | SAVING
> >     save device state
> >   if RESUMING 0->1
> >     reset device state
> >   if RESUMING 1->0
> >     load device state
> >   if RUNNING 0->1
> >     unfreeze + unquiesce
>
> Right, which matches the text:
>
> - !RUNNING
> - SAVING | !RUNNING
> - RESUMING
> - !RESUMING
> - RUNNING
>
> > So maybe part of my confusion stems from the fact that the mlx5 driver
> > doesn't support pre-copy, which by the provided list is the highest
> > priority.
>
> Right.
>
> > But what actually makes that the highest priority? Any
> > combination of SAVING and RESUMING is invalid, so we can eliminate
> > those. We obviously can't have RUNNING and !RUNNING, so we can
> > eliminate all cases of !RUNNING, so we can shorten the list relative to
> > prioritizing SAVING|RUNNING to:
>
> There are several orders that can make sense. What we've found is
> following the reference flow order has given something workable for
> precedence.

I think though that your reference flow priority depends a lot on your implementation that resuming state is stored somewhere and only processed on the transition of the RESUMING bit. You're sharing a buffer between SAVING and RESUMING.
That's implementation, not specification. A device may choose to validate and incorporate data written to the migration region as it's written. A device may choose to expose actual portions of on-device memory through the migration region during either SAVING or RESUMING. Therefore setting or clearing of these bits may not have the data dependencies that drive the priority scheme of mlx5. So why is this the one true implementation? > > SAVING | RUNNING would need to be processed after !RESUMING, but > > maybe before RUNNING itself. > > It is a good point, it does make more sense after RUNNING as a device > should be already RUNNING before entering pre-copy. I moved it to > before !NDMA > > > NDMA only requires that the device cease initiating DMA before the call > > returns and it only has meaning if RUNNING, so I'm not sure why setting > > NDMA is given any priority. I guess maybe we're trying > > (unnecessarily?) to preempt any DMA that might occur as a result of > > setting RUNNING (so why is it above cases including !RUNNING?)? > > Everything was given priority so there is no confusing omission. For > the order, as NDMA has no meaning outside RUNNING, it makes sense you'd > do an NDMA before making it meaningless / after making it meaningful. > > This is made concrete by mlx5's scheme that always requires quiesce > (NDMA) to happen before freeze (!RUNNING) and vice versa, so we get to > this order. mlx5 implicitly does NDMA on !RUNNING > > > Obviously presenting a priority scheme without a discussion of the > > associativity of the states and a barely sketched out nomenclature is > > really not inviting an actual understanding of how this is reasoned (imo). > > Sure, but how we got here isn't really important to the intent of the > document to guide implementors. > > Well, you wrote a lot, and found a correction, but I haven't been left > with a way to write this more clearly? Now that you obviously > understand what it is saying, what do you think?
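[Editor's sketch] For readers following the precedence debate, the reference-flow ordering being argued about can be written as a single planning function. The RUNNING/SAVING/RESUMING bit values follow the uAPI header; the NDMA bit value is only the proposal from this thread, and the action names and the exact placement of quiesce/unquiesce are illustrative assumptions, not a definitive implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the multi-bit-change precedence: when userspace flips
 * several device_state bits in one write, the driver processes them in
 * a fixed order.  Quiesce happens before freeze, and unfreeze before
 * unquiesce, mirroring the mlx5 ordering described in the thread. */

#define ST_RUNNING  (1u << 0)
#define ST_SAVING   (1u << 1)
#define ST_RESUMING (1u << 2)
#define ST_NDMA     (1u << 3)   /* proposed, not yet in the uAPI */

/* Appends the actions implied by old -> new_, in precedence order. */
static int plan_transition(uint32_t old, uint32_t new_,
                           const char **acts, int max)
{
    int n = 0;
    /* quiesce first: stop initiating DMA before anything else */
    if ((new_ & ST_NDMA) && !(old & ST_NDMA) && n < max)
        acts[n++] = "quiesce";
    if (!(new_ & ST_RUNNING) && (old & ST_RUNNING) && n < max)
        acts[n++] = "freeze";
    if ((new_ & ST_SAVING) && !(new_ & ST_RUNNING) && n < max)
        acts[n++] = "save-state";
    if ((new_ & ST_RESUMING) && !(old & ST_RESUMING) && n < max)
        acts[n++] = "reset-resume-buffer";
    if (!(new_ & ST_RESUMING) && (old & ST_RESUMING) && n < max)
        acts[n++] = "load-state";
    if ((new_ & ST_RUNNING) && !(old & ST_RUNNING) && n < max)
        acts[n++] = "unfreeze";
    if (!(new_ & ST_NDMA) && (old & ST_NDMA) && n < max)
        acts[n++] = "unquiesce";
    return n;
}
```

For example, a stop-and-copy write of RUNNING -> SAVING yields "freeze" then "save-state", while completing resume (RESUMING -> RUNNING) yields "load-state" then "unfreeze" — one possible reading of the list quoted in the discussion, not the only one.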
I think implementation and clarification of the actual definition of the uAPI should exist in the uAPI header, userspace clarification and examples for the consumption of the uAPI should stay here in Documentation/, and we should not assume either the QEMU or mlx5 implementations as the canonical reference for either. > > I'm feeling like there's a bit of a chicken and egg problem here to > > decide if this is sufficiently clear documentation before a new posting > > of the mlx5 driver where the mlx5 driver is the authoritative source > > for the reasoning of the priority scheme documented here (and doesn't > > make use of pre-copy). > > I wouldn't fixate on the ordering, it is a small part of the > document.. > > > The migration data stream is entirely opaque to userspace, so what's > > the benefit to userspace to suggest anything about the content in each > > phase? This is presented in a userspace edge concerns section, but the > > description is much more relevant to a driver author. > > It is informative for the device driver author to understand what > device functionality to map to this. Sounds like we're in agreement, so why does this belong in userspace Documentation/? > > > > I think the fact that a user is not required to run the pre-copy > > > > phase until completion is also noteworthy. > > > > > > This text doesn't try to detail how the migration window works, that > > > is a different large task. The intention is that the migration window > > > must be fully drained to be successful. > > > > Clarification, the *!RUNNING* migration window must be fully drained. > > > > > I added this for some clarity ""The entire migration data, up to each > > > end of stream must be transported from the saving to resuming side."" > > > > Per the uAPI regarding pre-copy: > > > > "The user must not be required to consume all migration data before > > the device transitions to a new state, including the stop-and-copy > > state." 
> > > > If "end of stream" suggests the driver defined end of the data stream > > for pre-copy rather than simply the end of the user accumulated data > > stream, that conflicts with the uAPI. Thanks, > > Hmm, yes. I can try to clarify how this all works better. We don't > implement pre-copy but it should still be described better than it has > been. > > I'm still not sure how this works. > > We are in SAVING|RUNNING and we dump all the dirty data and return end > of stream. > > We stay in SAVING|RUNNING and some more device state became dirty. How > does userspace know? Should it poll and see if the stream got longer? The uAPI might need some clarification here, but the only viable scenario would seem to be that yes, userspace should continue to poll the device so long as it remains in SAVING|RUNNING as internal state can continue to change asynchronously from userspace polling. It's only when pending_bytes returns zero while !RUNNING that the data stream is complete and the device is precluded from deciding any new state is available. Thanks, Alex ^ permalink raw reply [flat|nested] 29+ messages in thread
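[Editor's sketch] The pre-copy polling rule described in the last paragraph — keep polling pending_bytes while SAVING|RUNNING, because internal state can become dirty asynchronously; only pending_bytes == 0 while !RUNNING ends the stream — can be modeled with a toy like the following. The struct and helpers are invented for illustration and do not reflect the real vfio_device_migration_info layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the pre-copy rule: while SAVING|RUNNING the device may
 * report new pending_bytes at any time, so userspace keeps polling;
 * only pending_bytes == 0 while !RUNNING means end of stream. */

struct toy_mig {
    bool running;
    uint64_t pending_bytes;   /* as read back from the migration region */
    uint64_t dirty_later;     /* state that becomes dirty asynchronously */
};

/* One poll iteration: transfer what is currently pending. */
static uint64_t drain_once(struct toy_mig *m)
{
    uint64_t got = m->pending_bytes;
    m->pending_bytes = 0;
    if (m->running) {
        /* device kept running, more state may have become dirty */
        m->pending_bytes = m->dirty_later;
        m->dirty_later = 0;
    }
    return got;
}

static bool stream_complete(const struct toy_mig *m)
{
    return !m->running && m->pending_bytes == 0;
}
```

The key property the toy demonstrates is the one Alex states: a pre-copy pass that drains pending_bytes to zero is not end of stream while RUNNING is still set; only after clearing RUNNING does a zero read mean the device can produce no further data.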
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-03 18:06 ` Alex Williamson @ 2021-12-06 16:03 ` Cornelia Huck 2021-12-06 17:34 ` Jason Gunthorpe 2021-12-06 19:15 ` Jason Gunthorpe 1 sibling, 1 reply; 29+ messages in thread From: Cornelia Huck @ 2021-12-06 16:03 UTC (permalink / raw) To: Alex Williamson, Jason Gunthorpe Cc: Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Fri, Dec 03 2021, Alex Williamson <alex.williamson@redhat.com> wrote: > On Wed, 1 Dec 2021 19:25:02 -0400 > Jason Gunthorpe <jgg@nvidia.com> wrote: > >> On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote: >> > On Tue, 30 Nov 2021 23:14:07 -0400 >> > Jason Gunthorpe <jgg@nvidia.com> wrote: >> > >> > > On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote: >> OTOH "supportable" qemu could certainly make the default choice to >> require devices for simplicity. > > I get a bit lost every time I try to sketch out how QEMU would > implement it. Forgive the stream of consciousness and rhetorical > discussion below... > > - Does it make sense that a device itself can opt-out of p2p mappings? > This is simply an MMIO access from another device rather than from > the CPU. Vendors cannot preclude this use case on bare metal, > should they be able to in a VM? We create heterogeneous p2p maps, > device A can access device B, but device B maybe can't access device > A. Seems troublesome. > > - If we can't have a per-device policy, then we'd need a per VM > policy, likely some way to opt-out of all p2p mappings for vfio > devices. We need to support hotplug of devices though, so is it a > per VM policy or is it a per device policy which needs to be > consistent among all attached devices? Perhaps a > "enable_p2p=on/off" option on the vfio-pci device, default [on] to > match current behavior. 
For any case of this option being set to > non-default, all devices would need to set it to the same value, > non-compliant devices rejected. > > - We could possibly allow migration=on,enable_p2p=on for a non-NDMA > device, but the rules change if additional devices are added, they > need to be rejected or migration support implicitly disappears. That > seems hard to document, non-deterministic as far as a user is > concerned. So maybe for a non-NDMA device we'd require > enable_p2p=off in order to set migration=on. That means we could > never enable migration on non-NDMA devices by default, which > probably means we also cannot enable it by default on NDMA devices > or we get user confusion/inconsistency. > > - Can a user know which devices will require enable_p2p=off in order > to set migration=on? "Read the error log" is a poor user experience > and difficult hurdle for libvirt. > > So in order to create a predictable QEMU experience in the face of > optional NDMA per device, I think we preclude being able to enable > migration support for any vfio device by default and we have an > exercise to determine how a user or management tool could easily > determine NDMA compatibility :-\ Hm, maybe we can take a page out of the confidential guest support book and control this like we do for the virtio iommu flag? I'm not sure whether there is a pressing need to support p2p for non-NDMA devices? > There's a fine line between writing an inter-operable driver and > writing a driver for the current QEMU implementation. Obviously we want > to support the current QEMU implementation, but we want an interface > that can accommodate how that implementation might evolve. Once we > start telling driver authors to expect specific flows rather than > looking at the operation of each bit, then our implementations become > more fragile, less versatile relative to the user. What is actually on the table regarding the current QEMU implementation? 
The interface is still marked as experimental, so we have some room for changes, but we probably don't want to throw everything away. > >> > > > Userspace can attempt RESUMING -> RUNNING regardless of what we specify, >> > > > so a driver needs to be prepared for such an attempted state change >> > > > either way. So what's the advantage to telling a driver author that >> > > > they can expect a given behavior? >> > > >> > > The above didn't tell a driver author to expect a certain behavior, it >> > > tells userspace what to do. >> > >> > "The migration driver can rely on user-space issuing a >> > VFIO_DEVICE_RESET prior to starting RESUMING." >> >> I trimmed too much, the original text you quoted was >> >> "To abort a RESUMING issue a VFIO_DEVICE_RESET." >> >> Which I still think is fine. > > If we're writing a specification, that's really a MAY statement, > userspace MAY issue a reset to abort the RESUMING process and return > the device to RUNNING. They MAY also write the device_state directly, > which MAY return an error depending on various factors such as whether > data has been written to the migration state and whether that data is > complete. If a failed transitions results in an ERROR device_state, > the user MUST issue a reset in order to return it to a RUNNING state > without closing the interface. Are we actually writing a specification? If yes, we need to be more clear on what is mandatory (MUST), advised (SHOULD), or allowed (MAY). If I look at the current proposal, I'm not sure into which category some of the statements fall. > > A recommendation to use reset to skip over potential error conditions > when the goal is simply a new, clean RUNNING state irrespective of data > written to the migration region, is fine. But drivers shouldn't be > written with only that expectation, just like they shouldn't expect a > reset precondition to entering RESUMING. 
> >> > Tracing that shows a reset preceding entering RESUMING doesn't suggest >> > to me that QEMU is performing a reset for the specific purpose of >> > entering RESUMING. Correlation != causation. >> >> Kernel doesn't care why qemu did it - it was done. Intent doesn't >> matter :) > > This is exactly the sort of "designed for QEMU implementation" > inter-operability that I want to avoid. It doesn't take much of a > crystal ball to guess that gratuitous and redundant device resets slow > VM instantiation and are a likely target for optimization. Which brings me back to my question above: Are we wedded to all details of the current QEMU implementation? Should we consider some of them more as a MAY? Can we change QEMU to do things differently, given the experimental nature of the support? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-06 16:03 ` Cornelia Huck @ 2021-12-06 17:34 ` Jason Gunthorpe 2021-12-06 18:06 ` Cornelia Huck 0 siblings, 1 reply; 29+ messages in thread From: Jason Gunthorpe @ 2021-12-06 17:34 UTC (permalink / raw) To: Cornelia Huck Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Mon, Dec 06, 2021 at 05:03:00PM +0100, Cornelia Huck wrote: > > If we're writing a specification, that's really a MAY statement, > > userspace MAY issue a reset to abort the RESUMING process and return > > the device to RUNNING. They MAY also write the device_state directly, > > which MAY return an error depending on various factors such as whether > > data has been written to the migration state and whether that data is > > complete. If a failed transitions results in an ERROR device_state, > > the user MUST issue a reset in order to return it to a RUNNING state > > without closing the interface. > > Are we actually writing a specification? If yes, we need to be more > clear on what is mandatory (MUST), advised (SHOULD), or allowed > (MAY). If I look at the current proposal, I'm not sure into which > category some of the statements fall. I deliberately didn't use such formal language because this is far from what I'd consider an acceptable spec. It is more words about how things work and some kind of basis for agreement between user and kernel. Under Linus's "don't break userspace" guideline whatever userspace ends up doing becomes the spec the kernel is wedded to, regardless of what we write down here. Which basically means whatever mlx5 and qemu does after we go forward is the definitive spec and we cannot change qemu in a way that is incompatible with mlx5 or introduce a new driver that is incompatible with qemu. Jason ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-06 17:34 ` Jason Gunthorpe @ 2021-12-06 18:06 ` Cornelia Huck 2021-12-06 19:19 ` Jason Gunthorpe 0 siblings, 1 reply; 29+ messages in thread From: Cornelia Huck @ 2021-12-06 18:06 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Dec 06, 2021 at 05:03:00PM +0100, Cornelia Huck wrote: > >> > If we're writing a specification, that's really a MAY statement, >> > userspace MAY issue a reset to abort the RESUMING process and return >> > the device to RUNNING. They MAY also write the device_state directly, >> > which MAY return an error depending on various factors such as whether >> > data has been written to the migration state and whether that data is >> > complete. If a failed transitions results in an ERROR device_state, >> > the user MUST issue a reset in order to return it to a RUNNING state >> > without closing the interface. >> >> Are we actually writing a specification? If yes, we need to be more >> clear on what is mandatory (MUST), advised (SHOULD), or allowed >> (MAY). If I look at the current proposal, I'm not sure into which >> category some of the statements fall. > > I deliberately didn't use such formal language because this is far > from what I'd consider an acceptable spec. It is more words about how > things work and some kind of basis for agreement between user and > kernel. We don't really need formal language, but there are too many unclear statements, as the discussion above showed. Therefore my question: What are we actually writing? Even if it is not a formal specification, it still needs to be clear. > > Under Linus's "don't break userspace" guideline whatever userspace > ends up doing becomes the spec the kernel is wedded to, regardless of > what we write down here. 
All the more important that we actually agree before this is merged! I don't want choices hidden deep inside the mlx5 driver dictating what other drivers should do, it must be reasonably easy to figure out (including what is mandatory, and what is flexible.) > Which basically means whatever mlx5 and qemu does after we go forward > is the definitive spec and we cannot change qemu in a way that is > incompatible with mlx5 or introduce a new driver that is incompatible > with qemu. TBH, I'm not too happy with the current QEMU state, either. We need to take a long, hard look first and figure out what we need to do to make the QEMU support non-experimental. We're discussing a complex topic here, and we really don't want to perpetuate an unclear uAPI. This is where my push for more precise statements is coming from. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-06 18:06 ` Cornelia Huck @ 2021-12-06 19:19 ` Jason Gunthorpe 2021-12-07 11:16 ` Cornelia Huck 0 siblings, 1 reply; 29+ messages in thread From: Jason Gunthorpe @ 2021-12-06 19:19 UTC (permalink / raw) To: Cornelia Huck Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote: > We're discussing a complex topic here, and we really don't want to > perpetuate an unclear uAPI. This is where my push for more precise > statements is coming from. I appreciate that, and I think we've made a big effort toward that direction. Can we have some crisp feedback which statements need SHOULD/MUST/MUST NOT and come to something? The world needs to move forward, we can't debate this endlessly forever. It is already another 6 weeks past since the last mlx5 driver posting. I don't feel like this is really converging, you two are the maintainers, can you please drive to something acceptable, whatever it is? Jason ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-06 19:19 ` Jason Gunthorpe @ 2021-12-07 11:16 ` Cornelia Huck 2021-12-07 15:51 ` Jason Gunthorpe 0 siblings, 1 reply; 29+ messages in thread From: Cornelia Huck @ 2021-12-07 11:16 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote: > >> We're discussing a complex topic here, and we really don't want to >> perpetuate an unclear uAPI. This is where my push for more precise >> statements is coming from. > > I appreciate that, and I think we've made a big effort toward that > direction. > > Can we have some crisp feedback which statements need SHOULD/MUST/MUST > NOT and come to something? I'm not sure what I should actually comment on, some general remarks: - If we consider a possible vfio-ccw implementation that will quiesce the device and not rely on tracking I/O, we need to make the parts that talk about tracking non-mandatory. - NDMA sounds like something that needs to be non-mandatory as well. - The discussion regarding bit group changes has me confused. You seem to be saying that mlx5 needs that, so it needs to have some mandatory component; but are actually all devices able to deal with those bits changing as a group? - In particular, the flow needs definitive markings about what is mandatory to implement, what is strongly suggested, and what is optional. It is unclear to me what is really expected, and what is simply one way to implement it. > > The world needs to move forward, we can't debate this endlessly > forever. It is already another 6 weeks past since the last mlx5 driver > posting. 6 weeks is already blazingly fast in any vfio migration discussion. /s Remember that we have other things to do as well, not all of which will be visible to you. 
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-07 11:16 ` Cornelia Huck @ 2021-12-07 15:51 ` Jason Gunthorpe 2021-12-07 16:30 ` Cornelia Huck 2021-12-08 16:06 ` Alex Williamson 0 siblings, 2 replies; 29+ messages in thread From: Jason Gunthorpe @ 2021-12-07 15:51 UTC (permalink / raw) To: Cornelia Huck Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote: > On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote: > > > >> We're discussing a complex topic here, and we really don't want to > >> perpetuate an unclear uAPI. This is where my push for more precise > >> statements is coming from. > > > > I appreciate that, and I think we've made a big effort toward that > > direction. > > > > Can we have some crisp feedback which statements need SHOULD/MUST/MUST > > NOT and come to something? > > I'm not sure what I should actually comment on, some general remarks: You should comment on the paragraphs that prevent you from adding a reviewed-by. > - If we consider a possible vfio-ccw implementation that will quiesce > the device and not rely on tracking I/O, we need to make the parts > that talk about tracking non-mandatory. I'm not sure what you mean by 'tracking I/O'? I thought we were good on ccw? > - NDMA sounds like something that needs to be non-mandatory as well. I agree, Alex are we agreed now ? > - The discussion regarding bit group changes has me confused. You seem > to be saying that mlx5 needs that, so it needs to have some mandatory > component; but are actually all devices able to deal with those bits > changing as a group? Yes, all devices can support this as written. 
If you think of the device_state as initiating some action per bit group then we have multiple bit groups that can change at once and thus multiple actions that can be triggered. All devices must support userspace initiating actions one by one in a manner that supports the reference flow. Thus, every driver can decompose a request for multiple actions into an ordered list of single actions and execute those actions exactly as if userspace had issued single actions. The precedence follows the reference flow so that any conflicts resolve along the path that already has defined behaviors. I honestly don't know why this is such a discussion point, beyond being a big oversight of the original design. > - In particular, the flow needs definitive markings about what is > mandatory to implement, what is strongly suggested, and what is > optional. It is unclear to me what is really expected, and what is > simply one way to implement it. I'm not sure either, this hasn't been clear at all to me. Alex has asked for things to be general and left undefined, but we need some minimum definition to actually implement driver/VMM interoperability for what we need to do. Really what qemu does will set the mandatory to implement. > > The world needs to move forward, we can't debate this endlessly > forever. It is already another 6 weeks past since the last mlx5 driver > posting. > > 6 weeks is already blazingly fast in any vfio migration discussion. /s We've invested a lot of engineer months in this project, it is disrespectful to all of this effort to leave us hanging with no clear path forward and no actionable review comments after so much time. This is another kernel cycle lost. > Remember that we have other things to do as well, not all of which will > be visible to you. As do we all, but your name is in the maintainer file, and that comes with some responsibility. Jason ^ permalink raw reply [flat|nested] 29+ messages in thread
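The decomposition described in the message above can be sketched in userspace terms. This is an editor's illustration, not driver code: the RUNNING/SAVING/RESUMING bit values are the ones in the existing uAPI header, the NDMA value is the one proposed in this RFC, and the exact precedence list is an assumption for demonstration.

```python
# Illustrative decomposition of a multi-bit device_state change into an
# ordered list of single-bit actions, as described above. Bit values for
# RUNNING/SAVING/RESUMING match include/uapi/linux/vfio.h; NDMA is the
# value proposed in this RFC. The precedence list is an assumption.
RUNNING, SAVING, RESUMING, NDMA = 1 << 0, 1 << 1, 1 << 2, 1 << 3

# One possible precedence: clear bits first (toward 0), then set bits.
PRECEDENCE = [
    (NDMA, False), (SAVING, False), (RUNNING, False), (RESUMING, False),
    (RESUMING, True), (RUNNING, True), (SAVING, True), (NDMA, True),
]

def decompose(old_state, new_state):
    """Return the intermediate device_state values a driver would step
    through, exactly as if userspace had issued each single-bit change."""
    steps, state = [], old_state
    for bit, want_set in PRECEDENCE:
        if (old_state ^ new_state) & bit and bool(new_state & bit) == want_set:
            state = state | bit if want_set else state & ~bit
            steps.append(state)
    return steps

# RESUMING -> SAVING|!RUNNING: RESUMING is cleared before SAVING is set,
# so the invalid RESUMING|SAVING combination never becomes visible.
print(decompose(RESUMING, SAVING))  # prints [0, 2]
```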
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-07 15:51 ` Jason Gunthorpe @ 2021-12-07 16:30 ` Cornelia Huck 2021-12-07 17:00 ` Jason Gunthorpe 0 siblings, 1 reply; 29+ messages in thread From: Cornelia Huck @ 2021-12-07 16:30 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Tue, Dec 07 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote: >> On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: >> >> > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote: >> > >> >> We're discussing a complex topic here, and we really don't want to >> >> perpetuate an unclear uAPI. This is where my push for more precise >> >> statements is coming from. >> > >> > I appreciate that, and I think we've made a big effort toward that >> > direction. >> > >> > Can we have some crisp feedback which statements need SHOULD/MUST/MUST >> > NOT and come to something? >> >> I'm not sure what I should actually comment on, some general remarks: > > You should comment on the paragraphs that prevent you from adding a > reviewed-by. On which copy? There have been updates, and I haven't found a concise email to reply to. > >> - If we consider a possible vfio-ccw implementation that will quiesce >> the device and not rely on tracking I/O, we need to make the parts >> that talk about tracking non-mandatory. > > I'm not sure what you mean by 'tracking I/O'? MMIO. > > I thought we were good on ccw? We are, if we don't make things mandatory that are not needed for non-MMIO. > >> - NDMA sounds like something that needs to be non-mandatory as well. > > I agree, Alex are we agreed now ? > >> - The discussion regarding bit group changes has me confused.
You seem >> to be saying that mlx5 needs that, so it needs to have some mandatory >> component; but are actually all devices able to deal with those bits >> changing as a group? > > Yes, all devices can support this as written. > > If you think of the device_state as initiating some action per bit > group then we have multiple bit groups that can change at once and thus > multiple actions that can be triggered. > > All devices must support userspace initiating actions one by one in a > manner that supports the reference flow. > > Thus, every driver can decompose a request for multiple actions into > an ordered list of single actions and execute those actions exactly as > if userspace had issued single actions. > > The precedence follows the reference flow so that any conflicts > resolve along the path that already has defined behaviors. Well, yes. I'm just wondering where bit groups are coming in then. That's where I'm confused (by the discussion). > > I honestly don't know why this is such a discussion point, beyond > being a big oversight of the original design. > >> - In particular, the flow needs definitive markings about what is >> mandatory to implement, what is strongly suggested, and what is >> optional. It is unclear to me what is really expected, and what is >> simply one way to implement it. > > I'm not sure either, this hasn't been clear at all to me. Alex has > asked for things to be general and left undefined, but we need some > minimum definition to actually implement driver/VMM interoperability > for what we need to do. > > Really what qemu does will set the mandatory to implement. We really, really need to revisit QEMU before that. I'm staring at the code and I'm not quite sure if that really is what we want. We might have been too tired after years of review cycles when merging that. > >> > The world needs to move forward, we can't debate this endlessly >> > forever. It is already another 6 weeks past since the last mlx5 driver >> > posting.
>> >> 6 weeks is already blazingly fast in any vfio migration discussion. /s > > We've invested a lot of engineer months in this project, it is > disrespectful to all of this effort to leave us hanging with no clear > path forward and no actionable review comments after so much > time. This is another kernel cycle lost. Well... it's not only you who are spending time on this. I'm trying to follow the discussion, which is not easy, and try to come up with feedback, which is not easy, either. This is using up a huge chunk of my time. Compared with the long and tedious discussions that led to the initial code being merged, we're really going very fast. And expecting people to drop everything and make a definite decision quickly when there are still open questions on a complex topic does not strike me as particularly respectful, either. > >> Remember that we have other things to do as well, not all of which will > >> be visible to you. > > As do we all, but your name is in the maintainer file, and that comes > with some responsibility. It, however, does not mean that someone listed in MAINTAINERS must immediately deal with anything that is thrown at them to the detriment of everything else. It *especially* does not mean that someone listed in MAINTAINERS is neglecting their responsibilities if things are not going as well as you'd hope them to go. [There is a reason why I have dropped out of some maintainership entries recently, the asymmetry of people requiring feedback and merging and people actually giving feedback and merging seems to have gotten worse over the last years. I can certainly delist myself as a vfio reviewer as well, and while that would certainly help my wellbeing, I'm not sure whether that is what you want.] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-07 16:30 ` Cornelia Huck @ 2021-12-07 17:00 ` Jason Gunthorpe 0 siblings, 0 replies; 29+ messages in thread From: Jason Gunthorpe @ 2021-12-07 17:00 UTC (permalink / raw) To: Cornelia Huck Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Tue, Dec 07, 2021 at 05:30:29PM +0100, Cornelia Huck wrote: > On Tue, Dec 07 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote: > >> On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > >> > >> > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote: > >> > > >> >> We're discussing a complex topic here, and we really don't want to > >> >> perpetuate an unclear uAPI. This is where my push for more precise > >> >> statements is coming from. > >> > > >> > I appreciate that, and I think we've made a big effort toward that > >> > direction. > >> > > >> > Can we have some crisp feedback which statements need SHOULD/MUST/MUST > >> > NOT and come to something? > >> > >> I'm not sure what I should actually comment on, some general remarks: > > > > You should comment on the paragraphs that prevent you from adding a > > reviewed-by. > > On which copy? There have been updates, and I haven't found a concise > email to reply to. I'll send a v3 > >> - If we consider a possible vfio-ccw implementation that will quiesce > >> the device and not rely on tracking I/O, we need to make the parts > >> that talk about tracking non-mandatory. > > > > I'm not sure what you mean by 'tracking I/O'? > > MMIO. If there is no MMIO there is nothing to worry about, right? S390 should define its own rules around quieting whatever non-MMIO control path exists between the VM and the mdev. > > The precedence follows the reference flow so that any conflicts > > resolve along the path that already has defined behaviors. > > Well, yes.
I'm just wondering where bit groups are coming in > then. That's where I'm confused (by the discussion). Bit groups are things like "SAVING | RUNNING". Bit SAVING does not have any meaning on its own, it is always qualified by running. > >> 6 weeks is already blazingly fast in any vfio migration discussion. /s > > > > We've invested a lot of engineer months in this project, it is > > disrespectful to all of this effort to leave us hanging with no clear > > path forward and no actionable review comments after so much > > time. This is another kernel cycle lost. > > Well... it's not only you who are spending time on this. I'm trying to > follow the discussion, which is not easy, and try to come up with > feedback, which is not easy, either. This is using up a huge chunk of my > time. Compared with the long and tedious discussions that led to the > initial code being merged, we're really going very fast. Well, yes, we are serious about getting this merged and we are fully committed to upstream quality and good design. In the end we all agree - we are only working through all the edges and special cases that always existed here but were never fully discussed. > And expecting people to drop everything and make a definite decision > quickly when there are still open questions on a complex topic does > not strike me as particularly respectful, either. I don't even know what the list of open questions is anymore! As far as I know everything in the etherpad is addressed by the documentation I prepared. > >> Remember that we have other things to do as well, not all of which will > >> be visible to you. > > > > As do we all, but your name is in the maintainer file, and that comes > > with some responsibility. > It, however, does not mean that someone listed in MAINTAINERS must > immediately deal with anything that is thrown at them to the detriment > of everything else. I don't see this as immediately at all, please don't take it as such a demand.
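The "bit groups" point above (SAVING has no meaning on its own, it is always qualified by RUNNING) maps directly onto the validity rule already in the uAPI header. A reader's sketch, not normative text:

```python
# Sketch of the device_state validity rule from include/uapi/linux/vfio.h
# (the VFIO_DEVICE_STATE_VALID() macro): RESUMING is only valid alone,
# while SAVING takes its meaning from the accompanying RUNNING bit.
RUNNING, SAVING, RESUMING = 1 << 0, 1 << 1, 1 << 2

def state_valid(state):
    # RESUMING set -> no other bits may be set; otherwise any combination.
    return state == RESUMING if state & RESUMING else True

assert state_valid(SAVING | RUNNING)        # pre-copy: running and saving
assert state_valid(SAVING)                  # stop-and-copy: stopped, saving
assert not state_valid(RESUMING | SAVING)   # invalid per the macro
```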
We posted v1 in September, it is now December, we want to know what we need to do to get this merged. I've asked several times now, and I'm not seeing a clear answer :( My suggestion is we agree on the document I wrote as a general understanding of what we are doing (not so much a spec) and we proceed. We now have a lot of experience with physical device migration here, I think it is reasonable to extend some trust that we are implementing something that is reasonable as a uAPI. > [There is a reason why I have dropped out of some maintainership entries > recently, the asymmetry of people requiring feedback and merging and > people actually giving feedback and merging seems to have gotten worse > over the last years. I can certainly delist myself as a vfio reviewer as > well, and while that would certainly help my wellbeing, I'm not sure > whether that is what you want.] No, I do not want that, and am not suggesting that. Regards, Jason ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-07 15:51 ` Jason Gunthorpe 2021-12-07 16:30 ` Cornelia Huck @ 2021-12-08 16:06 ` Alex Williamson 2021-12-08 20:27 ` Jason Gunthorpe 1 sibling, 1 reply; 29+ messages in thread From: Alex Williamson @ 2021-12-08 16:06 UTC (permalink / raw) To: Jason Gunthorpe Cc: Cornelia Huck, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Tue, 7 Dec 2021 11:51:45 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote: > > On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote: > > > > > >> We're discussing a complex topic here, and we really don't want to > > >> perpetuate an unclear uAPI. This is where my push for more precise > > >> statements is coming from. > > > > > > I appreciate that, and I think we've made a big effort toward that > > > direction. > > > > > > Can we have some crisp feedback which statements need SHOULD/MUST/MUST > > > NOT and come to something? > > > > I'm not sure what I should actually comment on, some general remarks: > > You should comment on the paragraphs that prevent you from adding a > reviewed-by. > > > - If we consider a possible vfio-ccw implementation that will quiesce > > the device and not rely on tracking I/O, we need to make the parts > > that talk about tracking non-mandatory. > > I'm not sure what you mean by 'tracking I/O'? > > I thought we were good on ccw? > > > - NDMA sounds like something that needs to be non-mandatory as well. > > I agree, Alex are we agreed now ? No. When last we left our thread, you seemed to be suggesting QEMU maintains two IOMMU domains, ie. containers, the first of which would include p2p mappings for all PCI devices, the second would include no p2p mappings. Device supporting NDMA get attached to the former, non-NDMA devices the latter. 
So some devices can access all devices via p2p DMA, other devices can access none. Are there any bare metal systems that expose such asymmetric p2p constraints? I'm not inclined to invent new p2p scenarios that only exist in VMs. In addition to creating this asymmetric topology, forcing QEMU to maintain two containers not only increases the overhead, but doubles the locked memory requirements for QEMU since our current locked memory accounting is unfortunately per container. Then we need to also consider that multi-device groups exist where a group can only be attached to one container and also vIOMMU cases where presumably we'd only get these dual-containers when multiple groups are attached to a container. Maybe also worth noting that we cannot atomically move a device between containers, due to both vfio and often IOMMU constraints afaik. So it still seems like the only QEMU policy that we could manage to document and support would require that non-mandatory NDMA support implies that migration cannot be enabled by default for any vfio device and that enabling migration sets in motion a binary policy regarding p2p mappings across the VM. I'm still not convinced how supportable that is, but I can at least imagine explicit per device options that need to align. I don't know if lack of NDMA on ccw was Connie's reasoning for making NDMA non-mandatory, but it seems like NDMA is only relevant to buses that support DMA, so AIUI it would be just as valid for ccw devices to report NDMA as a no-op. > > - The discussion regarding bit group changes has me confused. You seem > > to be saying that mlx5 needs that, so it needs to have some mandatory > > component; but are actually all devices able to deal with those bits > > changing as a group? > > Yes, all devices can support this as written. > > If you think of the device_state as initiating some action pre bit > group then we have multiple bit group that can change at once and thus > multiple actions that can be triggered. 
> > All devices must support userspace initiating actions one by one in a > manner that supports the reference flow. > > Thus, every driver can decompose a request for multiple actions into > an ordered list of single actions and execute those actions exactly as > if userspace had issued single actions. > > The precedence follows the reference flow so that any conflicts > resolve along the path that already has defined behaviors. > > I honestly don't know why this is such a discussion point, beyond > being a big oversight of the original design. In my case, because it's still not clearly a universal algorithm, yet it's being proposed as one. We already discussed that {!}NDMA placement is fairly arbitrary and looking at v3 I'm wondering how a RESUMING -> SAVING|!RUNNING transition works. For an implementation that shares a buffer between SAVING and RESUMING, the ordering seems to suggest the SAVING action has precedence over the !RESUMING action, which is clearly wrong, but for an implementation where migration data is read or written to the device directly, the ordering is not such a concern. > > - In particular, the flow needs definitive markings about what is > > mandatory to implement, what is strongly suggested, and what is > > optional. It is unclear to me what is really expected, and what is > > simply one way to implement it. > > I'm not sure either, this hasn't been clear at all to me. Alex has > asked for things to be general and left undefined, but we need some > minimum definition to actually implement driver/VMM interoperability > for what we need to do. > > Really what qemu does will set the mandatory to implement. And therefore anything that works with QEMU is correct and how a driver gets to that correct result can be implementation specific, depending on factors like whether device data is buffered or the device is accessed directly. We can have a valid, interoperable uAPI without constraining ourselves to a specific implementation. 
Largely I think that trying to impose an implementation as the specification is the source of our friction. > > > The world needs to move forward, we can't debate this endlessly > > > forever. It is already another 6 weeks past since the last mlx5 driver > > > posting. > > > > 6 weeks is already blazingly fast in any vfio migration discussion. /s > > We've invested a lot of engineer months in this project, it is > disrespectful to all of this effort to leave us hanging with no clear > path forward and no actionable review comments after so much > time. This is another kernel cycle lost. > > > Remember that we have other things to do as well, not all of which will > > be visible to you. > > As do we all, but your name is in the maintainer file, and that comes > with some responsibility. This is a bit infuriating, responding to it at all is probably ill advised. We're all investing a lot of time into this. We're all disappointed how the open source use case of the previous implementation fell apart and nobody else stepped up until now. Rubbing salt in that wound is not helpful or productive. Regardless, this implementation has highlighted gaps in the initial design and it's critical that those known gaps are addressed before we commit to the design with an in-kernel driver. Referring to the notes Connie copied from etherpad, those gaps include uAPI clarification regarding various device states and accesses allowed in those states, definition of a quiescent (NDMA) device state, discussion of per-device dirty state, and documentation such as userspace usage and edge cases. Only the latter items were specifically requested outside of the header and previously provided comments questioned if we're not actually creating contradictory documentation to the uAPI and why clarifications are not applied to the existing uAPI descriptions. 
Personally I'm a bit disappointed to see v3 posted where the diffstat indicates no uAPI updates, so we actually have no formal definition of this NDMA state, nor does it feel like we've really come to a conclusion on that discussion and how it affects userspace. What is this documenting if NDMA is not formally part of the uAPI? More so, it seems we're just trying to push to get a sign-off, demanding specific actions to get there. Isn't that how we got into this situation, approving the uAPI, or in this case documentation, without an in-kernel implementation and vetted userspace? Thanks, Alex ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region 2021-12-08 16:06 ` Alex Williamson @ 2021-12-08 20:27 ` Jason Gunthorpe 0 siblings, 0 replies; 29+ messages in thread From: Jason Gunthorpe @ 2021-12-08 20:27 UTC (permalink / raw) To: Alex Williamson Cc: Cornelia Huck, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas On Wed, Dec 08, 2021 at 09:06:47AM -0700, Alex Williamson wrote: > On Tue, 7 Dec 2021 11:51:45 -0400 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote: > > > On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote: > > > > > > > >> We're discussing a complex topic here, and we really don't want to > > > >> perpetuate an unclear uAPI. This is where my push for more precise > > > >> statements is coming from. > > > > > > > > I appreciate that, and I think we've made a big effort toward that > > > > direction. > > > > > > > > Can we have some crisp feedback which statements need SHOULD/MUST/MUST > > > > NOT and come to something? > > > > > > I'm not sure what I should actually comment on, some general remarks: > > > > You should comment on the paragraphs that prevent you from adding a > > reviewed-by. > > > > > - If we consider a possible vfio-ccw implementation that will quiesce > > > the device and not rely on tracking I/O, we need to make the parts > > > that talk about tracking non-mandatory. > > > > I'm not sure what you mean by 'tracking I/O'? > > > > I thought we were good on ccw? > > > > > - NDMA sounds like something that needs to be non-mandatory as well. > > > > I agree, Alex are we agreed now ? > > No. When last we left our thread, you seemed to be suggesting QEMU > maintains two IOMMU domains, ie. containers, the first of which would > include p2p mappings for all PCI devices, the second would include no > p2p mappings. 
Devices supporting NDMA get attached to the former, > non-NDMA devices the latter. I think it is another valid solution, yes, as are the ideas you identified. It is also OK what you said earlier, that qemu can just permit a guest to mangle its migration and not do anything. I don't like this, but as policy it is something userspace could reasonably choose to do. The question here, right now, is what should the kernel do? - NDMA mandatory/not? Given we know Huawei cannot implement it, I vote for not mandatory - Should we expose some other flag, eg your 'fragile mmio' to help userspace form a policy? Might be useful, but also can be defined when someone creates a userspace policy that requires it. Either way the kernel is just reflecting HW capability to userspace, I'm not seeing how this can be the wrong thing for the kernel to do?? > So some devices can access all devices via p2p DMA, other devices can > access none. Are there any bare metal systems that expose such > asymmetric p2p constraints? I'm not inclined to invent new p2p > scenarios that only exist in VMs. Migration on baremetal is a bit odd, but all the rules are the same. The baremetal must reach the P2P grace state to avoid corruption and if it can rely on internal knowledge of how it operates the devices it can avoid things like NDMA or multiple domains entirely. > In addition to creating this asymmetric topology, forcing QEMU to > maintain two containers not only increases the overhead, Yes, this policy choice has downsides. I outlined a few options where qemu can avoid the overheads. Other choices have different downsides. I think part of the reason we keep going around this topic is I'm offering options that a VMM could do, and I can't much help you decide what matches qemu's policy framework. Frankly I think it is very operator specific. > but doubles the locked memory requirements for QEMU since our > current locked memory accounting is unfortunately per container.
You asked for that to be fixed in iommufd and I've got that almost done now. With iommufd we can now have two domains with only the overhead of the actual io page table entries under the 2nd iommu_domain. When VDPA uses iommufd as well it should allow qemu to be single pin for everything. > Then we need to also consider that multi-device groups exist where a > group can only be attached to one container It is a problem, yes, but again qemu could adopt a policy of not supporting multi-group migration devices, if it wants. > and also vIOMMU cases where presumably we'd only get these > dual-containers when multiple groups are attached to a container. Not sure what this means? vIOMMU should operate both containers in unison, I'd think? > Maybe also worth noting that we cannot atomically move a device > between containers, due to both vfio and often IOMMU constraints > afaik. I don't think we need move, it is a permanent setup? > So it still seems like the only QEMU policy that we could manage to > document and support would require that non-mandatory NDMA support > implies that migration cannot be enabled by default for any vfio device > and that enabling migration sets in motion a binary policy regarding > p2p mappings across the VM. I'm still not convinced how supportable > that is, but I can at least imagine explicit per device options that > need to align. If this is what the qemu community wants, OK. > I don't know if lack of NDMA on ccw was Connie's reasoning for making > NDMA non-mandatory, but it seems like NDMA is only relevant to buses > that support DMA, so AIUI it would be just as valid for ccw devices to > report NDMA as a no-op. ccw seems to support DMA, it calls vfio_pin_pages() after all? That is DMA in this definition. NDMA here was defined in terms of *any* DMA, not just DMA to certain addresses, so I expect ccw will probably say it does not support NDMA for many of the same reasons Huawei can't do it.
So, I think Connie is right to be worried about a direction to require NDMA as a pre-condition for a migration driver. I suspect something like CCW is very much migratable with a method that relies on no mmio touches much like the Huawei driver does. In the S390 case I believe "no mmio touches" means 'whatever that thing inside kvm that causes the device to work' is. BTW, it is not just VFIO, but VDPA too. Currently VDPA cannot participate in the P2P mappings so it acts like the dual domain solution, but with IOMMUFD I expect VDPA to be able to access the P2P mappings and so it must also support NDMA somehow. Future things like the "netgpu" patches make it reasonable to combine VDPA and P2P in a guest. > > I honestly don't know why this is such a discussion point, beyond > > being a big oversight of the original design. > > In my case, because it's still not clearly a universal algorithm, yet > it's being proposed as one. We already discussed that {!}NDMA > placement is fairly arbitrary and looking at v3 I'm wondering how a > RESUMING -> SAVING|!RUNNING transition works. Ok, I've been struggling since I wrote this to figure out how to validate the precedence algorithmically, and I think I see it now. Minimally, the precedence must be ordered so we never transit through an invalid state. Ie if userspace asks for RESUMING -> SAVING | !RUNNING Then the start and end states are valid. However, we cannot transit through the invalid state 'RESUMING | SAVING | !RUNNING' on the way to get there. It means the driver cannot do RESUMING first, and is really the underlying reason why we found we needed precedence in the first place. So, my original design idea is just *completely* wrong. Since we allow 0->ANY the better solution is to go toward 0 first, then go away from 0: - !NDMA - !SAVING - !RUNNING - !RESUMING - RESUMING - RUNNING - SAVING - NDMA and even this I can't tell by inspection if it is OK or not - need a program to explore all combinations. 
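The "program to explore all combinations" could look like the sketch below. It covers only the three bits in the current uAPI (NDMA is left out since its validity rules are not yet part of the header), with the validity test taken from the VFIO_DEVICE_STATE_VALID() macro:

```python
# Brute-force check of the "toward 0 first, then away from 0" ordering
# proposed above, over the three device_state bits in the current uAPI.
RUNNING, SAVING, RESUMING = 1 << 0, 1 << 1, 1 << 2
CLEAR_ORDER = [SAVING, RUNNING, RESUMING]   # first go toward 0
SET_ORDER = [RESUMING, RUNNING, SAVING]     # then go away from 0

def valid(state):
    # Mirrors VFIO_DEVICE_STATE_VALID(): RESUMING is only valid alone.
    return state == RESUMING if state & RESUMING else True

def intermediates(old, new):
    state, seen = old, []
    for bit in CLEAR_ORDER:
        if state & bit and not new & bit:
            state &= ~bit
            seen.append(state)
    for bit in SET_ORDER:
        if new & bit and not state & bit:
            state |= bit
            seen.append(state)
    return seen

# Explore every valid old -> new pair and collect any invalid step.
violations = [(old, new, step)
              for old in range(8) if valid(old)
              for new in range(8) if valid(new)
              for step in intermediates(old, new) if not valid(step)]
print("violations:", violations)  # prints: violations: []
```

For the RESUMING -> SAVING|!RUNNING case discussed above, the intermediate states come out as [0, SAVING], so the invalid RESUMING|SAVING combination is never visible.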
Basically, this is too hard. I can possibly figure it out now, but if
someone wants to add another state someday they *will* screw it up.

How do you feel about simply forbidding this corner case entirely?
Require that new_state ^ old_state is one-hot. I have to check, but I
think this is incompatible with current qemu. Maybe we can allow &
document the few combinations qemu uses? A core function can enforce
this consistently for drivers. ??

(frankly, I have decided I absolutely hate this device_state bit field
as a uAPI choice, the fact we have spent so much time on this is just
*grrrrrr*)

> accessed directly. We can have a valid, interoperable uAPI without
> constraining ourselves to a specific implementation. Largely I think
> that trying to impose an implementation as the specification is the
> source of our friction.

I agree, but please accept it is not easy to understand what is spec
and what is qemu. It is part of why I want to write this down.

> This is a bit infuriating, responding to it at all is probably ill
> advised. We're all investing a lot of time into this. We're all
> disappointed how the open source use case of the previous
> implementation fell apart and nobody else stepped up until now.
> Rubbing salt in that wound is not helpful or productive.

Sorry, I'm not trying to rub salt in a wound - but it is maddening we
are so far down this path and we seem to be back at square one and
debating exactly what the existing uAPI even is.

> Regardless, this implementation has highlighted gaps in the initial
> design and it's critical that those known gaps are addressed before we
> commit to the design with an in-kernel driver. Referring to the notes
> Connie copied from etherpad, those gaps include uAPI clarification
> regarding various device states and accesses allowed in those states,
> definition of a quiescent (NDMA) device state, discussion of per-device
> dirty state, and documentation such as userspace usage and edge
> cases.
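The "new_state ^ old_state is one-hot" rule proposed here is a one-liner to
enforce in a common core function. A sketch of what such a check could look
like (the function name is invented; nothing like it exists in the kernel
today):

```python
def transition_allowed(old_state: int, new_state: int) -> bool:
    """Allow only single-bit device_state changes: the XOR of old and
    new must have exactly one bit set (one-hot), ruling out both no-op
    writes and multi-bit jumps."""
    changed = old_state ^ new_state
    return changed != 0 and changed & (changed - 1) == 0

# RUNNING (0b001) -> RUNNING|SAVING (0b011): one bit toggled, allowed.
print(transition_allowed(0b001, 0b011))  # → True
# RESUMING (0b100) -> SAVING (0b010): two bits change at once, rejected.
print(transition_allowed(0b100, 0b010))  # → False
```

With such a gate in the core, the whole precedence question disappears:
userspace must itself decompose a multi-bit change into a legal sequence of
single-bit writes.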
Okay, can you summarize your view of where we are on all these points?
I really don't know anymore. I think the document addresses all of
them.

> Only the latter items were specifically requested outside of the header
> and previously provided comments questioned if we're not actually
> creating contradictory documentation to the uAPI and why clarifications
> are not applied to the existing uAPI descriptions.

As I said, it is simply too much text, there is no simple fix to make
the header file documentation completely clear. If you have some idea
how to fix it please share.

I don't think the new docs are contradictory because the header file
doesn't really say that much.. At least I haven't noticed anything.
Yishai's patch to add NDMA should add it to the header comment a
little bit, I'll check on that.

> Personally I'm a bit disappointed to see v3 posted where the diffstat
> indicates no uAPI updates, so we actually have no formal definition of
> this NDMA state,

I didn't realize this is something you felt was important for this RFC
patch. The purpose of the RFC patch is to get some overall mutual
agreement on the uAPI protocol as discussed in the RST text.

The non-RFC version of this RST will come with the next mlx5 driver v6
posting and the implementation of NDMA for mlx5 and all the cap bits
Yishai built for it. This is already coded - we have only refrained
from posting a v6 out of respect for everyone's time to re-review all
the mlx5 code again if we are still debating big topics in this text.

If you recall, I said we'd like to go ahead with mlx5 as-is and you
were very concerned about the implication of the NDMA problem. Thus
this document was prepared to explain our view of NDMA and more, plus
we went and updated mlx5 to fully implement it so we will not come
back to this topic later in the kernel.

> get there. Isn't that how we got into this situation, approving the
> uAPI, or in this case documentation, without an in-kernel
> implementation and vetted userspace?

Thanks, I'm certainly not proposing we do that, please don't
misunderstand.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-03 18:06 ` Alex Williamson
  2021-12-06 16:03 ` Cornelia Huck
@ 2021-12-06 19:15 ` Jason Gunthorpe
  2021-12-07 10:50 ` Cornelia Huck
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-06 19:15 UTC (permalink / raw)
To: Alex Williamson
Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
    Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:

> > Yes, not sure how you get to a non-issue though? If the IOMMU map is
> > present the huawei device can be attacked by a hostile VM and forced
> > to exit NDMA. All this takes is any two DMA capable devices to be
> > plugged in?
>
> I'm confused by your idea of a "hostile VM". Devices within a VM can
> only DMA to memory owned by the VM. So the only hostile VM that can
> attack this device is the VM that owns the device in the first place.
> Meanwhile the VMM is transitioning all devices to the NDMA state, so
> the opportunity to poke one device from another device within the same
> VM is pretty limited. And for what gain? Is this a DoS opportunity
> that the guest might trigger the migration to abort or fail?

I'm quite worried about what could be possible if the VM is able to
impact the hypervisor. Something like migration is a hypervisor
activity and it should be fully isolated.

If we are certain that the only consequence is that the VM corrupts
itself, then I agree that is OK, but I've seen enough crazy security
stuff to be skeptical we are smart enough to know that :(

> I suggest this might be a non-issue for hns because I understand it to
> not support any sort of p2p, therefore this DMA write to the device
> while it's under mediation would never happen in practice.

It has nothing to do with the huawei device, any DMA capable device
could be instructed to generate a P2P write to the huawei device.
> > This is a good idea, if we want to make huawei support NDMA then
> > flagging it to never be in the iommu map in the first place is a great
> > solution. Then they can use CPU mmio trapping to get the rest of the
> > way.
>
> NB, this flag would only be a hint to the VMM, the vfio IOMMU backend
> would need to be involved if we intended to prevent such mappings, but
> that seems like overkill if we're expecting userspace to follow the
> rules to get working migration and migration drivers have sufficient
> robustness to maintain device isolation.

I'm fine with hint. Secure userspace needs to follow the security
rules, the kernel doesn't need to nanny a VMM, AFAIK?

> > OK, I don't read it like that. It seems OK to have a performance hit
> > in NDMA since it is only a short grace state.
>
> Is it the length of time we're in the NDMA state or the utility of the
> NDMA state itself? If we're telling the device "no more outbound DMA,
> no more interrupts", then it doesn't seem super relevant for the
> majority of use cases what happens to the MMIO latency.

Actually, I wrote down that the P2P grace state should have
!VCPU_RUNNING, so we don't even need to worry about anything else
here. Just remove it from the IOMMU map. PRI is the only wrinkle, but
I think we have to insist on NDMA to do vPRI, and the huawei device
doesn't support PRI anyhow.

> > OTOH "supportable" qemu could certainly make the default choice to
> > require devices for simplicity.
>
> I get a bit lost every time I try to sketch out how QEMU would
> implement it. Forgive the stream of consciousness and rhetorical
> discussion below...

IMHO the simple solution without ugly edges: qemu maintains two IOMMU
domains on demand:

1) has all the MMIO maps for all PCI devices.
2) has no MMIO maps for any PCI devices.

A device supporting NDMA gets assigned to #1
A device without NDMA gets assigned to #2

Some qemu flag 'enable_p2p=off' causes all devices to use #2 and
optimizes away the 2nd iommu domain.
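The two-domain scheme above boils down to a per-device placement decision.
A sketch, where everything (the flag names, the string domain handles) is
invented for illustration and corresponds to no real QEMU option or vfio
interface:

```python
def pick_domain(supports_ndma: bool, enable_p2p: bool) -> str:
    """Place a device into one of the two on-demand IOMMU domains:
    'domain#1' carries all the MMIO (P2P) maps, 'domain#2' carries none."""
    if not enable_p2p:
        # enable_p2p=off: everyone shares #2 and #1 is optimized away.
        return "domain#2"
    # Only NDMA-capable devices may sit behind the P2P-mapped domain.
    return "domain#1" if supports_ndma else "domain#2"

print(pick_domain(supports_ndma=True, enable_p2p=True))    # → domain#1
print(pick_domain(supports_ndma=False, enable_p2p=True))   # → domain#2
```

Because the decision depends only on per-device capability and one global
flag, hotplug order never matters, which is the point of the scheme.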
Thus all combinations of devices can be hotplugged in/out in any order
and nothing breaks.

> - Does it make sense that a device itself can opt-out of p2p mappings?

Bit strange, but yes it kind of does make sense. It is more of a
'device strictly requires quiet MMIO when !RUNNING' flag. Which means
the user space process must guarantee not to touch the MMIO through
any means, including CPU, VCPU or IOMMU.

Combined with the above, the special flag would allow such devices to
join the #2 IOMMU domain and reduces cases where #1 != #2

> - Can a user know which devices will require enable_p2p=off in order
>   to set migration=on? "Read the error log" is a poor user experience
>   and difficult hurdle for libvirt.

I'm sure we can add something in sysfs. There is a patch series adding
a cdev for the vfio_device and that comes along with a full sysfs
directory that can hold per-vfio-device information for userspace to
access.

> > Can we go the other way and move more of the uAPI header text here?
>
> Currently our Documentation/ file describes an overview of vfio, the
> high level theory of groups, devices, and iommus, provides a detailed
> userspace usage example, an overview of device registration and ops,
> and a bunch of random spapr implementation notes largely around
> userspace considerations on power platforms.

Currently, but it doesn't have to stay that way.

> So I would generalize that our Documentation/ is currently focused on
> userspace access and interaction with the uAPI. The uAPI itself
> therefore generally provides the detail regarding how a given interface
> is intended to work from an implementation perspective. I'd say the
> proposed document here is mixing both of those in Documentation/, some
> aspects relevant to implementation of the uAPI, others to userspace.
>
> Currently I would clearly say that the uAPI header is the canonical
> source of truth and I think it should continue to be so for describing
> the implementation of the uAPI.
Well, I don't want to have a > 1000 line comment in the uAPI header.

> > Unfortunately more than half of that describes how the data window
> > works, and half of the rest is kind of obvious statements.
>
> How does that rationalize forking implementation details to userspace
> Documentation/ rather than resolving the issues you see with the uAPI
> description?

I object to putting an enormous comment in a header file and using
poor tools to write documentation. If there are small things to update
and correct then please point to them and we can fix it. Otherwise
let's keep them split and try to work to improve the discussion in the
rst in general for all the uapi.

> > In your mind you see generality, in our mind we want to know how to
> > write an interoperable driver and there is no documentation saying
> > how to do that.
>
> I didn't know there was a hive mind over there, maybe that explains
> your quick replies ;)

No :) It has just been a long topic here. You recall the first version
of the series had a very different interpretation of the header file
comment regarding the device_state. It was rather surprising your
remark that the device_state is not a FSM when the header file
document has FSM-like discussions.

> This is exactly the sort of "designed for QEMU implementation"
> inter-operability that I want to avoid. It doesn't take much of a
> crystal ball to guess that gratuitous and redundant device resets
> slow VM instantiation and are a likely target for optimization.

Sorry, but Linus's "don't break userspace" forces us to this world.

It does not matter what is written in text files, only what userspace
actually does and the kernel must accommodate existing userspace going
forward. So once released qemu forms some definitive spec and the
guardrails that limit what we can do going forward.

In part this document is also an effort to describe to kernel devs
what these guardrails are.
> I think though that your reference flow priority depends a lot on your
> implementation that resuming state is stored somewhere and only
> processed on the transition of the RESUMING bit. You're sharing a
> buffer between SAVING and RESUMING. That's implementation, not
> specification. A device may choose to validate and incorporate data
> written to the migration region as it's written. A device may choose
> to expose actual portions of on device memory through the migration
> region during either SAVING or RESTORING. Therefore setting or
> clearing of these bits may not have the data dependencies that drive
> the priority scheme of mlx5.

I've said this before, it doesn't matter what other devices might do.
mlx5 needs to work this way, so least-common-denominator. The only
other option is to declare multi-bit changes as undefined behavior -
that seems dangerous in general.

The reason this is a good ordering is because it matches the ordering
userspace would do normally if it does one bit change at a time.

Therefore, *any* device must be able to implement this order if it
simply decomposes multi-bit changes into the single actions it must
already support, using the priority listed here.

Ideally we'd put this logic in the core code and drivers would just
have callbacks, but I'm not quite sure we see this problem well enough
yet to make the callbacks. Maybe once we have three drivers..

> So why is this the one true implementation?

Because it is what we are going to write down?

> I think implementation and clarification of the actual definition of
> the uAPI should exist in the uAPI header, userspace clarification and
> examples for the consumption of the uAPI should stay here in
> Documentation/, and we should not assume either the QEMU or mlx5
> implementations as the canonical reference for either.

I'm not even sure anymore what you are looking for in order to
advance. Can you please outline exactly what things you want from us?
Like, as a task list?
> > It is informative for the device driver author to understand what
> > device functionality to map to this.
>
> Sounds like we're in agreement, so why does this belong in userspace
> Documentation/?

Because this is where documentation should live, not in a header file.

> The uAPI might need some clarification here, but the only viable
> scenario would seem to be that yes, userspace should continue to poll
> the device so long as it remains in SAVING|RUNNING as internal state
> can continue to change asynchronously from userspace polling. It's
> only when pending_bytes returns zero while !RUNNING that the data
> stream is complete and the device is precluded from deciding any new
> state is available.

Thanks, I updated along these lines.

Jason
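The rule in the last quoted paragraph implies a userspace save loop shaped
roughly like the sketch below. The callables are stand-ins for pread()s of
pending_bytes and the data window on the migration region (invented for
illustration), and the decision to leave pre-copy while pending_bytes
momentarily reads zero is VMM policy, not something the uAPI dictates:

```python
RUNNING = 1 << 0  # v1 uAPI bit value

def save_device_state(read_pending, read_window, get_state, clear_running):
    """Drain the migration data stream.  While SAVING|RUNNING, a zero
    pending_bytes is NOT final -- internal state can change and make more
    data available, so userspace keeps polling.  Only !RUNNING plus
    pending_bytes == 0 ends the stream."""
    chunks = []
    # Pre-copy phase: iterate while the device still runs.  Breaking on a
    # momentary zero is the VMM choosing to move to stop-and-copy.
    while get_state() & RUNNING:
        n = read_pending()
        if n == 0:
            break
        chunks.append(read_window(n))
    clear_running()  # enter SAVING|!RUNNING (stop-and-copy)
    # Now a zero pending_bytes really is the end of the data stream.
    while (n := read_pending()) != 0:
        chunks.append(read_window(n))
    return b"".join(chunks)
```

A real implementation would layer this over pread()/pwrite() at the
migration region's data-section offsets; the shape of the two loops is the
part the quoted clarification pins down.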
* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-06 19:15 ` Jason Gunthorpe
@ 2021-12-07 10:50 ` Cornelia Huck
  2021-12-07 15:37 ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Cornelia Huck @ 2021-12-07 10:50 UTC (permalink / raw)
To: Jason Gunthorpe, Alex Williamson
Cc: Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy,
    Shameer Kolothum, Yishai Hadas

On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:
>
>> This is exactly the sort of "designed for QEMU implementation"
>> inter-operability that I want to avoid. It doesn't take much of a
>> crystal ball to guess that gratuitous and redundant device resets
>> slow VM instantiation and are a likely target for optimization.
>
> Sorry, but Linus's "don't break userspace" forces us to this world.
>
> It does not matter what is written in text files, only what userspace
> actually does and the kernel must accommodate existing userspace going
> forward. So once released qemu forms some definitive spec and the
> guardrails that limit what we can do going forward.

But QEMU support is *experimental*, i.e. if it breaks, you get to keep
the pieces, things may change in incompatible ways. And it is
experimental for good reason!

I don't want something that we don't support in QEMU to lock us into a
bad kernel interface, that's just utterly broken. It would mean that
we must never introduce experimental interfaces in QEMU that may need
some rework of the kernel interface, but need to keep those out of the
tree -- and that can't be in the best interest of implementing things
requiring interaction between the kernel and QEMU.

If you really think we cannot make changes, I vote for declaring the
current interface legacy and not further developed, and introducing a
new, separate one that doesn't carry all of the baggage that this one
does. We could even end up with a QEMU part that looks better!
* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 10:50 ` Cornelia Huck
@ 2021-12-07 15:37 ` Jason Gunthorpe
  2021-12-07 15:56 ` Cornelia Huck
  2021-12-07 16:22 ` Alex Williamson
  0 siblings, 2 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-07 15:37 UTC (permalink / raw)
To: Cornelia Huck
Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
    Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Dec 07, 2021 at 11:50:47AM +0100, Cornelia Huck wrote:
> On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:
> >
> >> This is exactly the sort of "designed for QEMU implementation"
> >> inter-operability that I want to avoid. It doesn't take much of a
> >> crystal ball to guess that gratuitous and redundant device resets
> >> slow VM instantiation and are a likely target for optimization.
> >
> > Sorry, but Linus's "don't break userspace" forces us to this world.
> >
> > It does not matter what is written in text files, only what userspace
> > actually does and the kernel must accommodate existing userspace going
> > forward. So once released qemu forms some definitive spec and the
> > guardrails that limit what we can do going forward.
>
> But QEMU support is *experimental*, i.e. if it breaks, you get to keep
> the pieces, things may change in incompatible ways. And it is
> experimental for good reason!

And we can probably make a breakage exception for this existing
experimental qemu.

My point was going forward: once userspace starts to become deployed,
it doesn't matter what we write in these text files and comments. It
only matters what deployed userspace actually does.
> It would mean that we must never introduce experimental interfaces
> in QEMU that may need some rework of the kernel interface, but need
> to keep those out of the tree -- and that can't be in the best
> interest of implementing things requiring interaction between the
> kernel and QEMU.

In general we should not be merging uAPI to the kernel that is so
incomplete as to be unusable.

I'm sorry, this whole thing from the day the migration stuff was first
merged to include/uapi is a textbook example of how not to do things
in the kernel community.

Jason
* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 15:37 ` Jason Gunthorpe
@ 2021-12-07 15:56 ` Cornelia Huck
  2021-12-07 16:13 ` Jason Gunthorpe
  2021-12-07 16:22 ` Alex Williamson
  1 sibling, 1 reply; 29+ messages in thread
From: Cornelia Huck @ 2021-12-07 15:56 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
    Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Dec 07 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Dec 07, 2021 at 11:50:47AM +0100, Cornelia Huck wrote:
>> On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>> > On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:
>> >
>> >> This is exactly the sort of "designed for QEMU implementation"
>> >> inter-operability that I want to avoid. It doesn't take much of a
>> >> crystal ball to guess that gratuitous and redundant device resets
>> >> slow VM instantiation and are a likely target for optimization.
>> >
>> > Sorry, but Linus's "don't break userspace" forces us to this world.
>> >
>> > It does not matter what is written in text files, only what userspace
>> > actually does and the kernel must accommodate existing userspace going
>> > forward. So once released qemu forms some definitive spec and the
>> > guardrails that limit what we can do going forward.
>>
>> But QEMU support is *experimental*, i.e. if it breaks, you get to keep
>> the pieces, things may change in incompatible ways. And it is
>> experimental for good reason!
>
> And we can probably make a breakage exception for this existing
> experimental qemu.
>
> My point was going forward, once userspace starts to become
> deployed, it doesn't matter what we write in these text files and
> comments. It only matters what deployed userspace actually does.

Absolutely, as soon as QEMU support is made non-experimental, this
needs to be stable.
> >> It would mean that we must never introduce experimental interfaces
> >> in QEMU that may need some rework of the kernel interface, but need
> >> to keep those out of the tree -- and that can't be in the best
> >> interest of implementing things requiring interaction between the
> >> kernel and QEMU.
>
> In general we should not be merging uAPI to the kernel that is so
> incomplete as to be unusable.
>
> I'm sorry, this whole thing from the day the migration stuff was first
> merged to include/uapi is a textbook example how not to do things in
> the kernel community.

Honestly, I regret ever acking this and the QEMU part. Maybe the
history of this and the archived discussions are instructive to others
so that they don't repeat this :/
* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 15:56 ` Cornelia Huck
@ 2021-12-07 16:13 ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-07 16:13 UTC (permalink / raw)
To: Cornelia Huck
Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
    Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Dec 07, 2021 at 04:56:17PM +0100, Cornelia Huck wrote:
> Honestly, I regret ever acking this and the QEMU part. Maybe the
> history of this and the archived discussions are instructive to others
> so that they don't repeat this :/

The good news is we can fix it :)

Jason
* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 15:37 ` Jason Gunthorpe
  2021-12-07 15:56 ` Cornelia Huck
@ 2021-12-07 16:22 ` Alex Williamson
  1 sibling, 0 replies; 29+ messages in thread
From: Alex Williamson @ 2021-12-07 16:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Cornelia Huck, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
    Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, 7 Dec 2021 11:37:43 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Dec 07, 2021 at 11:50:47AM +0100, Cornelia Huck wrote:
> > On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:
> > >
> > >> This is exactly the sort of "designed for QEMU implementation"
> > >> inter-operability that I want to avoid. It doesn't take much of a
> > >> crystal ball to guess that gratuitous and redundant device resets
> > >> slow VM instantiation and are a likely target for optimization.
> > >
> > > Sorry, but Linus's "don't break userspace" forces us to this world.
> > >
> > > It does not matter what is written in text files, only what userspace
> > > actually does and the kernel must accommodate existing userspace going
> > > forward. So once released qemu forms some definitive spec and the
> > > guardrails that limit what we can do going forward.
> >
> > But QEMU support is *experimental*, i.e. if it breaks, you get to keep
> > the pieces, things may change in incompatible ways. And it is
> > experimental for good reason!
>
> And we can probably make a breakage exception for this existing
> experimental qemu.
>
> My point was going forward, once userspace starts to become
> deployed, it doesn't matter what we write in these text files and
> comments. It only matters what deployed userspace actually does.

I think we're losing sight of my concern in designing for QEMU.
The document included a statement that migration driver writers could
rely on userspace performing a device reset prior to entering the
RESUMING device_state because of an unfounded correlation that QEMU
resets the VM on the way to loading device state. Now, if we say QEMU
does this thing and we need to support that usage model, I'm 100% on
board. If we turn it around and say QEMU does this thing, therefore
migration drivers can expect exactly this usage model, full stop,
that's the wrong direction. That is what I'm trying to avoid.

The obvious way to remove any question of breaking userspace is to
simply rev the migration region sub-type. The kernel stops exposing
any v1 sub-types, we don't break any userspaces, userspaces need to be
updated to v2 in order to continue having any functionality. Thanks,

Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2021-12-08 20:27 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-29 14:45 [PATCH RFC v2] vfio: Documentation for the migration region Jason Gunthorpe
2021-11-30 17:26 ` Alex Williamson
2021-11-30 18:59   ` Jason Gunthorpe
2021-11-30 22:35     ` Alex Williamson
2021-12-01  3:14       ` Jason Gunthorpe
2021-12-01  9:54         ` Shameerali Kolothum Thodi
2021-12-01 13:49           ` Jason Gunthorpe
2021-12-01 20:03         ` Alex Williamson
2021-12-01 23:25           ` Jason Gunthorpe
2021-12-02 17:05             ` Cornelia Huck
2021-12-02 17:41               ` Jason Gunthorpe
2021-12-02 17:46                 ` Cornelia Huck
2021-12-03 18:06             ` Alex Williamson
2021-12-06 16:03               ` Cornelia Huck
2021-12-06 17:34                 ` Jason Gunthorpe
2021-12-06 18:06                   ` Cornelia Huck
2021-12-06 19:19                     ` Jason Gunthorpe
2021-12-07 11:16                       ` Cornelia Huck
2021-12-07 15:51                         ` Jason Gunthorpe
2021-12-07 16:30                           ` Cornelia Huck
2021-12-07 17:00                             ` Jason Gunthorpe
2021-12-08 16:06                               ` Alex Williamson
2021-12-08 20:27                                 ` Jason Gunthorpe
2021-12-06 19:15               ` Jason Gunthorpe
2021-12-07 10:50                 ` Cornelia Huck
2021-12-07 15:37                   ` Jason Gunthorpe
2021-12-07 15:56                     ` Cornelia Huck
2021-12-07 16:13                       ` Jason Gunthorpe
2021-12-07 16:22                     ` Alex Williamson