From: Cornelia Huck <cohuck@redhat.com>
To: Yishai Hadas <yishaih@nvidia.com>,
	alex.williamson@redhat.com, bhelgaas@google.com, jgg@nvidia.com,
	saeedm@nvidia.com
Cc: linux-pci@vger.kernel.org, kvm@vger.kernel.org,
	netdev@vger.kernel.org, kuba@kernel.org, leonro@nvidia.com,
	kwankhede@nvidia.com, mgurtovoy@nvidia.com, yishaih@nvidia.com,
	maorg@nvidia.com
Subject: vfio migration discussions (was: [PATCH V2 mlx5-next 00/14] Add mlx5 live migration driver)
Date: Wed, 17 Nov 2021 17:42:58 +0100
Message-ID: <87mtm2loml.fsf@redhat.com>
In-Reply-To: <20211019105838.227569-1-yishaih@nvidia.com>

Ok, here's the contents (as of 2021-11-17 16:30 UTC) of the etherpad at
https://etherpad.opendev.org/p/VFIOMigrationDiscussions -- in the hope
of providing a better starting point for further discussion (I know that
discussions are still ongoing in other parts of this thread; but
frankly, I'm getting a headache trying to follow them, and I think it
would be beneficial to concentrate on the fundamental questions
first...)

VFIO migration: current state and open questions

Current status
  * Linux
    * uAPI has been merged with a8a24f3f6e38 ("vfio: UAPI for migration
    interface for device state") in 5.8
      * no kernel user of the uAPI merged
      * Several out of tree drivers apparently
    * support for mlx5 currently on the list (latest:
    https://lore.kernel.org/all/20211027095658.144468-1-yishaih@nvidia.com/
    with discussion still happening on older versions)
    * support for HiSilicon ACC devices is on the list too. Adds support
    for HiSilicon crypto accelerator VF device live migration. These are
    simple DMA queue based PCIe integrated endpoint devices. No support
    for P2P and doesn't use DMA for migration. <latest:
    https://lore.kernel.org/lkml/20210915095037.1149-1-shameerali.kolothum.thodi@huawei.com/>
  * QEMU
    * basic support added in 5.2, some fixes later
      * support for vfio-pci only so far, still experimental
      ("x-enable-migration") as of 6.2
      * Only tested with out of tree drivers
  * other software?
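
For readers without the header at hand, here is a rough sketch of how the
three device_state bits of the merged uAPI combine. The ST_* names and both
helpers are made up for illustration; the bit values are believed to match
VFIO_DEVICE_STATE_* in <linux/vfio.h> as of 5.8, but please check the header
rather than trusting this.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Local stand-ins for the v1 device_state bits. The values are assumed to
 * match VFIO_DEVICE_STATE_* in <linux/vfio.h>; they are redefined here so
 * that the sketch builds without kernel headers.
 */
#define ST_RUNNING  (1u << 0)
#define ST_SAVING   (1u << 1)
#define ST_RESUMING (1u << 2)
#define ST_MASK     (ST_RUNNING | ST_SAVING | ST_RESUMING)

/* Hypothetical helper: name each bit combination. */
static const char *state_name(uint32_t state)
{
	switch (state & ST_MASK) {
	case 0:                      return "stopped";
	case ST_RUNNING:             return "running";
	case ST_RUNNING | ST_SAVING: return "pre-copy";
	case ST_SAVING:              return "stop-and-copy";
	case ST_RESUMING:            return "resuming";
	default:                     return "invalid or error";
	}
}

/* Hypothetical helper: true when _RUNNING is clear. */
static bool is_not_running(uint32_t state)
{
	return !(state & ST_RUNNING);
}
```

Note that "stopped", "stop-and-copy" and "resuming" all have _RUNNING clear,
which is why the meaning of !RUNNING matters for both the save and the
resume side.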

Problems/open questions
  * Are the status bits currently defined in the uAPI
  (_RESUMING/_SAVING/_RUNNING) sufficient to express all states we need?
  * What does clearing _RUNNING imply? In particular, does it mean that
  the device is frozen, or are some operations still permitted?
  * various points brought up: P2P, SET_IRQS, ... <please summarize :)>:
    * P2P DMA support between devices requires an additional HW control
    state where the device can receive but not transmit DMA
    * No definition of what HW needs to preserve when RESUMING toggles
    off - (eg today SET_IRQS must work, what else?).
    * In general, how do IRQs work with future non-trapping IMS?
    * Dealing with pending IOMMU PRI faults during migration
  * Problems identified with the !RUNNING state:
    * When there are multiple devices in a user context (VM), we can't
    atomically move all devices to the !_RUNNING state concurrently.
      * Suggests the current uAPI has a usage restriction for
      environments that do not make use of peer-to-peer DMA (ie. we
      can't have a device generating DMA to a p2p target that cannot
      accept it - especially if error response from target can generate
      host fatal conditions)
      * Possible userspace implications:
        * VMs could be limited to a single device to guarantee that no
        p2p exists - non-vfio devices generating physical p2p DMA in the
        future is a concern
        * Hypervisor may skip creating p2p DMA mappings, creating a
        choice whether the VM supports migration or p2p
      * Jason proposed a new NDMA (no-dma) state that seems to match the
      mlx5 implementation of "quiesce" vs "freeze" states, where NDMA
      would indicate the device cannot generate DMA or interrupts such
      that once userspace places all devices into the (NDMA | RUNNING)
      state the environment is fully quiesced.  A flag or capability on
      the migration region could indicate support for this feature.
      * Alex proposed that this could be equally resolved within the
      current device states if !RUNNING becomes the quiescent point
      where the device stops generating DMA and interrupts, with a
      requirement that the user moves all devices to !RUNNING before
      collecting device migration data (as indicated by reading
      pending_bytes) or else risk corrupting the migration data, which
      the device could indicate via an errno in the migration process.
      A flag or capability would still be required to indicate this
      support.
        * Jason does not favor this approach, objecting that the mode
        transition is implicit, and still needs qemu changes anyhow
    * In general, what operations or accesses is the user restricted
    from performing on the device while !RUNNING
      * Jason has proposed very restricted access (essentially none
      beyond the migration region itself), including no MMIO access
      <20211028234750.GP2744544@nvidia.com>. This essentially imposes
      a device transition to an intermediate state between SAVING and
      RUNNING.
        * Alex requested a formal uAPI update defining what accesses are
        allowed, including which regions and ioctls.
        * The existing uAPI does not require any such transition to a
        "null" state or TBD new device state bit.  QEMU currently
        expects access to config space and the ability to call SET_IRQS
        and create mmaps while in the RESUMING state, without the
        RUNNING bit set.  Restoring MSI-X interrupt configuration
        necessarily requires MMIO access to the device.
        * Jason suggested a new device state bit and user protocol to
        account for this, where the device is in a !RUNNING and
        !RESUMING state, but to some degree becomes manipulable via
        device regions and ioctls.  No compatibility mechanism proposed.
        * Alex suggested that this is potentially supportable via a spec
        clarification that requires the device migration data to be
        written to completion before userspace performs other region or
        ioctl access to the device. (mlx5's driver is designed to not
        inspect the migration blob itself, so it can't detect the
        "end". The migration blob is finished when mlx5 sees RESUMING
        clear.)
    * PRI into the guest (guest user process SVA) has a sequencing
    problem with RUNNING: we cannot migrate a vIOMMU in the middle of a
    page fault, and must stop and flush faults before stopping vCPUs
  * The uAPI could benefit from some more detailed documentation
  (e.g. how to use it, what to do in edge cases, ...) outside of the
  header file.
  * Trying to use the mlx5 support currently on the list has unearthed
  some problems in QEMU <please summarize :)>
  * Discussion regarding dirty tracking and how much it should be
  controlled by user space still ongoing
  * General questions:
    * How much do we want to change the uAPI and/or the documentation to
    accommodate what QEMU has implemented so far?
    * How much do we want to change QEMU?
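
To make the multi-device ordering problem above a bit more concrete, here is
a toy model of the two-phase stop that the NDMA proposal implies. Nothing
here is real uAPI: ST_NDMA and its value, struct model_dev and both
functions are invented for illustration, and the model only tracks state
bits in memory.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ST_RUNNING (1u << 0)  /* assumed to match the v1 _RUNNING bit */
#define ST_NDMA    (1u << 3)  /* hypothetical: proposed only, value made up */

/* Hypothetical in-memory stand-in for one assigned device. */
struct model_dev {
	uint32_t state;
};

/*
 * Phase 1: every device stops initiating DMA and interrupts but keeps
 * _RUNNING set, so it can still accept p2p DMA from peers that have not
 * been quiesced yet.
 */
static void quiesce_all(struct model_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		devs[i].state |= ST_NDMA;
}

/*
 * Phase 2: clear _RUNNING, but only once all devices are in NDMA.
 * Returns false and changes nothing if any device may still generate DMA.
 */
static bool freeze_all(struct model_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!(devs[i].state & ST_NDMA))
			return false;
	for (size_t i = 0; i < n; i++)
		devs[i].state &= ~ST_RUNNING;
	return true;
}
```

Calling freeze_all() before quiesce_all() fails in this model, which is the
explicit ordering that Jason's proposal asks userspace to follow. Under
Alex's alternative there would be no extra bit, and the same requirement
would instead attach to !RUNNING and to reading pending_bytes.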

Possible solutions
  * uAPI
    * fine as is, or
    * needs some clarifications, or
    * needs rework, which might mean a v2
  * QEMU
    * fine as is (modulo bugfixes), or
    * needs some rework, but not impacting the uAPI, or
    * needs some rework, which also needs some changes in the uAPI
  * Suggested approach:
    * Work on the documentation, and try to come up with some more
    HW-centric docs
    * Depending on that, decide how many changes we want/need to do in
    QEMU

