From: Auger Eric <eric.auger@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>,
Yan Zhao <yan.y.zhao@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>,
"cjia@nvidia.com" <cjia@nvidia.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
"Yang, Ziye" <ziye.yang@intel.com>,
"Liu, Changpeng" <changpeng.liu@intel.com>,
"Liu, Yi L" <yi.l.liu@intel.com>,
"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
"eskultet@redhat.com" <eskultet@redhat.com>,
"cohuck@redhat.com" <cohuck@redhat.com>,
"dgilbert@redhat.com" <dgilbert@redhat.com>,
"jonathan.davies@nutanix.com" <jonathan.davies@nutanix.com>,
"eauger@redhat.com" <eauger@redhat.com>,
"aik@ozlabs.ru" <aik@ozlabs.ru>,
"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
"felipe@nutanix.com" <felipe@nutanix.com>,
"Zhengxiao.zx@Alibaba-inc.com" <Zhengxiao.zx@Alibaba-inc.com>,
"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
"Ken.Xue@amd.com" <Ken.Xue@amd.com>,
"Wang, Zhi A" <zhi.a.wang@intel.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [PATCH v14 Kernel 1/7] vfio: KABI for migration interface for device state
Date: Mon, 23 Mar 2020 15:45:27 +0100 [thread overview]
Message-ID: <70f46a97-1bdc-e7f9-637d-327c54471ffc@redhat.com> (raw)
In-Reply-To: <20200319070921.565177ca@x1.home>
Hi,
On 3/19/20 2:09 PM, Alex Williamson wrote:
> On Thu, 19 Mar 2020 01:05:54 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
>
>> On Thu, Mar 19, 2020 at 11:49:26AM +0800, Alex Williamson wrote:
>>> On Wed, 18 Mar 2020 21:17:03 -0400
>>> Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>
>>>> On Thu, Mar 19, 2020 at 03:41:08AM +0800, Kirti Wankhede wrote:
>>>>> - Defined MIGRATION region type and sub-type.
>>>>>
>>>>> - Defined vfio_device_migration_info structure which will be placed at the
>>>>> 0th offset of migration region to get/set VFIO device related
>>>>> information. Defined members of structure and usage on read/write access.
>>>>>
>>>>> - Defined device states and state transition details.
>>>>>
>>>>> - Defined sequence to be followed while saving and resuming VFIO device.
>>>>>
>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>> ---
>>>>> include/uapi/linux/vfio.h | 227 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 227 insertions(+)
>>>>>
>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>> index 9e843a147ead..d0021467af53 100644
>>>>> --- a/include/uapi/linux/vfio.h
>>>>> +++ b/include/uapi/linux/vfio.h
>>>>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>>>>> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
>>>>> #define VFIO_REGION_TYPE_GFX (1)
>>>>> #define VFIO_REGION_TYPE_CCW (2)
>>>>> +#define VFIO_REGION_TYPE_MIGRATION (3)
>>>>>
>>>>> /* sub-types for VFIO_REGION_TYPE_PCI_* */
>>>>>
>>>>> @@ -379,6 +380,232 @@ struct vfio_region_gfx_edid {
>>>>> /* sub-types for VFIO_REGION_TYPE_CCW */
>>>>> #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD (1)
>>>>>
>>>>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION (1)
>>>>> +
>>>>> +/*
>>>>> + * The structure vfio_device_migration_info is placed at the 0th offset of
>>>>> + * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
>>>>> + * migration information. Field accesses from this structure are only supported
>>>>> + * at their native width and alignment. Otherwise, the result is undefined and
>>>>> + * vendor drivers should return an error.
>>>>> + *
>>>>> + * device_state: (read/write)
>>>>> + * - The user application writes to this field to inform the vendor driver
>>>>> + * about the device state to be transitioned to.
>>>>> + * - The vendor driver should take the necessary actions to change the
>>>>> + * device state. After successful transition to a given state, the
>>>>> + * vendor driver should return success on write(device_state, state)
>>>>> + * system call. If the device state transition fails, the vendor driver
>>>>> + * should return an appropriate -errno for the fault condition.
>>>>> + * - On the user application side, if the device state transition fails,
>>>>> + * that is, if write(device_state, state) returns an error, read
>>>>> + * device_state again to determine the current state of the device from
>>>>> + * the vendor driver.
>>>>> + * - The vendor driver should return previous state of the device unless
>>>>> + * the vendor driver has encountered an internal error, in which case
>>>>> + * the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
>>>>> + * - The user application must use the device reset ioctl to recover the
>>>>> + * device from VFIO_DEVICE_STATE_ERROR state. If the device is
>>>>> + * indicated to be in a valid device state by reading device_state, the
>>>>> + * user application may attempt to transition the device to any valid
>>>>> + * state reachable from the current state or terminate itself.
>>>>> + *
>>>>> + * device_state consists of 3 bits:
>>>>> + * - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
>>>>> + * it indicates the _STOP state. When the device state is changed to
>>>>> + * _STOP, driver should stop the device before write() returns.
>>>>> + * - If bit 1 is set, it indicates the _SAVING state, which means that the
>>>>> + * driver should start gathering device state information that will be
>>>>> + * provided to the VFIO user application to save the device's state.
>>>>> + * - If bit 2 is set, it indicates the _RESUMING state, which means that
>>>>> + * the driver should prepare to resume the device. Data provided through
>>>>> + * the migration region should be used to resume the device.
>>>>> + * Bits 3 - 31 are reserved for future use. To preserve them, the user
>>>>> + * application should perform a read-modify-write operation on this
>>>>> + * field when modifying the specified bits.
>>>>> + *
>>>>> + * +------- _RESUMING
>>>>> + * |+------ _SAVING
>>>>> + * ||+----- _RUNNING
>>>>> + * |||
>>>>> + * 000b => Device Stopped, not saving or resuming
>>>>> + * 001b => Device running, which is the default state
>>>>> + * 010b => Stop the device & save the device state, stop-and-copy state
>>>>> + * 011b => Device running and save the device state, pre-copy state
>>>>> + * 100b => Device stopped and the device state is resuming
>>>>> + * 101b => Invalid state
>>>>> + * 110b => Error state
>>>>> + * 111b => Invalid state
>>>>> + *
>>>>> + * State transitions:
>>>>> + *
>>>>> + * _RESUMING _RUNNING Pre-copy Stop-and-copy _STOP
>>>>> + * (100b) (001b) (011b) (010b) (000b)
>>>>> + * 0. Running or default state
>>>>> + * |
>>>>> + *
>>>>> + * 1. Normal Shutdown (optional)
>>>>> + * |------------------------------------->|
>>>>> + *
>>>>> + * 2. Save the state or suspend
>>>>> + * |------------------------->|---------->|
>>>>> + *
>>>>> + * 3. Save the state during live migration
>>>>> + * |----------->|------------>|---------->|
>>>>> + *
>>>>> + * 4. Resuming
>>>>> + * |<---------|
>>>>> + *
>>>>> + * 5. Resumed
>>>>> + * |--------->|
>>>>> + *
>>>>> + * 0. Default state of VFIO device is _RUNNNG when the user application starts.
>>>>> + * 1. During normal shutdown of the user application, the user application may
>>>>> + * optionally change the VFIO device state from _RUNNING to _STOP. This
>>>>> + * transition is optional. The vendor driver must support this transition but
>>>>> + * must not require it.
>>>>> + * 2. When the user application saves state or suspends the application, the
>>>>> + * device state transitions from _RUNNING to stop-and-copy and then to _STOP.
>>>>> + * On state transition from _RUNNING to stop-and-copy, driver must stop the
>>>>> + * device, save the device state and send it to the application through the
>>>>> + * migration region. The sequence to be followed for such transition is given
>>>>> + * below.
>>>>> + * 3. In live migration of user application, the state transitions from _RUNNING
>>>>> + * to pre-copy, to stop-and-copy, and to _STOP.
>>>>> + * On state transition from _RUNNING to pre-copy, the driver should start
>>>>> + * gathering the device state while the application is still running and send
>>>>> + * the device state data to application through the migration region.
>>>>> + * On state transition from pre-copy to stop-and-copy, the driver must stop
>>>>> + * the device, save the device state and send it to the user application
>>>>> + * through the migration region.
>>>>> + * Vendor drivers must support the pre-copy state even for implementations
>>>>> + * where no data is provided to the user before the stop-and-copy state. The
>>>>> + * user must not be required to consume all migration data before the device
>>>>> + * transitions to a new state, including the stop-and-copy state.
>>>>> + * The sequence to be followed for above two transitions is given below.
>>>>> + * 4. To start the resuming phase, the device state should be transitioned from
>>>>> + * the _RUNNING to the _RESUMING state.
>>>>> + * In the _RESUMING state, the driver should use the device state data
>>>>> + * received through the migration region to resume the device.
>>>>> + * 5. After providing saved device data to the driver, the application should
>>>>> + * change the state from _RESUMING to _RUNNING.
>>>>> + *
>>>>> + * reserved:
>>>>> + * Reads on this field return zero and writes are ignored.
>>>>> + *
>>>>> + * pending_bytes: (read only)
>>>>> + * The number of pending bytes still to be migrated from the vendor driver.
>>>>> + *
>>>>> + * data_offset: (read only)
>>>>> + * The user application should read data_offset in the migration region
>>>>> + * from where the user application should read the device data during the
>>>>> + * _SAVING state or write the device data during the _RESUMING state. See
>>>>> + * below for details of sequence to be followed.
>>>>> + *
>>>>> + * data_size: (read/write)
>>>>> + * The user application should read data_size to get the size in bytes of
>>>>> + * the data copied in the migration region during the _SAVING state and
>>>>> + * write the size in bytes of the data copied in the migration region
>>>>> + * during the _RESUMING state.
>>>>> + *
>>>>> + * The format of the migration region is as follows:
>>>>> + * ------------------------------------------------------------------
>>>>> + * |vfio_device_migration_info| data section |
>>>>> + * | | /////////////////////////////// |
>>>>> + * ------------------------------------------------------------------
>>>>> + * ^ ^
>>>>> + * offset 0-trapped part data_offset
>>>>> + *
>>>>> + * The structure vfio_device_migration_info is always followed by the data
>>>>> + * section in the region, so data_offset will always be nonzero. The offset
>>>>> + * from where the data is copied is decided by the kernel driver. The data
>>>>> + * section can be trapped, mapped, or partitioned, depending on how the kernel
>>>>> + * driver defines the data section. The data section partition can be defined
>>>>> + * as mapped by the sparse mmap capability. If mmapped, data_offset should be
>>>>> + * page aligned, whereas initial section which contains the
>>>>> + * vfio_device_migration_info structure, might not end at the offset, which is
>>>>> + * page aligned. The user is not required to access through mmap regardless
>>>>> + * of the capabilities of the region mmap.
>>>>> + * The vendor driver should determine whether and how to partition the data
>>>>> + * section. The vendor driver should return data_offset accordingly.
>>>>> + *
>>>>> + * The sequence to be followed for the _SAVING|_RUNNING device state or
>>>>> + * pre-copy phase and for the _SAVING device state or stop-and-copy phase is as
>>>>> + * follows:
>>>>> + * a. Read pending_bytes, indicating the start of a new iteration to get device
>>>>> + * data. Repeated read on pending_bytes at this stage should have no side
>>>>> + * effects.
>>>>> + * If pending_bytes == 0, the user application should not iterate to get data
>>>>> + * for that device.
>>>>> + * If pending_bytes > 0, perform the following steps.
>>>>> + * b. Read data_offset, indicating that the vendor driver should make data
>>>>> + * available through the data section. The vendor driver should return this
>>>>> + * read operation only after data is available from (region + data_offset)
>>>>> + * to (region + data_offset + data_size).
>>>>> + * c. Read data_size, which is the amount of data in bytes available through
>>>>> + * the migration region.
>>>>> + * Read on data_offset and data_size should return the offset and size of
>>>>> + * the current buffer if the user application reads data_offset and
>>>>> + * data_size more than once here.
>>>> If data region is mmaped, merely reading data_offset and data_size
>>>> cannot let kernel know what are correct values to return.
>>>> Consider to add a read operation which is trapped into kernel to let
>>>> kernel exactly know it needs to move to the next offset and update data_size
>>>> ?
>>>
>>> Both operations b. and c. above are to trapped registers, operation d.
>>> below may potentially be to an mmap'd area, which is why we have step
>>> f. which indicates to the vendor driver that the data has been
>>> consumed. Does that address your concern? Thanks,
>>>
>> No. :)
>> the problem is about semantics of data_offset, data_size, and
>> pending_bytes.
>> b and c do not tell kernel that the data is read by user.
>> so, without knowing step d happen, kernel cannot update pending_bytes to
>> be returned in step f.
>
> Sorry, I'm still not understanding, I see step f. as the indicator
> you're looking for. The user reads pending_bytes to indicate the data
> in the migration area has been consumed. The vendor driver updates its
> internal state on that read and returns the updated value for
> pending_bytes. Thanks,
That's my understanding too. f) tells the data was consumed.
Thanks
Eric
>
> Alex
>
>>>>> + * d. Read data_size bytes of data from (region + data_offset) from the
>>>>> + * migration region.
>>>>> + * e. Process the data.
>>>>> + * f. Read pending_bytes, which indicates that the data from the previous
>>>>> + * iteration has been read. If pending_bytes > 0, go to step b.
>>>>> + *
>>>>> + * If an error occurs during the above sequence, the vendor driver can return
>>>>> + * an error code for next read() or write() operation, which will terminate the
>>>>> + * loop. The user application should then take the next necessary action, for
>>>>> + * example, failing migration or terminating the user application.
>>>>> + *
>>>>> + * The user application can transition from the _SAVING|_RUNNING
>>>>> + * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
>>>>> + * number of pending bytes. The user application should iterate in _SAVING
>>>>> + * (stop-and-copy) until pending_bytes is 0.
>>>>> + *
>>>>> + * The sequence to be followed while _RESUMING device state is as follows:
>>>>> + * While data for this device is available, repeat the following steps:
>>>>> + * a. Read data_offset from where the user application should write data.
>>>>> + * b. Write migration data starting at the migration region + data_offset for
>>>>> + * the length determined by data_size from the migration source.
>>>>> + * c. Write data_size, which indicates to the vendor driver that data is
>>>>> + * written in the migration region. Vendor driver should apply the
>>>>> + * user-provided migration region data to the device resume state.
>>>>> + *
>>>>> + * For the user application, data is opaque. The user application should write
>>>>> + * data in the same order as the data is received and the data should be of
>>>>> + * same transaction size at the source.
>>>>> + */
>>>>> +
>>>>> +struct vfio_device_migration_info {
>>>>> + __u32 device_state; /* VFIO device state */
>>>>> +#define VFIO_DEVICE_STATE_STOP (0)
>>>>> +#define VFIO_DEVICE_STATE_RUNNING (1 << 0)
>>>>> +#define VFIO_DEVICE_STATE_SAVING (1 << 1)
>>>>> +#define VFIO_DEVICE_STATE_RESUMING (1 << 2)
>>>>> +#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \
>>>>> + VFIO_DEVICE_STATE_SAVING | \
>>>>> + VFIO_DEVICE_STATE_RESUMING)
>>>>> +
>>>>> +#define VFIO_DEVICE_STATE_VALID(state) \
>>>>> + (state & VFIO_DEVICE_STATE_RESUMING ? \
>>>>> + (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
>>>>> +
>>>>> +#define VFIO_DEVICE_STATE_IS_ERROR(state) \
>>>>> + ((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
>>>>> + VFIO_DEVICE_STATE_RESUMING))
>>>>> +
>>>>> +#define VFIO_DEVICE_STATE_SET_ERROR(state) \
>>>>> + ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
>>>>> + VFIO_DEVICE_STATE_RESUMING)
>>>>> +
>>>>> + __u32 reserved;
>>>>> + __u64 pending_bytes;
>>>>> + __u64 data_offset;
>>>>> + __u64 data_size;
>>>>> +} __attribute__((packed));
>>>>> +
>>>>> /*
>>>>> * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>>>> * which allows direct access to non-MSIX registers which happened to be within
>>>>> --
>>>>> 2.7.0
>>>>>
>>>>
>>>
>>
>
next prev parent reply other threads:[~2020-03-23 14:45 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-03-18 19:41 [PATCH v14 Kernel 0/7] KABIs to support migration for VFIO devices Kirti Wankhede
2020-03-18 19:41 ` [PATCH v14 Kernel 1/7] vfio: KABI for migration interface for device state Kirti Wankhede
2020-03-19 1:17 ` Yan Zhao
2020-03-19 3:49 ` Alex Williamson
2020-03-19 5:05 ` Yan Zhao
2020-03-19 13:09 ` Alex Williamson
2020-03-20 1:30 ` Yan Zhao
2020-03-20 2:34 ` Alex Williamson
2020-03-20 3:06 ` Yan Zhao
2020-03-20 4:09 ` Alex Williamson
2020-03-20 4:20 ` Yan Zhao
2020-03-23 14:45 ` Auger Eric [this message]
2020-03-23 11:45 ` Auger Eric
2020-03-24 19:14 ` Kirti Wankhede
2020-03-18 19:41 ` [PATCH v14 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages Kirti Wankhede
2020-03-23 11:59 ` Auger Eric
2020-03-18 19:41 ` [PATCH v14 Kernel 3/7] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
2020-03-19 3:44 ` Alex Williamson
2020-03-18 19:41 ` [PATCH v14 Kernel 4/7] vfio iommu: Implementation of ioctl " Kirti Wankhede
2020-03-19 3:06 ` Yan Zhao
2020-03-19 4:01 ` Alex Williamson
2020-03-19 4:15 ` Yan Zhao
2020-03-19 4:40 ` Alex Williamson
2020-03-19 6:15 ` Yan Zhao
2020-03-19 13:06 ` Alex Williamson
2020-03-19 16:57 ` Kirti Wankhede
2020-03-20 0:51 ` Yan Zhao
2020-03-19 3:45 ` Alex Williamson
2020-03-19 14:52 ` Kirti Wankhede
2020-03-19 16:22 ` Alex Williamson
2020-03-19 20:25 ` Kirti Wankhede
2020-03-19 20:54 ` Alex Williamson
2020-03-19 18:57 ` Kirti Wankhede
2020-03-18 19:41 ` [PATCH v14 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
2020-03-19 3:45 ` Alex Williamson
2020-03-20 8:35 ` Yan Zhao
2020-03-20 15:40 ` Alex Williamson
2020-03-20 15:47 ` Alex Williamson
2020-03-20 19:14 ` Kirti Wankhede
2020-03-20 19:28 ` Alex Williamson
2020-03-23 1:10 ` Yan Zhao
2020-03-18 19:41 ` [PATCH v14 Kernel 6/7] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2020-03-18 19:41 ` [PATCH v14 Kernel 7/7] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede
2020-03-19 3:45 ` Alex Williamson
2020-03-19 6:24 ` Yan Zhao
2020-03-20 19:41 ` Alex Williamson
2020-03-23 2:43 ` Yan Zhao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=70f46a97-1bdc-e7f9-637d-327c54471ffc@redhat.com \
--to=eric.auger@redhat.com \
--cc=Ken.Xue@amd.com \
--cc=Zhengxiao.zx@Alibaba-inc.com \
--cc=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=changpeng.liu@intel.com \
--cc=cjia@nvidia.com \
--cc=cohuck@redhat.com \
--cc=dgilbert@redhat.com \
--cc=eauger@redhat.com \
--cc=eskultet@redhat.com \
--cc=felipe@nutanix.com \
--cc=jonathan.davies@nutanix.com \
--cc=kevin.tian@intel.com \
--cc=kvm@vger.kernel.org \
--cc=kwankhede@nvidia.com \
--cc=mlevitsk@redhat.com \
--cc=pasic@linux.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=shuangtai.tst@alibaba-inc.com \
--cc=yan.y.zhao@intel.com \
--cc=yi.l.liu@intel.com \
--cc=zhi.a.wang@intel.com \
--cc=ziye.yang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).