From: Kirti Wankhede <kwankhede@nvidia.com>
To: Yan Zhao <yan.y.zhao@intel.com>,
Alex Williamson <alex.williamson@redhat.com>
Cc: "Zhengxiao.zx@Alibaba-inc.com" <Zhengxiao.zx@Alibaba-inc.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
"Liu, Yi L" <yi.l.liu@intel.com>,
"cjia@nvidia.com" <cjia@nvidia.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"eskultet@redhat.com" <eskultet@redhat.com>,
"Yang, Ziye" <ziye.yang@intel.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"cohuck@redhat.com" <cohuck@redhat.com>,
"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
"dgilbert@redhat.com" <dgilbert@redhat.com>,
"Wang, Zhi A" <zhi.a.wang@intel.com>,
"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
"aik@ozlabs.ru" <aik@ozlabs.ru>,
"eauger@redhat.com" <eauger@redhat.com>,
"felipe@nutanix.com" <felipe@nutanix.com>,
"jonathan.davies@nutanix.com" <jonathan.davies@nutanix.com>,
"Liu, Changpeng" <changpeng.liu@intel.com>,
"Ken.Xue@amd.com" <Ken.Xue@amd.com>
Subject: Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
Date: Thu, 14 Nov 2019 00:32:55 +0530 [thread overview]
Message-ID: <d0be166b-9ffe-645d-de74-b48855995326@nvidia.com> (raw)
In-Reply-To: <20191113032332.GB18001@joy-OptiPlex-7040>
On 11/13/2019 8:53 AM, Yan Zhao wrote:
> On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote:
>> On Tue, 12 Nov 2019 22:33:36 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>>> - Defined MIGRATION region type and sub-type.
>>> - Used 3 bits to define VFIO device states.
>>> Bit 0 => _RUNNING
>>> Bit 1 => _SAVING
>>> Bit 2 => _RESUMING
>>> Combination of these bits defines VFIO device's state during migration
>>> _RUNNING => Normal VFIO device running state. When its reset, it
>>> indicates _STOPPED state. when device is changed to
>>> _STOPPED, driver should stop device before write()
>>> returns.
>>> _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>>> start saving state of device i.e. pre-copy state
>>> _SAVING => vCPUs are stopped, VFIO device should be stopped, and
>>
>> s/should/must/
>>
>>> save device state,i.e. stop-n-copy state
>>> _RESUMING => VFIO device resuming state.
>>> _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
>>
>> A table might be useful here and in the uapi header to indicate valid
>> states:
>>
>> | _RESUMING | _SAVING | _RUNNING | Description
>> +-----------+---------+----------+------------------------------------------
>> | 0 | 0 | 0 | Stopped, not saving or resuming (a)
>> +-----------+---------+----------+------------------------------------------
>> | 0 | 0 | 1 | Running, default state
>> +-----------+---------+----------+------------------------------------------
>> | 0 | 1 | 0 | Stopped, migration interface in save mode
>> +-----------+---------+----------+------------------------------------------
>> | 0 | 1 | 1 | Running, save mode interface, iterative
>> +-----------+---------+----------+------------------------------------------
>> | 1 | 0 | 0 | Stopped, migration resume interface active
>> +-----------+---------+----------+------------------------------------------
>> | 1 | 0 | 1 | Invalid (b)
>> +-----------+---------+----------+------------------------------------------
>> | 1 | 1 | 0 | Invalid (c)
>> +-----------+---------+----------+------------------------------------------
>> | 1 | 1 | 1 | Invalid (d)
>>
>> I think we need to consider whether we define (a) as generally
>> available, for instance we might want to use it for diagnostics or a
>> fatal error condition outside of migration.
>>
We have to set it as init state. I'll add this it.
>> Are there hidden assumptions between state transitions here or are
>> there specific next possible state diagrams that we need to include as
>> well?
>>
>> I'm curious if Intel agrees with the states marked invalid with their
>> push for post-copy support.
>>
> hi Alex and Kirti,
> Actually, for postcopy, I think we anyway need an extra POSTCOPY state
> introduced. Reasons as below:
> - in the target side, _RSESUMING state is set in the beginning of
> migration. It cannot be used as a state to inform device of that
> currently it's in postcopy state and device DMAs are to be trapped and
> pre-faulted.
> we also cannot use state (_RESUMING + _RUNNING) as an indicator of
> postcopy, because before device & vm running in target side, some device
> state are already loaded (e.g. page tables, pending workloads),
> target side can do pre-pagefault at that period before target vm up.
> - in the source side, after device is stopped, postcopy needs saving
> device state only (as compared to device state + remaining dirty
> pages in precopy). state (!_RUNNING + _SAVING) here again cannot
> differentiate precopy and postcopy here.
>
>>> Bits 3 - 31 are reserved for future use. User should perform
>>> read-modify-write operation on this field.
>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>> offset of migration region to get/set VFIO device related information.
>>> Defined members of structure and usage on read/write access:
>>> * device_state: (read/write)
>>> To convey VFIO device state to be transitioned to. Only 3 bits are
>>> used as of now, Bits 3 - 31 are reserved for future use.
>>> * pending bytes: (read only)
>>> To get pending bytes yet to be migrated for VFIO device.
>>> * data_offset: (read only)
>>> To get data offset in migration region from where data exist
>>> during _SAVING and from where data should be written by user space
>>> application during _RESUMING state.
>>> * data_size: (read/write)
>>> To get and set size in bytes of data copied in migration region
>>> during _SAVING and _RESUMING state.
>>>
>>> Migration region looks like:
>>> ------------------------------------------------------------------
>>> |vfio_device_migration_info| data section |
>>> | | /////////////////////////////// |
>>> ------------------------------------------------------------------
>>> ^ ^
>>> offset 0-trapped part data_offset
>>>
>>> Structure vfio_device_migration_info is always followed by data section
>>> in the region, so data_offset will always be non-0. Offset from where data
>>> to be copied is decided by kernel driver, data section can be trapped or
>>> mapped depending on how kernel driver defines data section.
>>> Data section partition can be defined as mapped by sparse mmap capability.
>>> If mmapped, then data_offset should be page aligned, where as initial
>>> section which contain vfio_device_migration_info structure might not end
>>> at offset which is page aligned.
>>> Vendor driver should decide whether to partition data section and how to
>>> partition the data section. Vendor driver should return data_offset
>>> accordingly.
>>>
>>> For user application, data is opaque. User should write data in the same
>>> order as received.
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>> ---
>>> include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 108 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index 9e843a147ead..35b09427ad9f 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>>> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
>>> #define VFIO_REGION_TYPE_GFX (1)
>>> #define VFIO_REGION_TYPE_CCW (2)
>>> +#define VFIO_REGION_TYPE_MIGRATION (3)
>>>
>>> /* sub-types for VFIO_REGION_TYPE_PCI_* */
>>>
>>> @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
>>> /* sub-types for VFIO_REGION_TYPE_CCW */
>>> #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD (1)
>>>
>>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
>>> +#define VFIO_REGION_SUBTYPE_MIGRATION (1)
>>> +
>>> +/*
>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>> + * information. Field accesses from this structure are only supported at their
>>> + * native width and alignment, otherwise the result is undefined and vendor
>>> + * drivers should return an error.
>>> + *
>>> + * device_state: (read/write)
>>> + * To indicate vendor driver the state VFIO device should be transitioned
>>> + * to. If device state transition fails, write on this field return error.
>>> + * It consists of 3 bits:
>>> + * - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
>>
>> Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
>> ambiguous.
Ok. Updating it.
>>
>>> + * _STOPPED state. When device is changed to _STOPPED, driver should stop
>>> + * device before write() returns.
>>> + * - If bit 1 set, indicates _SAVING state. When set, that indicates driver
>>> + * should start gathering device state information which will be provided
>>> + * to VFIO user space application to save device's state.
>>> + * - If bit 2 set, indicates _RESUMING state. When set, that indicates
>>> + * prepare to resume device, data provided through migration region
>>> + * should be used to resume device.
>>> + * Bits 3 - 31 are reserved for future use. User should perform
>>> + * read-modify-write operation on this field.
>>> + * _SAVING and _RESUMING bits set at the same time is invalid state.
>>> + * Similarly _RUNNING and _RESUMING bits set is invalid state.
>>> + *
>>> + * pending bytes: (read only)
>>> + * Number of pending bytes yet to be migrated from vendor driver
>>> + *
>>> + * data_offset: (read only)
>>> + * User application should read data_offset in migration region from where
>>> + * user application should read device data during _SAVING state or write
>>> + * device data during _RESUMING state. See below for detail of sequence to
>>> + * be followed.
>>> + *
>>> + * data_size: (read/write)
>>> + * User application should read data_size to get size of data copied in
>>> + * bytes in migration region during _SAVING state and write size of data
>>> + * copied in bytes in migration region during _RESUMING state.
>>> + *
>>> + * Migration region looks like:
>>> + * ------------------------------------------------------------------
>>> + * |vfio_device_migration_info| data section |
>>> + * | | /////////////////////////////// |
>>> + * ------------------------------------------------------------------
>>> + * ^ ^
>>> + * offset 0-trapped part data_offset
>>> + *
>>> + * Structure vfio_device_migration_info is always followed by data section in
>>> + * the region, so data_offset will always be non-0. Offset from where data is
>>> + * copied is decided by kernel driver, data section can be trapped or mapped
>>> + * or partitioned, depending on how kernel driver defines data section.
>>> + * Data section partition can be defined as mapped by sparse mmap capability.
>>> + * If mmapped, then data_offset should be page aligned, where as initial section
>>> + * which contain vfio_device_migration_info structure might not end at offset
>>> + * which is page aligned.
>>
>> "The user is not required to to access via mmap regardless of the
>> region mmap capabilities."
>>
> But once the user decides to access via mmap, it has to read data of
> data_size each time, otherwise the vendor driver has no idea of how many
> data are already read from user. Agree?
>
that's right.
>>> + * Vendor driver should decide whether to partition data section and how to
>>> + * partition the data section. Vendor driver should return data_offset
>>> + * accordingly.
>>> + *
>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>> + * and for _SAVING device state or stop-and-copy phase:
>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>> + * Kernel driver should return this read operation only after writing data to
>>> + * staging buffer is done.
> May I know under what condition this data_offset will be changed per
> each iteration from a-f ?
>
Its upto vendor driver, if vendor driver maintains multiple partitions
in data section.
>>
>> "staging buffer" implies a vendor driver implementation, perhaps we
>> could just state that data is available from (region + data_offset) to
>> (region + data_offset + data_size) upon return of this read operation.
>>
Makes sense. Updating it.
>>> + * c. read data_size, amount of data in bytes written by vendor driver in
>>> + * migration region.
>>> + * d. read data_size bytes of data from data_offset in the migration region.
>>> + * e. process data.
>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
>>> + * operation from migration region for previous iteration is done.
>>
>> I think this indicate that step (f) should be to read pending_bytes, the
>> read sequence is not complete until this step. Optionally the user can
>> then proceed to step (b). There are no read side-effects of (a) afaict.
>>
I tried to reword this sequence to be more specific:
* Sequence to be followed for _SAVING|_RUNNING device state or pre-copy
* phase and for _SAVING device state or stop-and-copy phase:
* a. read pending_bytes, indicates start of new iteration to get device
* data. If there was previous iteration, then this read operation
* indicates previous iteration is done. If pending_bytes > 0, go
* through below steps.
* b. read data_offset, indicates kernel driver to make data available
* through data section. Kernel driver should return this read
* operation only after data is available from (region + data_offset)
* to (region + data_offset + data_size).
* c. read data_size, amount of data in bytes available through migration
* region.
* d. read data of data_size bytes from (region + data_offset) from
* migration region.
* e. process data.
* f. Loop through a to e.
Hope this is more clear.
>> Is the use required to reach pending_bytes == 0 before changing
>> device_state, particularly transitioning to !_RUNNING?
No, its not necessary to reach till pending_bytes==0 in pre-copy phase.
>> Presumably the
>> user can exit this sequence at any time by clearing _SAVING.
In that case device state data is not complete, which will result in not
able to resume device with that data.
In stop-and-copy phase, user should iterate till pending_bytes is 0.
>>
>>> + *
>>> + * Sequence to be followed while _RESUMING device state:
>>> + * While data for this device is available, repeat below steps:
>>> + * a. read data_offset from where user application should write data.
> before proceed to step b, need to read data_size from vendor driver to determine
> the max len of data to write. I think it's necessary in such a condition
> that source vendor driver and target vendor driver do not offer data sections of
> the same size. e.g. in source side, the data section is of size 100M,
> but in target side, the data section is only of 50M size.
> rather than simply fail, loop and write seems better.
>
Makes sense. Doing this change for next version.
> Thanks
> Yan
>>> + * b. write data of data_size to migration region from data_offset.
>>> + * c. write data_size which indicates vendor driver that data is written in
>>> + * staging buffer. Vendor driver should read this data from migration
>>> + * region and resume device's state.
>>
>> The device defaults to _RUNNING state, so a prerequisite is to set
>> _RESUMING and clear _RUNNING, right?
Yes.
>>
>>> + *
>>> + * For user application, data is opaque. User should write data in the same
>>> + * order as received.
>>> + */
>>> +
>>> +struct vfio_device_migration_info {
>>> + __u32 device_state; /* VFIO device state */
>>> +#define VFIO_DEVICE_STATE_RUNNING (1 << 0)
>>> +#define VFIO_DEVICE_STATE_SAVING (1 << 1)
>>> +#define VFIO_DEVICE_STATE_RESUMING (1 << 2)
>>> +#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \
>>> + VFIO_DEVICE_STATE_SAVING | \
>>> + VFIO_DEVICE_STATE_RESUMING)
>>> +
>>> +#define VFIO_DEVICE_STATE_INVALID_CASE1 (VFIO_DEVICE_STATE_SAVING | \
>>> + VFIO_DEVICE_STATE_RESUMING)
>>> +
>>> +#define VFIO_DEVICE_STATE_INVALID_CASE2 (VFIO_DEVICE_STATE_RUNNING | \
>>> + VFIO_DEVICE_STATE_RESUMING)
>>
>> These seem difficult to use, maybe we just need a
>> VFIO_DEVICE_STATE_VALID macro?
>>
>> #define VFIO_DEVICE_STATE_VALID(state) \
>> (state & VFIO_DEVICE_STATE_RESUMING ? \
>> (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
>>
This will not be work when use of other bits gets added in future.
That's the reason I preferred to add individual invalid states which
user should check.
Thanks,
Kirti
>> Thanks,
>> Alex
>>
>>> + __u32 reserved;
>>> + __u64 pending_bytes;
>>> + __u64 data_offset;
>>> + __u64 data_size;
>>> +} __attribute__((packed));
>>> +
>>> /*
>>> * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>> * which allows direct access to non-MSIX registers which happened to be within
>>
next prev parent reply other threads:[~2019-11-13 19:06 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-11-12 17:03 [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices Kirti Wankhede
2019-11-12 17:03 ` [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
2019-11-13 3:23 ` Yan Zhao
2019-11-13 19:02 ` Kirti Wankhede [this message]
2019-11-14 0:36 ` Yan Zhao
2019-11-14 18:55 ` Kirti Wankhede
2019-11-13 10:24 ` Cornelia Huck
2019-11-13 18:27 ` Alex Williamson
2019-11-13 19:29 ` Kirti Wankhede
2019-11-13 19:48 ` Alex Williamson
2019-11-13 20:17 ` Kirti Wankhede
2019-11-13 20:40 ` Alex Williamson
2019-11-14 18:49 ` Kirti Wankhede
2019-11-12 17:03 ` [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
2019-11-13 19:37 ` Kirti Wankhede
2019-11-13 20:07 ` Alex Williamson
2019-11-14 18:56 ` Kirti Wankhede
2019-11-14 21:06 ` Alex Williamson
2019-11-15 2:40 ` Yan Zhao
2019-11-15 3:21 ` Alex Williamson
2019-11-15 5:10 ` Tian, Kevin
2019-11-19 23:16 ` Alex Williamson
2019-11-20 1:04 ` Tian, Kevin
2019-11-20 1:51 ` Yan Zhao
2019-11-26 0:57 ` Yan Zhao
2019-12-03 18:04 ` Alex Williamson
2019-12-04 18:10 ` Kirti Wankhede
2019-12-04 18:34 ` Alex Williamson
2019-12-05 1:28 ` Yan Zhao
2019-12-05 5:42 ` Kirti Wankhede
2019-12-05 5:47 ` Yan Zhao
2019-12-05 5:56 ` Alex Williamson
2019-12-05 6:19 ` Kirti Wankhede
2019-12-05 6:40 ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
2019-11-13 19:52 ` Kirti Wankhede
2019-11-13 20:22 ` Alex Williamson
2019-11-14 18:56 ` Kirti Wankhede
2019-11-14 21:08 ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 4/5] vfio iommu: Implementation of ioctl to get dirty pages bitmap Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 5/5] vfio iommu: Implementation of ioctl to get dirty bitmap before unmap Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d0be166b-9ffe-645d-de74-b48855995326@nvidia.com \
--to=kwankhede@nvidia.com \
--cc=Ken.Xue@amd.com \
--cc=Zhengxiao.zx@Alibaba-inc.com \
--cc=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=changpeng.liu@intel.com \
--cc=cjia@nvidia.com \
--cc=cohuck@redhat.com \
--cc=dgilbert@redhat.com \
--cc=eauger@redhat.com \
--cc=eskultet@redhat.com \
--cc=felipe@nutanix.com \
--cc=jonathan.davies@nutanix.com \
--cc=kevin.tian@intel.com \
--cc=kvm@vger.kernel.org \
--cc=mlevitsk@redhat.com \
--cc=pasic@linux.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=shuangtai.tst@alibaba-inc.com \
--cc=yan.y.zhao@intel.com \
--cc=yi.l.liu@intel.com \
--cc=zhi.a.wang@intel.com \
--cc=ziye.yang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).