From: Yan Zhao <yan.y.zhao@intel.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: "Zhengxiao.zx@Alibaba-inc.com" <Zhengxiao.zx@Alibaba-inc.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
"Liu, Yi L" <yi.l.liu@intel.com>,
"cjia@nvidia.com" <cjia@nvidia.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"eskultet@redhat.com" <eskultet@redhat.com>,
"Yang, Ziye" <ziye.yang@intel.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"cohuck@redhat.com" <cohuck@redhat.com>,
"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
"dgilbert@redhat.com" <dgilbert@redhat.com>,
"Wang, Zhi A" <zhi.a.wang@intel.com>,
"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
"aik@ozlabs.ru" <aik@ozlabs.ru>,
Alex Williamson <alex.williamson@redhat.com>,
"eauger@redhat.com" <eauger@redhat.com>,
"felipe@nutanix.com" <felipe@nutanix.com>,
"jonathan.davies@nutanix.com" <jonathan.davies@nutanix.com>,
"Liu, Changpeng" <changpeng.liu@intel.com>,
"Ken.Xue@amd.com" <Ken.Xue@amd.com>
Subject: Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
Date: Wed, 13 Nov 2019 19:36:15 -0500 [thread overview]
Message-ID: <20191114003615.GD18001@joy-OptiPlex-7040> (raw)
In-Reply-To: <d0be166b-9ffe-645d-de74-b48855995326@nvidia.com>
On Thu, Nov 14, 2019 at 03:02:55AM +0800, Kirti Wankhede wrote:
>
>
> On 11/13/2019 8:53 AM, Yan Zhao wrote:
> > On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote:
> >> On Tue, 12 Nov 2019 22:33:36 +0530
> >> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>
> >>> - Defined MIGRATION region type and sub-type.
> >>> - Used 3 bits to define VFIO device states.
> >>> Bit 0 => _RUNNING
> >>> Bit 1 => _SAVING
> >>> Bit 2 => _RESUMING
> >>> Combination of these bits defines VFIO device's state during migration
> >>> _RUNNING => Normal VFIO device running state. When its reset, it
> >>> indicates _STOPPED state. when device is changed to
> >>> _STOPPED, driver should stop device before write()
> >>> returns.
> >>> _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >>> start saving state of device i.e. pre-copy state
> >>> _SAVING => vCPUs are stopped, VFIO device should be stopped, and
> >>
> >> s/should/must/
> >>
> >>> save device state,i.e. stop-n-copy state
> >>> _RESUMING => VFIO device resuming state.
> >>> _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
> >>
> >> A table might be useful here and in the uapi header to indicate valid
> >> states:
> >>
> >> | _RESUMING | _SAVING | _RUNNING | Description
> >> +-----------+---------+----------+------------------------------------------
> >> | 0 | 0 | 0 | Stopped, not saving or resuming (a)
> >> +-----------+---------+----------+------------------------------------------
> >> | 0 | 0 | 1 | Running, default state
> >> +-----------+---------+----------+------------------------------------------
> >> | 0 | 1 | 0 | Stopped, migration interface in save mode
> >> +-----------+---------+----------+------------------------------------------
> >> | 0 | 1 | 1 | Running, save mode interface, iterative
> >> +-----------+---------+----------+------------------------------------------
> >> | 1 | 0 | 0 | Stopped, migration resume interface active
> >> +-----------+---------+----------+------------------------------------------
> >> | 1 | 0 | 1 | Invalid (b)
> >> +-----------+---------+----------+------------------------------------------
> >> | 1 | 1 | 0 | Invalid (c)
> >> +-----------+---------+----------+------------------------------------------
> >> | 1 | 1 | 1 | Invalid (d)
> >>
> >> I think we need to consider whether we define (a) as generally
> >> available, for instance we might want to use it for diagnostics or a
> >> fatal error condition outside of migration.
> >>
>
> We have to set it as init state. I'll add this it.
>
> >> Are there hidden assumptions between state transitions here or are
> >> there specific next possible state diagrams that we need to include as
> >> well?
> >>
> >> I'm curious if Intel agrees with the states marked invalid with their
> >> push for post-copy support.
> >>
> > hi Alex and Kirti,
> > Actually, for postcopy, I think we anyway need an extra POSTCOPY state
> > introduced. Reasons as below:
> > - in the target side, _RSESUMING state is set in the beginning of
> > migration. It cannot be used as a state to inform device of that
> > currently it's in postcopy state and device DMAs are to be trapped and
> > pre-faulted.
> > we also cannot use state (_RESUMING + _RUNNING) as an indicator of
> > postcopy, because before device & vm running in target side, some device
> > state are already loaded (e.g. page tables, pending workloads),
> > target side can do pre-pagefault at that period before target vm up.
> > - in the source side, after device is stopped, postcopy needs saving
> > device state only (as compared to device state + remaining dirty
> > pages in precopy). state (!_RUNNING + _SAVING) here again cannot
> > differentiate precopy and postcopy here.
> >
> >>> Bits 3 - 31 are reserved for future use. User should perform
> >>> read-modify-write operation on this field.
> >>> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>> offset of migration region to get/set VFIO device related information.
> >>> Defined members of structure and usage on read/write access:
> >>> * device_state: (read/write)
> >>> To convey VFIO device state to be transitioned to. Only 3 bits are
> >>> used as of now, Bits 3 - 31 are reserved for future use.
> >>> * pending bytes: (read only)
> >>> To get pending bytes yet to be migrated for VFIO device.
> >>> * data_offset: (read only)
> >>> To get data offset in migration region from where data exist
> >>> during _SAVING and from where data should be written by user space
> >>> application during _RESUMING state.
> >>> * data_size: (read/write)
> >>> To get and set size in bytes of data copied in migration region
> >>> during _SAVING and _RESUMING state.
> >>>
> >>> Migration region looks like:
> >>> ------------------------------------------------------------------
> >>> |vfio_device_migration_info| data section |
> >>> | | /////////////////////////////// |
> >>> ------------------------------------------------------------------
> >>> ^ ^
> >>> offset 0-trapped part data_offset
> >>>
> >>> Structure vfio_device_migration_info is always followed by data section
> >>> in the region, so data_offset will always be non-0. Offset from where data
> >>> to be copied is decided by kernel driver, data section can be trapped or
> >>> mapped depending on how kernel driver defines data section.
> >>> Data section partition can be defined as mapped by sparse mmap capability.
> >>> If mmapped, then data_offset should be page aligned, where as initial
> >>> section which contain vfio_device_migration_info structure might not end
> >>> at offset which is page aligned.
> >>> Vendor driver should decide whether to partition data section and how to
> >>> partition the data section. Vendor driver should return data_offset
> >>> accordingly.
> >>>
> >>> For user application, data is opaque. User should write data in the same
> >>> order as received.
> >>>
> >>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>> ---
> >>> include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
> >>> 1 file changed, 108 insertions(+)
> >>>
> >>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>> index 9e843a147ead..35b09427ad9f 100644
> >>> --- a/include/uapi/linux/vfio.h
> >>> +++ b/include/uapi/linux/vfio.h
> >>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
> >>> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
> >>> #define VFIO_REGION_TYPE_GFX (1)
> >>> #define VFIO_REGION_TYPE_CCW (2)
> >>> +#define VFIO_REGION_TYPE_MIGRATION (3)
> >>>
> >>> /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >>>
> >>> @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
> >>> /* sub-types for VFIO_REGION_TYPE_CCW */
> >>> #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD (1)
> >>>
> >>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> >>> +#define VFIO_REGION_SUBTYPE_MIGRATION (1)
> >>> +
> >>> +/*
> >>> + * Structure vfio_device_migration_info is placed at 0th offset of
> >>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >>> + * information. Field accesses from this structure are only supported at their
> >>> + * native width and alignment, otherwise the result is undefined and vendor
> >>> + * drivers should return an error.
> >>> + *
> >>> + * device_state: (read/write)
> >>> + * To indicate vendor driver the state VFIO device should be transitioned
> >>> + * to. If device state transition fails, write on this field return error.
> >>> + * It consists of 3 bits:
> >>> + * - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> >>
> >> Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
> >> ambiguous.
>
> Ok. Updating it.
>
> >>
> >>> + * _STOPPED state. When device is changed to _STOPPED, driver should stop
> >>> + * device before write() returns.
> >>> + * - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> >>> + * should start gathering device state information which will be provided
> >>> + * to VFIO user space application to save device's state.
> >>> + * - If bit 2 set, indicates _RESUMING state. When set, that indicates
> >>> + * prepare to resume device, data provided through migration region
> >>> + * should be used to resume device.
> >>> + * Bits 3 - 31 are reserved for future use. User should perform
> >>> + * read-modify-write operation on this field.
> >>> + * _SAVING and _RESUMING bits set at the same time is invalid state.
> >>> + * Similarly _RUNNING and _RESUMING bits set is invalid state.
> >>> + *
> >>> + * pending bytes: (read only)
> >>> + * Number of pending bytes yet to be migrated from vendor driver
> >>> + *
> >>> + * data_offset: (read only)
> >>> + * User application should read data_offset in migration region from where
> >>> + * user application should read device data during _SAVING state or write
> >>> + * device data during _RESUMING state. See below for detail of sequence to
> >>> + * be followed.
> >>> + *
> >>> + * data_size: (read/write)
> >>> + * User application should read data_size to get size of data copied in
> >>> + * bytes in migration region during _SAVING state and write size of data
> >>> + * copied in bytes in migration region during _RESUMING state.
> >>> + *
> >>> + * Migration region looks like:
> >>> + * ------------------------------------------------------------------
> >>> + * |vfio_device_migration_info| data section |
> >>> + * | | /////////////////////////////// |
> >>> + * ------------------------------------------------------------------
> >>> + * ^ ^
> >>> + * offset 0-trapped part data_offset
> >>> + *
> >>> + * Structure vfio_device_migration_info is always followed by data section in
> >>> + * the region, so data_offset will always be non-0. Offset from where data is
> >>> + * copied is decided by kernel driver, data section can be trapped or mapped
> >>> + * or partitioned, depending on how kernel driver defines data section.
> >>> + * Data section partition can be defined as mapped by sparse mmap capability.
> >>> + * If mmapped, then data_offset should be page aligned, where as initial section
> >>> + * which contain vfio_device_migration_info structure might not end at offset
> >>> + * which is page aligned.
> >>
> >> "The user is not required to to access via mmap regardless of the
> >> region mmap capabilities."
> >>
> > But once the user decides to access via mmap, it has to read data of
> > data_size each time, otherwise the vendor driver has no idea of how many
> > data are already read from user. Agree?
> >
>
> that's right.
>
> >>> + * Vendor driver should decide whether to partition data section and how to
> >>> + * partition the data section. Vendor driver should return data_offset
> >>> + * accordingly.
> >>> + *
> >>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >>> + * and for _SAVING device state or stop-and-copy phase:
> >>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> >>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> >>> + * Kernel driver should return this read operation only after writing data to
> >>> + * staging buffer is done.
> > May I know under what condition this data_offset will be changed per
> > each iteration from a-f ?
> >
>
> Its upto vendor driver, if vendor driver maintains multiple partitions
> in data section.
>
So, do you mean each time before doing b (reading data_offset), step a
(reading pending_bytes) is mandatory, otherwise the vendor driver does
not know which data_offset is?
Then, any lock to wrap step a and b to ensure atomic?
Thanks
Yan
> >>
> >> "staging buffer" implies a vendor driver implementation, perhaps we
> >> could just state that data is available from (region + data_offset) to
> >> (region + data_offset + data_size) upon return of this read operation.
> >>
>
> Makes sense. Updating it.
>
> >>> + * c. read data_size, amount of data in bytes written by vendor driver in
> >>> + * migration region.
> >>> + * d. read data_size bytes of data from data_offset in the migration region.
> >>> + * e. process data.
> >>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> >>> + * operation from migration region for previous iteration is done.
> >>
> >> I think this indicate that step (f) should be to read pending_bytes, the
> >> read sequence is not complete until this step. Optionally the user can
> >> then proceed to step (b). There are no read side-effects of (a) afaict.
> >>
>
> I tried to reword this sequence to be more specific:
>
> * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy
> * phase and for _SAVING device state or stop-and-copy phase:
> * a. read pending_bytes, indicates start of new iteration to get device
> * data. If there was previous iteration, then this read operation
> * indicates previous iteration is done. If pending_bytes > 0, go
> * through below steps.
> * b. read data_offset, indicates kernel driver to make data available
> * through data section. Kernel driver should return this read
> * operation only after data is available from (region + data_offset)
> * to (region + data_offset + data_size).
> * c. read data_size, amount of data in bytes available through migration
> * region.
> * d. read data of data_size bytes from (region + data_offset) from
> * migration region.
> * e. process data.
> * f. Loop through a to e.
>
> Hope this is more clear.
>
> >> Is the use required to reach pending_bytes == 0 before changing
> >> device_state, particularly transitioning to !_RUNNING?
>
> No, its not necessary to reach till pending_bytes==0 in pre-copy phase.
>
> >> Presumably the
> >> user can exit this sequence at any time by clearing _SAVING.
>
> In that case device state data is not complete, which will result in not
> able to resume device with that data.
> In stop-and-copy phase, user should iterate till pending_bytes is 0.
>
> >>
> >>> + *
> >>> + * Sequence to be followed while _RESUMING device state:
> >>> + * While data for this device is available, repeat below steps:
> >>> + * a. read data_offset from where user application should write data.
> > before proceed to step b, need to read data_size from vendor driver to determine
> > the max len of data to write. I think it's necessary in such a condition
> > that source vendor driver and target vendor driver do not offer data sections of
> > the same size. e.g. in source side, the data section is of size 100M,
> > but in target side, the data section is only of 50M size.
> > rather than simply fail, loop and write seems better.
> >
>
> Makes sense. Doing this change for next version.
>
> > Thanks
> > Yan
> >>> + * b. write data of data_size to migration region from data_offset.
> >>> + * c. write data_size which indicates vendor driver that data is written in
> >>> + * staging buffer. Vendor driver should read this data from migration
> >>> + * region and resume device's state.
> >>
> >> The device defaults to _RUNNING state, so a prerequisite is to set
> >> _RESUMING and clear _RUNNING, right?
>
> Yes.
>
> >>
> >>> + *
> >>> + * For user application, data is opaque. User should write data in the same
> >>> + * order as received.
> >>> + */
> >>> +
> >>> +struct vfio_device_migration_info {
> >>> + __u32 device_state; /* VFIO device state */
> >>> +#define VFIO_DEVICE_STATE_RUNNING (1 << 0)
> >>> +#define VFIO_DEVICE_STATE_SAVING (1 << 1)
> >>> +#define VFIO_DEVICE_STATE_RESUMING (1 << 2)
> >>> +#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \
> >>> + VFIO_DEVICE_STATE_SAVING | \
> >>> + VFIO_DEVICE_STATE_RESUMING)
> >>> +
> >>> +#define VFIO_DEVICE_STATE_INVALID_CASE1 (VFIO_DEVICE_STATE_SAVING | \
> >>> + VFIO_DEVICE_STATE_RESUMING)
> >>> +
> >>> +#define VFIO_DEVICE_STATE_INVALID_CASE2 (VFIO_DEVICE_STATE_RUNNING | \
> >>> + VFIO_DEVICE_STATE_RESUMING)
> >>
> >> These seem difficult to use, maybe we just need a
> >> VFIO_DEVICE_STATE_VALID macro?
> >>
> >> #define VFIO_DEVICE_STATE_VALID(state) \
> >> (state & VFIO_DEVICE_STATE_RESUMING ? \
> >> (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> >>
>
> This will not be work when use of other bits gets added in future.
> That's the reason I preferred to add individual invalid states which
> user should check.
>
> Thanks,
> Kirti
>
> >> Thanks,
> >> Alex
> >>
> >>> + __u32 reserved;
> >>> + __u64 pending_bytes;
> >>> + __u64 data_offset;
> >>> + __u64 data_size;
> >>> +} __attribute__((packed));
> >>> +
> >>> /*
> >>> * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >>> * which allows direct access to non-MSIX registers which happened to be within
> >>
next prev parent reply other threads:[~2019-11-14 0:48 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-11-12 17:03 [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices Kirti Wankhede
2019-11-12 17:03 ` [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
2019-11-13 3:23 ` Yan Zhao
2019-11-13 19:02 ` Kirti Wankhede
2019-11-14 0:36 ` Yan Zhao [this message]
2019-11-14 18:55 ` Kirti Wankhede
2019-11-13 10:24 ` Cornelia Huck
2019-11-13 18:27 ` Alex Williamson
2019-11-13 19:29 ` Kirti Wankhede
2019-11-13 19:48 ` Alex Williamson
2019-11-13 20:17 ` Kirti Wankhede
2019-11-13 20:40 ` Alex Williamson
2019-11-14 18:49 ` Kirti Wankhede
2019-11-12 17:03 ` [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
2019-11-13 19:37 ` Kirti Wankhede
2019-11-13 20:07 ` Alex Williamson
2019-11-14 18:56 ` Kirti Wankhede
2019-11-14 21:06 ` Alex Williamson
2019-11-15 2:40 ` Yan Zhao
2019-11-15 3:21 ` Alex Williamson
2019-11-15 5:10 ` Tian, Kevin
2019-11-19 23:16 ` Alex Williamson
2019-11-20 1:04 ` Tian, Kevin
2019-11-20 1:51 ` Yan Zhao
2019-11-26 0:57 ` Yan Zhao
2019-12-03 18:04 ` Alex Williamson
2019-12-04 18:10 ` Kirti Wankhede
2019-12-04 18:34 ` Alex Williamson
2019-12-05 1:28 ` Yan Zhao
2019-12-05 5:42 ` Kirti Wankhede
2019-12-05 5:47 ` Yan Zhao
2019-12-05 5:56 ` Alex Williamson
2019-12-05 6:19 ` Kirti Wankhede
2019-12-05 6:40 ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
2019-11-13 19:52 ` Kirti Wankhede
2019-11-13 20:22 ` Alex Williamson
2019-11-14 18:56 ` Kirti Wankhede
2019-11-14 21:08 ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 4/5] vfio iommu: Implementation of ioctl to get dirty pages bitmap Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 5/5] vfio iommu: Implementation of ioctl to get dirty bitmap before unmap Kirti Wankhede
2019-11-12 22:30 ` Alex Williamson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191114003615.GD18001@joy-OptiPlex-7040 \
--to=yan.y.zhao@intel.com \
--cc=Ken.Xue@amd.com \
--cc=Zhengxiao.zx@Alibaba-inc.com \
--cc=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=changpeng.liu@intel.com \
--cc=cjia@nvidia.com \
--cc=cohuck@redhat.com \
--cc=dgilbert@redhat.com \
--cc=eauger@redhat.com \
--cc=eskultet@redhat.com \
--cc=felipe@nutanix.com \
--cc=jonathan.davies@nutanix.com \
--cc=kevin.tian@intel.com \
--cc=kvm@vger.kernel.org \
--cc=kwankhede@nvidia.com \
--cc=mlevitsk@redhat.com \
--cc=pasic@linux.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=shuangtai.tst@alibaba-inc.com \
--cc=yi.l.liu@intel.com \
--cc=zhi.a.wang@intel.com \
--cc=ziye.yang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).