* [PATCH 0/5] QEMU VFIO live migration
@ 2019-02-19  8:50 ` Yan Zhao
  0 siblings, 0 replies; 133+ messages in thread
From: Yan Zhao @ 2019-02-19  8:50 UTC (permalink / raw)
  To: alex.williamson, qemu-devel
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, kwankhede, eauger,
	yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic, arei.gonglei,
	felipe, Ken.Xue, kevin.tian, Yan Zhao, dgilbert, intel-gvt-dev,
	changpeng.liu, cohuck, zhi.a.wang, jonathan.davies

This patchset enables VFIO devices to have live migration capability.
Currently it does not support post-copy phase.

It follows Alex's comments on the last version of the VFIO live migration
patches, including device states, the VFIO device state region layout, and
dirty bitmap queries.

Device Data
-----------
Device data is divided into three types: device memory, device config,
and system memory dirty pages produced by device.

Device config: data like MMIOs, page tables...
        Every device is supposed to possess device config data.
        Usually device config's size is small (no bigger than 10M), and it
        needs to be loaded in a certain strict order.
        Therefore, device config only needs to be saved/loaded in the
        stop-and-copy phase.
        The data of device config is held in device config region.
        Size of device config data is smaller than or equal to that of
        device config region.

Device Memory: device's internal memory, standalone and outside system
        memory. It is usually very big.
        This kind of data needs to be saved / loaded in pre-copy and
        stop-and-copy phase.
        The data of device memory is held in device memory region.
        Size of device memory is usually larger than that of the device
        memory region. qemu needs to save/load it in chunks of the size
        of the device memory region.
        Not all devices have device memory. For example, IGD only uses
        system memory.

System memory dirty pages: If a device produces dirty pages in system
        memory, it is able to report a dirty bitmap for a certain range of
        system memory. This dirty bitmap is queried in the pre-copy and
        stop-and-copy phases in the .log_sync callback. By setting the
        dirty bitmap in the .log_sync callback, dirty pages in system
        memory will be saved/loaded by ram's live migration code.
        The dirty bitmap of system memory is held in the dirty bitmap
        region. If the system memory range is larger than the dirty bitmap
        region can hold, qemu will cut it into several chunks and get the
        dirty bitmap in succession.
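The chunked save of device memory described above can be sketched as
follows. This is a minimal illustration, not the actual qemu code: the
two buffers stand in for the device memory region and the migration
stream, and save_device_memory is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy device memory to the migration stream in chunks no larger than
 * the device memory region.  Each iteration models one GET_BUFFER round
 * trip: qemu writes pos/action to the control region, the vendor driver
 * fills the device memory region, and qemu copies the chunk out. */
static size_t save_device_memory(const unsigned char *dev_mem, size_t total,
                                 size_t region_size, unsigned char *stream)
{
    size_t pos = 0;

    while (pos < total) {
        /* last chunk may be shorter than the region */
        size_t chunk = total - pos < region_size ? total - pos : region_size;

        memcpy(stream + pos, dev_mem + pos, chunk);
        pos += chunk;
    }
    return pos;
}
```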


Device State Regions
--------------------
Vendor driver is required to expose two mandatory regions and another two
optional regions if it plans to support device state management.

So, there are up to four regions in total.
One control region: mandatory.
        Accessed via read/write system calls.
        Its layout is defined in struct vfio_device_state_ctl.
Three data regions: mmaped into qemu.
        device config region: mandatory, holding data of device config
        device memory region: optional, holding data of device memory
        dirty bitmap region: optional, holding bitmap of system memory
                            dirty pages

(The reason why four separate regions are defined is that the unit of the
mmap system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
control and three mmaped regions for data seem better than one big region,
padded and sparse mmaped).
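The PAGE_SIZE constraint can be illustrated with a small sketch: a data
region must occupy whole pages to be mmaped. Here a temporary file stands
in for a real VFIO region fd, and map_data_region is an illustrative
helper, not part of the interface.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* mmap a data region of at least `want` bytes: the mapped length must be
 * rounded up to a whole number of pages, which is why the small control
 * structure is kept in a separate read/write region instead. */
static void *map_data_region(int fd, size_t *mapped_len, size_t want)
{
    long page = sysconf(_SC_PAGESIZE);
    /* round the requested length up to a multiple of the page size */
    size_t len = (want + page - 1) / page * page;
    void *p;

    if (ftruncate(fd, len) != 0)
        return NULL;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return NULL;
    *mapped_len = len;
    return p;
}
```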


kernel device state interface [1]
---------------------------------
#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

#define VFIO_DEVICE_STATE_RUNNING 0 
#define VFIO_DEVICE_STATE_STOP 1
#define VFIO_DEVICE_STATE_LOGGING 2

#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
#define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3

struct vfio_device_state_ctl {
	__u32 version;            /* ro */
	__u32 device_state;       /* VFIO device state, wo */
	__u32 caps;               /* ro */
	struct {
		__u32 action;     /* wo, GET_BUFFER or SET_BUFFER */
		__u64 size;       /* rw */
	} device_config;
	struct {
		__u32 action;     /* wo, GET_BUFFER or SET_BUFFER */
		__u64 size;       /* rw */
		__u64 pos;        /* rw, offset in total buffer of device memory */
	} device_memory;
	struct {
		__u64 start_addr; /* wo */
		__u64 page_nr;    /* wo */
	} system_memory;
};

Device States
-------------
After migration is initialized, qemu will set the device state by writing
to the device_state field of the control region.

Four states are defined for a VFIO device:
        RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 

RUNNING: In this state, a VFIO device is in active state ready to receive
        commands from device driver.
        It is the default state that a VFIO device enters initially.

STOP:  In this state, a VFIO device is deactivated and stops interacting
       with the device driver.

LOGGING: a special state that CANNOT exist independently. It must be
       set alongside state RUNNING or STOP (i.e. RUNNING & LOGGING,
       STOP & LOGGING).
       Qemu will turn the LOGGING state on in the .save_setup callback, so
       that the vendor driver can start dirty data logging for device
       memory and system memory.
       LOGGING only impacts device/system memory. They return a whole
       snapshot outside LOGGING and dirty data since the last get
       operation inside LOGGING.
       Device config should always be accessible and return a whole config
       snapshot regardless of the LOGGING state.
       
Note:
The reason why RUNNING is the default state is that a device's active
state must not depend on the device state interface.
It is possible that the vfio_device_state_ctl region fails to get
registered. In that case, a device needs to be in active state by default.
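Note that with the values above, STOP is bit 0 and LOGGING is bit 1, so
the four states are plain bitwise combinations. The helper names below
are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP    1
#define VFIO_DEVICE_STATE_LOGGING 2

/* RUNNING & LOGGING and STOP & LOGGING are the LOGGING bit or'ed in */
static uint32_t state_with_logging(uint32_t base)
{
    return base | VFIO_DEVICE_STATE_LOGGING;
}

static int state_is_logging(uint32_t state)
{
    return (state & VFIO_DEVICE_STATE_LOGGING) != 0;
}

static int state_is_stopped(uint32_t state)
{
    return (state & VFIO_DEVICE_STATE_STOP) != 0;
}
```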

Get Version & Get Caps
----------------------
In the migration init phase, qemu will probe for the existence of the
vendor driver's device state regions, then get the version of the device
state interface from the r/w control region.

Then it will probe VFIO device's data capability by reading caps field of
control region.
        #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
        #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
        device memory in the pre-copy and stop-and-copy phases. The data
        of device memory is held in the device memory region.
If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
        produced by the VFIO device during the pre-copy and stop-and-copy
        phases. The dirty bitmap of system memory is held in the dirty
        bitmap region.

If qemu fails to find the two mandatory regions or the optional data
regions corresponding to the data caps, or if the version mismatches, it
will set up a migration blocker and disable live migration for the VFIO
device.
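The probe-and-block decision can be sketched as below. The function name
need_migration_blocker and its have_* flags are hypothetical; real code
would first read version and caps from the control region.

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

/* Decide whether live migration must be blocked, given which regions
 * were found and the version/caps probed from the control region. */
static int need_migration_blocker(uint32_t version, uint32_t caps,
                                  int have_ctl, int have_config,
                                  int have_memory, int have_bitmap)
{
    if (!have_ctl || !have_config)       /* the two mandatory regions */
        return 1;
    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION)
        return 1;
    /* each advertised cap requires its matching optional data region */
    if ((caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY) && !have_memory)
        return 1;
    if ((caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY) && !have_bitmap)
        return 1;
    return 0;
}
```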


Flows to call device state interface for VFIO live migration
------------------------------------------------------------

Live migration save path:

(QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)

MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
 |
MIGRATION_STATUS_SAVE_SETUP
 |
 .save_setup callback -->
 get device memory size (whole snapshot size)
 get device memory buffer (whole snapshot data)
 set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
 |
MIGRATION_STATUS_ACTIVE
 |
 .save_live_pending callback --> get device memory size (dirty data)
 .save_live_iteration callback --> get device memory buffer (dirty data)
 .log_sync callback --> get system memory dirty bitmap
 |
(vcpu stops) --> set device state -->
 VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
 |
.save_live_complete_precopy callback -->
 get device memory size (dirty data)
 get device memory buffer (dirty data)
 get device config size (whole snapshot size)
 get device config buffer (whole snapshot data)
 |
.save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
MIGRATION_STATUS_COMPLETED

MIGRATION_STATUS_CANCELLED or
MIGRATION_STATUS_FAILED
 |
(vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING


Live migration load path:

(QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)

MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
 |
(vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
 |
MIGRATION_STATUS_ACTIVE
 |
.load_state callback -->
 set device memory size, set device memory buffer, set device config size,
 set device config buffer
 |
(vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
 |
MIGRATION_STATUS_COMPLETED



On the source VM side, in the precopy phase,
if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
qemu will first get a whole snapshot of device memory in the .save_setup
callback, and then it will get the total size of dirty data in device
memory in the .save_live_pending callback by reading the
device_memory.size field of the control region.
Then in the .save_live_iteration callback, it will get the buffer of
device memory's dirty data chunk by chunk from the device memory region,
by writing pos & action (GET_BUFFER) to the device_memory.pos &
device_memory.action fields of the control region. (The size of each
chunk is the size of the device memory data region.)
.save_live_pending and .save_live_iteration may be called several times
in the precopy phase to get dirty data in device memory.

If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in the precopy
phase like .save_setup, .save_live_pending, .save_live_iteration will not
call the vendor driver's device state interface to get data from device
memory.

In the precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY
on, the .log_sync callback will get the system memory dirty bitmap from
the dirty bitmap region by writing system memory's start address, page
count and action (GET_BITMAP) to the "system_memory.start_addr",
"system_memory.page_nr", and "system_memory.action" fields of the control
region.
If the page count passed in the .log_sync callback is larger than the
bitmap size the dirty bitmap region supports, qemu will cut it into
chunks and call the vendor driver's get system memory dirty bitmap
interface in succession.
If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
returns without calling the vendor driver.
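The chunked bitmap query can be sketched as follows. split_bitmap_query
is an illustrative helper, assuming one bit per page in the bitmap
region; it only computes the per-chunk (start_addr, page_nr) pairs that
would be written to the control region.

```c
#include <assert.h>
#include <stdint.h>

/* Split a .log_sync request of page_nr pages starting at start_addr
 * into chunks the dirty bitmap region can hold (one bit per page, so
 * bitmap_region_size bytes cover bitmap_region_size * 8 pages).
 * Returns the number of chunks, i.e. GET_BITMAP round trips. */
static unsigned split_bitmap_query(uint64_t start_addr, uint64_t page_nr,
                                   uint64_t bitmap_region_size,
                                   uint64_t page_size,
                                   uint64_t out_addr[], uint64_t out_pages[],
                                   unsigned max_chunks)
{
    uint64_t pages_per_chunk = bitmap_region_size * 8; /* 8 bits per byte */
    unsigned chunks = 0;

    while (page_nr > 0 && chunks < max_chunks) {
        uint64_t n = page_nr < pages_per_chunk ? page_nr : pages_per_chunk;

        out_addr[chunks] = start_addr;
        out_pages[chunks] = n;
        start_addr += n * page_size;
        page_nr -= n;
        chunks++;
    }
    return chunks;
}
```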

In the stop-and-copy phase, the device state will be set to
STOP & LOGGING first.
In the .save_live_complete_precopy callback,
if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
get device memory size and get device memory buffer will be called again.
After that,
device config data is read from the device config region by reading
device_config.size of the control region and writing action (GET_BUFFER)
to device_config.action of the control region.
Then after migration completes, in the cleanup handler, the LOGGING state
will be cleared (i.e. the device state is set to STOP).
Clearing the LOGGING state in the cleanup handler covers the cases of
"migration failed" and "migration cancelled", which can also leverage the
cleanup handler to unset the LOGGING state.


References
----------
1. kernel side implementation of Device state interfaces:
https://patchwork.freedesktop.org/series/56876/


Yan Zhao (5):
  vfio/migration: define kernel interfaces
  vfio/migration: support device of device config capability
  vfio/migration: tracking of dirty page in system memory
  vfio/migration: turn on migration
  vfio/migration: support device memory capability

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              |  26 ++
 hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 |  10 +-
 hw/vfio/pci.h                 |  26 +-
 include/hw/vfio/vfio-common.h |   1 +
 linux-headers/linux/vfio.h    | 260 +++++++++++++
 7 files changed, 1174 insertions(+), 9 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.4


* [PATCH 1/5] vfio/migration: define kernel interfaces
  2019-02-19  8:50 ` [Qemu-devel] " Yan Zhao
@ 2019-02-19  8:52   ` Yan Zhao
  -1 siblings, 0 replies; 133+ messages in thread
From: Yan Zhao @ 2019-02-19  8:52 UTC (permalink / raw)
  To: alex.williamson, qemu-devel
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, kwankhede, eauger,
	yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic, arei.gonglei,
	felipe, Ken.Xue, kevin.tian, Yan Zhao, dgilbert, intel-gvt-dev,
	changpeng.liu, cohuck, zhi.a.wang, jonathan.davies

- defined 4 device state regions: one control region and 3 data regions
- defined the layout of the control region in struct vfio_device_state_ctl
- defined 4 device states: running, stop, running & logging, stop & logging
- defined 3 device data categories: device config, device memory, system
  memory
- defined 2 device data capabilities: device memory and system memory
- defined the device state interfaces' version and 12 device state
  interfaces

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
---
 linux-headers/linux/vfio.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 260 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index ceb6453..a124fc1 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -303,6 +303,56 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
 
+/* Device State region type and sub-type
+ *
+ * A VFIO device driver needs to register up to four device state regions in
+ * total: two mandatory and another two optional, if it plans to support device
+ * state management.
+ *
+ * 1. region CTL :
+ *          Mandatory.
+ *          This is a control region.
+ *          Its layout is defined in struct vfio_device_state_ctl.
+ *          Reading from this region can get version, capabilities and data
+ *          size of device state interfaces.
+ *          Writing to this region can set device state, data size and
+ *          choose which interface to use.
+ * 2. region DEVICE_CONFIG
+ *          Mandatory.
+ *          This is a data region that holds device config data.
+ *          Device config is such kind of data like MMIOs, page tables...
+ *          Every device is supposed to possess device config data.
 + *          Usually the size of device config data is small (no bigger
+ *          than 10M), and it needs to be loaded in certain strict
+ *          order.
+ *          Therefore no dirty data logging is enabled for device
+ *          config and it must be got/set as a whole.
+ *          Size of device config data is smaller than or equal to that of
+ *          device config region.
+ *          It is able to be mmaped into user space.
+ * 3. region DEVICE_MEMORY
+ *          Optional.
+ *          This is a data region that holds device memory data.
+ *          Device memory is device's internal memory, standalone and outside
+ *          system memory.  It is usually very big.
 + *          Not all devices have device memory. For example, IGD only uses
+ *          memory and has no device memory.
 + *          Size of device memory is usually larger than that of device
+ *          memory region. qemu needs to save/load it in chunks of size of
+ *          device memory region.
+ *          It is able to be mmaped into user space.
+ * 4. region DIRTY_BITMAP
+ *          Optional.
+ *          This is a data region that holds bitmap of dirty pages in system
 + *          memory that a VFIO device produces.
+ *          It is able to be mmaped into user space.
+ */
+#define VFIO_REGION_TYPE_DEVICE_STATE           (1 << 1)
+#define VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL       (1)
+#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG      (2)
+#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY      (3)
+#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP (4)
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
@@ -816,6 +866,216 @@ struct vfio_iommu_spapr_tce_remove {
 };
 #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
 
+/* version number of the device state interface */
+#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
+
+/*
 + * For devices that have device memory, it is required to expose
+ * DEVICE_MEMORY capability.
+ *
+ * For devices producing dirty pages in system memory, it is required to
+ * expose cap SYSTEM_MEMORY in order to get dirty bitmap in certain range
+ * of system memory.
+ */
+#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
+#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
+
+/*
+ * DEVICE STATES
+ *
+ * Four states are defined for a VFIO device:
+ * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
+ * They can be set by writing to device_state field of
+ * vfio_device_state_ctl region.
+ *
+ * RUNNING: In this state, a VFIO device is in active state ready to
+ * receive commands from device driver.
+ * It is the default state that a VFIO device enters initially.
+ *
+ * STOP: In this state, a VFIO device is deactivated to interact with
+ * device driver.
+ *
 + * LOGGING state is a special state that CANNOT exist
+ * independently.
+ * It must be set alongside with state RUNNING or STOP, i.e,
+ * RUNNING & LOGGING, STOP & LOGGING.
+ * It is used for dirty data logging both for device memory
+ * and system memory.
+ *
+ * LOGGING only impacts device/system memory. In LOGGING state, get buffer
+ * of device memory returns dirty pages since last call; outside LOGGING
+ * state, get buffer of device memory returns whole snapshot of device
+ * memory. system memory's dirty page is only available in LOGGING state.
+ *
+ * Device config should be always accessible and return whole config snapshot
+ * regardless of LOGGING state.
+ * */
+#define VFIO_DEVICE_STATE_RUNNING 0
+#define VFIO_DEVICE_STATE_STOP 1
+#define VFIO_DEVICE_STATE_LOGGING 2
+
+/* action to get data from device memory or device config
+ * the action is write to device state's control region, and data is read
+ * from device memory region or device config region.
+ * Each time before read device memory region or device config region,
+ * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
 + * field in control region. That is because device memory and device config
+ * region is mmaped into user space. vendor driver has to be notified of
 + * the GET_BUFFER action in advance.
+ */
+#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
+
+/* action to set data to device memory or device config
+ * the action is write to device state's control region, and data is
+ * written to device memory region or device config region.
+ * Each time after write to device memory region or device config region,
 + * action VFIO_DEVICE_DATA_ACTION_SET_BUFFER is required to write to action
 + * field in control region. That is because device memory and device config
+ * region is mmaped into user space. vendor driver has to be notified of
 + * the SET_BUFFER action after data written.
+ */
+#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
+
+/* layout of device state interfaces' control region
 + * By reading/writing the control region and reading/writing data from device
+ * region, device memory region, system memory regions, below interface can
+ * be implemented:
+ *
+ * 1. get version
+ *   (1) user space calls read system call on "version" field of control
+ *   region.
+ *   (2) vendor driver writes version number of device state interfaces
+ *   to the "version" field of control region.
+ *
+ * 2. get caps
+ *   (1) user space calls read system call on "caps" field of control region.
+ *   (2) if a VFIO device has huge device memory, vendor driver reports
+ *      VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY in "caps" field of control region.
+ *      if a VFIO device produces dirty pages in system memory, vendor driver
+ *      reports VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY in "caps" field of
+ *      control region.
+ *
+ * 3. set device state
+ *    (1) user space calls write system call on "device_state" field of
+ *    control region.
+ *    (2) device state transitions as:
+ *
+ *    RUNNING -- start dirty data logging --> RUNNING & LOGGING
+ *    RUNNING -- deactivate --> STOP
+ *    RUNNING -- deactivate & start dirty data longging --> STOP & LOGGING
+ *    RUNNING & LOGGING -- stop dirty data logging --> RUNNING
+ *    RUNNING & LOGGING -- deactivate --> STOP & LOGGING
+ *    RUNNING & LOGGING -- deactivate & stop dirty data logging --> STOP
+ *    STOP -- activate --> RUNNING
+ *    STOP -- start dirty data logging --> STOP & LOGGING
+ *    STOP -- activate & start dirty data logging --> RUNNING & LOGGING
+ *    STOP & LOGGING -- stop dirty data logging --> STOP
+ *    STOP & LOGGING -- activate --> RUNNING & LOGGING
+ *    STOP & LOGGING -- activate & stop dirty data logging --> RUNNING
+ *
+ * 4. get device config size
+ *   (1) user space calls read system call on "device_config.size" field of
+ *       control region for the total size of device config snapshot.
+ *   (2) vendor driver writes device config data's total size in
+ *       "device_config.size" field of control region.
+ *
+ * 5. set device config size
+ *   (1) user space calls write system call.
+ *       total size of device config snapshot --> "device_config.size" field
+ *       of control region.
+ *   (2) vendor driver reads device config data's total size from
+ *       "device_config.size" field of control region.
+ *
+ * 6 get device config buffer
+ *   (1) user space calls write system call.
+ *       "GET_BUFFER" --> "device_config.action" field of control region.
+ *   (2) vendor driver
+ *       a. gets whole snapshot for device config
+ *       b. writes whole device config snapshot to region
+ *       DEVICE_CONFIG.
+ *   (3) user space reads the whole of device config snapshot from region
+ *       DEVICE_CONFIG.
+ *
+ * 7. set device config buffer
+ *   (1) user space writes whole of device config data to region
+ *       DEVICE_CONFIG.
+ *   (2) user space calls write system call.
+ *       "SET_BUFFER" --> "device_config.action" field of control region.
+ *   (3) vendor driver loads whole of device config from region DEVICE_CONFIG.
+ *
+ * 8. get device memory size
+ *   (1) user space calls read system call on "device_memory.size" field of
+ *       control region for device memory size.
+ *   (2) vendor driver
+ *       a. gets device memory snapshot (in state RUNNING or STOP), or
+ *          gets device memory dirty data (in state RUNNING & LOGGING or
+ *          state STOP & LOGGING)
+ *       b. writes size in "device_memory.size" field of control region
+ *
+ * 9. set device memory size
+ *   (1) user space calls write system call on "device_memory.size" field of
+ *       control region to set total size of device memory snapshot.
+ *   (2) vendor driver reads device memory's size from "device_memory.size"
+ *       field of control region.
+ *
+ *
+ * 10. get device memory buffer
+ *   (1) user space calls write system.
+ *       pos --> "device_memory.pos" field of control region,
+ *       "GET_BUFFER" --> "device_memory.action" field of control region.
+ *       (pos must be 0 or multiples of length of region DEVICE_MEMORY).
+ *   (2) vendor driver writes N'th chunk of device memory snapshot/dirty data
+ *       to region DEVICE_MEMORY.
+ *       (N equals to pos/(region length of DEVICE_MEMORY))
+ *   (3) user space reads the N'th chunk of device memory snapshot/dirty data
+ *       from region DEVICE_MEMORY.
+ *
+ * 11. set device memory buffer
+ *   (1) user space writes N'th chunk of device memory snapshot/dirty data to
+ *       region DEVICE_MEMORY.
+ *       (N equals to pos/(region length of DEVICE_MEMORY))
+ *   (2) user space writes pos to "device_memory.pos" field and writes
+ *       "SET_BUFFER" to "device_memory.action" field of control region.
+ *   (3) vendor driver loads N'th chunk of device memory snapshot/dirty data
+ *       from region DEVICE_MEMORY.
+ *
+ * 12. get system memory dirty bitmap
+ *   (1) user space calls write system call to specify a range of system
+ *       memory that querying dirty pages.
+ *       system memory's start address --> "system_memory.start_addr" field
+ *       of control region,
+ *       system memory's page count --> "system_memory.page_nr" field of
+ *       control region.
+ *   (2) if device state is not in RUNNING or STOP & LOGGING,
+ *       vendor driver returns empty bitmap; otherwise,
+ *       vendor driver checks the page_nr,
+ *       if it's larger than the size that region DIRTY_BITMAP can support,
+ *       error returns; if not,
+ *       vendor driver returns as bitmap to specify dirty pages that
+ *       device produces since last query in this range of system memory .
+ *   (3) usespace reads back the dirty bitmap from region DIRTY_BITMAP.
+ *
+ */
+
+struct vfio_device_state_ctl {
+	__u32 version;		  /* ro versio of devcie state interfaces*/
+	__u32 device_state;       /* VFIO device state, wo */
+	__u32 caps;		 /* ro */
+        struct {
+		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
+		__u64 size;    /*rw, total size of device config*/
+	} device_config;
+	struct {
+		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
+		__u64 size;     /* rw, total size of device memory*/
+        __u64 pos;/*chunk offset in total buffer of device memory*/
+	} device_memory;
+	struct {
+		__u64 start_addr; /* wo */
+		__u64 page_nr;   /* wo */
+	} system_memory;
+}__attribute__((packed));
+
 /* ***************************************************************** */
 
 #endif /* VFIO_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [Qemu-devel] [PATCH 1/5] vfio/migration: define kernel interfaces
@ 2019-02-19  8:52   ` Yan Zhao
  0 siblings, 0 replies; 133+ messages in thread
From: Yan Zhao @ 2019-02-19  8:52 UTC (permalink / raw)
  To: alex.williamson, qemu-devel
  Cc: intel-gvt-dev, Zhengxiao.zx, yi.l.liu, eskultet, ziye.yang,
	cohuck, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk, pasic,
	aik, eauger, felipe, jonathan.davies, changpeng.liu, Ken.Xue,
	kwankhede, kevin.tian, cjia, arei.gonglei, kvm, Yan Zhao

- defined 4 device states regions: one control region and 3 data regions
- defined layout of control region in struct vfio_device_state_ctl
- defined 4 device states: running, stop, running&logging, stop&logging
- define 3 device data categories: device config, device memory, system
  memory
- defined 2 device data capabilities: device memory and system memory
- defined device state interfaces' version and 12 device state interfaces

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
---
 linux-headers/linux/vfio.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 260 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index ceb6453..a124fc1 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -303,6 +303,56 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
 
+/* Device State region type and sub-type
+ *
+ * A VFIO device driver that plans to support device state management
+ * needs to register up to four device state regions: two mandatory
+ * and two optional.
+ *
+ * 1. region CTL :
+ *          Mandatory.
+ *          This is a control region.
+ *          Its layout is defined in struct vfio_device_state_ctl.
+ *          Reading from this region returns the version, capabilities,
+ *          and data sizes of the device state interfaces.
+ *          Writing to this region sets the device state and data sizes
+ *          and selects which interface to use.
+ * 2. region DEVICE_CONFIG
+ *          Mandatory.
+ *          This is a data region that holds device config data.
+ *          Device config is data like MMIOs, page tables, etc.
+ *          Every device is supposed to possess device config data.
+ *          Usually the size of device config data is small (no bigger
+ *          than 10MB), and it needs to be loaded in a certain strict
+ *          order.
+ *          Therefore no dirty data logging is enabled for device
+ *          config and it must be got/set as a whole.
+ *          The size of device config data is smaller than or equal to
+ *          that of the device config region.
+ *          It can be mmapped into user space.
+ * 3. region DEVICE_MEMORY
+ *          Optional.
+ *          This is a data region that holds device memory data.
+ *          Device memory is a device's internal memory, standalone and
+ *          outside system memory. It is usually very big.
+ *          Not all devices have device memory; IGD, for example, only
+ *          uses system memory and has no device memory.
+ *          The size of device memory is usually larger than that of the
+ *          device memory region. QEMU needs to save/load it in chunks
+ *          of the device memory region's size.
+ *          It can be mmapped into user space.
+ * 4. region DIRTY_BITMAP
+ *          Optional.
+ *          This is a data region that holds the bitmap of dirty pages
+ *          in system memory that a VFIO device produces.
+ *          It can be mmapped into user space.
+ */
+#define VFIO_REGION_TYPE_DEVICE_STATE           (1 << 1)
+#define VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL       (1)
+#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG      (2)
+#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY      (3)
+#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP (4)
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
@@ -816,6 +866,216 @@ struct vfio_iommu_spapr_tce_remove {
 };
 #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
 
+/* version number of the device state interface */
+#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
+
+/*
+ * For devices that have device memory, it is required to expose the
+ * DEVICE_MEMORY capability.
+ *
+ * For devices producing dirty pages in system memory, it is required to
+ * expose the SYSTEM_MEMORY capability in order to get the dirty bitmap
+ * for a given range of system memory.
+ */
+#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
+#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
+
+/*
+ * DEVICE STATES
+ *
+ * Four states are defined for a VFIO device:
+ * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
+ * They can be set by writing to device_state field of
+ * vfio_device_state_ctl region.
+ *
+ * RUNNING: In this state, a VFIO device is active and ready to
+ * receive commands from the device driver.
+ * It is the default state that a VFIO device enters initially.
+ *
+ * STOP: In this state, a VFIO device is deactivated and does not
+ * interact with the device driver.
+ *
+ * LOGGING is a special state that CANNOT exist independently.
+ * It must be set alongside state RUNNING or STOP, i.e.,
+ * RUNNING & LOGGING, STOP & LOGGING.
+ * It is used for dirty data logging both for device memory
+ * and system memory.
+ *
+ * LOGGING only impacts device/system memory. In the LOGGING state,
+ * getting the device memory buffer returns dirty pages since the last
+ * call; outside the LOGGING state, getting the device memory buffer
+ * returns a whole snapshot of device memory. System memory dirty pages
+ * are only available in the LOGGING state.
+ *
+ * Device config should always be accessible and return a whole config
+ * snapshot regardless of the LOGGING state.
+ */
+#define VFIO_DEVICE_STATE_RUNNING 0
+#define VFIO_DEVICE_STATE_STOP 1
+#define VFIO_DEVICE_STATE_LOGGING 2
+
+/* Action to get data from device memory or device config.
+ * The action is written to the device state control region, and data is
+ * then read from the device memory region or device config region.
+ * Each time before reading the device memory region or device config
+ * region, action VFIO_DEVICE_DATA_ACTION_GET_BUFFER must be written to
+ * the action field in the control region. That is because the device
+ * memory and device config regions are mmapped into user space; the
+ * vendor driver has to be notified of the GET_BUFFER action in advance.
+ */
+#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
+
+/* Action to set data to device memory or device config.
+ * The action is written to the device state control region after data
+ * has been written to the device memory region or device config region.
+ * Each time after writing to the device memory region or device config
+ * region, action VFIO_DEVICE_DATA_ACTION_SET_BUFFER must be written to
+ * the action field in the control region. That is because the device
+ * memory and device config regions are mmapped into user space; the
+ * vendor driver has to be notified of the SET_BUFFER action after the
+ * data is written.
+ */
+#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
+
+/* Layout of the device state interfaces' control region.
+ * By reading/writing the control region and reading/writing data in the
+ * device config, device memory, and dirty bitmap regions, the
+ * interfaces below can be implemented:
+ *
+ * 1. get version
+ *   (1) user space calls read system call on "version" field of control
+ *   region.
+ *   (2) vendor driver writes version number of device state interfaces
+ *   to the "version" field of control region.
+ *
+ * 2. get caps
+ *   (1) user space calls read system call on "caps" field of control region.
+ *   (2) if a VFIO device has huge device memory, vendor driver reports
+ *      VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY in "caps" field of control region.
+ *      if a VFIO device produces dirty pages in system memory, vendor driver
+ *      reports VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY in "caps" field of
+ *      control region.
+ *
+ * 3. set device state
+ *    (1) user space calls write system call on "device_state" field of
+ *    control region.
+ *    (2) device state transitions as:
+ *
+ *    RUNNING -- start dirty data logging --> RUNNING & LOGGING
+ *    RUNNING -- deactivate --> STOP
+ *    RUNNING -- deactivate & start dirty data logging --> STOP & LOGGING
+ *    RUNNING & LOGGING -- stop dirty data logging --> RUNNING
+ *    RUNNING & LOGGING -- deactivate --> STOP & LOGGING
+ *    RUNNING & LOGGING -- deactivate & stop dirty data logging --> STOP
+ *    STOP -- activate --> RUNNING
+ *    STOP -- start dirty data logging --> STOP & LOGGING
+ *    STOP -- activate & start dirty data logging --> RUNNING & LOGGING
+ *    STOP & LOGGING -- stop dirty data logging --> STOP
+ *    STOP & LOGGING -- activate --> RUNNING & LOGGING
+ *    STOP & LOGGING -- activate & stop dirty data logging --> RUNNING
+ *
+ * 4. get device config size
+ *   (1) user space calls read system call on "device_config.size" field of
+ *       control region for the total size of device config snapshot.
+ *   (2) vendor driver writes device config data's total size in
+ *       "device_config.size" field of control region.
+ *
+ * 5. set device config size
+ *   (1) user space calls write system call.
+ *       total size of device config snapshot --> "device_config.size" field
+ *       of control region.
+ *   (2) vendor driver reads device config data's total size from
+ *       "device_config.size" field of control region.
+ *
+ * 6. get device config buffer
+ *   (1) user space calls write system call.
+ *       "GET_BUFFER" --> "device_config.action" field of control region.
+ *   (2) vendor driver
+ *       a. gets whole snapshot for device config
+ *       b. writes whole device config snapshot to region
+ *       DEVICE_CONFIG.
+ *   (3) user space reads the whole of device config snapshot from region
+ *       DEVICE_CONFIG.
+ *
+ * 7. set device config buffer
+ *   (1) user space writes whole of device config data to region
+ *       DEVICE_CONFIG.
+ *   (2) user space calls write system call.
+ *       "SET_BUFFER" --> "device_config.action" field of control region.
+ *   (3) vendor driver loads whole of device config from region DEVICE_CONFIG.
+ *
+ * 8. get device memory size
+ *   (1) user space calls read system call on "device_memory.size" field of
+ *       control region for device memory size.
+ *   (2) vendor driver
+ *       a. gets device memory snapshot (in state RUNNING or STOP), or
+ *          gets device memory dirty data (in state RUNNING & LOGGING or
+ *          state STOP & LOGGING)
+ *       b. writes size in "device_memory.size" field of control region
+ *
+ * 9. set device memory size
+ *   (1) user space calls write system call on "device_memory.size" field of
+ *       control region to set total size of device memory snapshot.
+ *   (2) vendor driver reads device memory's size from "device_memory.size"
+ *       field of control region.
+ *
+ * 10. get device memory buffer
+ *   (1) user space calls write system call.
+ *       pos --> "device_memory.pos" field of control region,
+ *       "GET_BUFFER" --> "device_memory.action" field of control region.
+ *       (pos must be 0 or a multiple of the length of region
+ *       DEVICE_MEMORY).
+ *   (2) vendor driver writes N'th chunk of device memory snapshot/dirty data
+ *       to region DEVICE_MEMORY.
+ *       (N equals to pos/(region length of DEVICE_MEMORY))
+ *   (3) user space reads the N'th chunk of device memory snapshot/dirty data
+ *       from region DEVICE_MEMORY.
+ *
+ * 11. set device memory buffer
+ *   (1) user space writes N'th chunk of device memory snapshot/dirty data to
+ *       region DEVICE_MEMORY.
+ *       (N equals to pos/(region length of DEVICE_MEMORY))
+ *   (2) user space writes pos to "device_memory.pos" field and writes
+ *       "SET_BUFFER" to "device_memory.action" field of control region.
+ *   (3) vendor driver loads N'th chunk of device memory snapshot/dirty data
+ *       from region DEVICE_MEMORY.
+ *
+ * 12. get system memory dirty bitmap
+ *   (1) user space calls write system call to specify a range of system
+ *       memory in which to query dirty pages.
+ *       system memory's start address --> "system_memory.start_addr"
+ *       field of control region,
+ *       system memory's page count --> "system_memory.page_nr" field of
+ *       control region.
+ *   (2) if the device state is not in a LOGGING state (RUNNING & LOGGING
+ *       or STOP & LOGGING), the vendor driver returns an empty bitmap;
+ *       otherwise, the vendor driver checks page_nr, and if it is larger
+ *       than the size that region DIRTY_BITMAP can support, an error is
+ *       returned; if not, the vendor driver returns a bitmap specifying
+ *       the dirty pages that the device has produced in this range of
+ *       system memory since the last query.
+ *   (3) user space reads back the dirty bitmap from region DIRTY_BITMAP.
+ *
+ */
+
+struct vfio_device_state_ctl {
+	__u32 version;		/* ro, version of device state interfaces */
+	__u32 device_state;	/* wo, VFIO device state */
+	__u32 caps;		/* ro */
+	struct {
+		__u32 action;	/* wo, GET_BUFFER or SET_BUFFER */
+		__u64 size;	/* rw, total size of device config */
+	} device_config;
+	struct {
+		__u32 action;	/* wo, GET_BUFFER or SET_BUFFER */
+		__u64 size;	/* rw, total size of device memory */
+		__u64 pos;	/* chunk offset in total buffer of device memory */
+	} device_memory;
+	struct {
+		__u64 start_addr;	/* wo */
+		__u64 page_nr;		/* wo */
+	} system_memory;
+} __attribute__((packed));
+
 /* ***************************************************************** */
 
 #endif /* VFIO_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 2/5] vfio/migration: support device of device config capability
  2019-02-19  8:50 ` [Qemu-devel] " Yan Zhao
@ 2019-02-19  8:52   ` Yan Zhao
  -1 siblings, 0 replies; 133+ messages in thread
From: Yan Zhao @ 2019-02-19  8:52 UTC (permalink / raw)
  To: alex.williamson, qemu-devel
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, kwankhede, eauger,
	yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic, arei.gonglei,
	felipe, Ken.Xue, kevin.tian, Yan Zhao, dgilbert, intel-gvt-dev,
	changpeng.liu, cohuck, zhi.a.wang, jonathan.davies

Device config is the default data that every device should have, so the
device config capability is on by default and need not be set.

- Currently two types of resources are saved/loaded for a device with the
  device config capability:
  general PCI config data, and device config data.
  They are copied as a whole when pre-copy is stopped.

Migration setup flow:
- Set up device state regions and check the device state version and
  capabilities. Mmap the Device Config Region and Dirty Bitmap Region,
  if available.
- If the device state regions fail to be set up, a migration blocker is
  registered instead.
- Add SaveVMHandlers to register device state save/load handlers.
- Register a VM state change handler to set the device's running/stop
  states.
- On migration startup on the source machine, set the device's state to
  VFIO_DEVICE_STATE_LOGGING.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 |   1 -
 hw/vfio/pci.h                 |  25 +-
 include/hw/vfio/vfio-common.h |   1 +
 5 files changed, 659 insertions(+), 3 deletions(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index 8b3f664..f32ff19 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
-obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
+obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 0000000..16d6395
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,633 @@
+#include "qemu/osdep.h"
+
+#include "hw/vfio/vfio-common.h"
+#include "migration/blocker.h"
+#include "migration/register.h"
+#include "qapi/error.h"
+#include "pci.h"
+#include "sysemu/kvm.h"
+#include "exec/ram_addr.h"
+
+#define VFIO_SAVE_FLAG_SETUP 0
+#define VFIO_SAVE_FLAG_PCI 1
+#define VFIO_SAVE_FLAG_DEVCONFIG 2
+#define VFIO_SAVE_FLAG_DEVMEMORY 4
+#define VFIO_SAVE_FLAG_CONTINUE 8
+
+static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
+        VFIORegion *region, uint32_t subtype, const char *name)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_region_info *info;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
+            subtype, &info);
+    if (ret) {
+        error_report("Failed to get info of region %s", name);
+        return ret;
+    }
+
+    ret = vfio_region_setup(OBJECT(vdev), vbasedev,
+            region, info->index, name);
+    if (ret) {
+        error_report("Failed to set up migration region %s", name);
+        return ret;
+    }
+
+    if (vfio_region_mmap(region)) {
+        error_report("Failed to mmap migration region %s", name);
+    }
+
+    return 0;
+}
+
+bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
+{
+   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
+}
+
+bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
+{
+   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
+}
+
+static bool vfio_device_state_region_mmaped(VFIORegion *region)
+{
+    bool mmaped = true;
+    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
+            (region->size != region->mmaps[0].size) ||
+            (region->mmaps[0].mmap == NULL)) {
+        mmaped = false;
+    }
+
+    return mmaped;
+}
+
+static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_config =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
+    uint64_t len;
+    int sz;
+
+    sz = sizeof(len);
+    if (pread(vbasedev->fd, &len, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_config.size))
+            != sz) {
+        error_report("vfio: Failed to get length of device config");
+        return -1;
+    }
+    if (len > region_config->size) {
+        error_report("vfio: Invalid device config length");
+        return -1;
+    }
+    vdev->migration->devconfig_size = len;
+
+    return 0;
+}
+
+static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_config =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
+    int sz;
+
+    if (size > region_config->size) {
+        return -1;
+    }
+
+    sz = sizeof(size);
+    if (pwrite(vbasedev->fd, &size, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_config.size))
+            != sz) {
+        error_report("vfio: Failed to set length of device config");
+        return -1;
+    }
+    vdev->migration->devconfig_size = size;
+    return 0;
+}
+
+static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_config =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
+    void *dest;
+    uint32_t sz;
+    uint8_t *buf = NULL;
+    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
+    uint64_t len = vdev->migration->devconfig_size;
+
+    qemu_put_be64(f, len);
+
+    sz = sizeof(action);
+    if (pwrite(vbasedev->fd, &action, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_config.action))
+            != sz) {
+        error_report("vfio: action failure for device config get buffer");
+        return -1;
+    }
+
+    if (!vfio_device_state_region_mmaped(region_config)) {
+        buf = g_malloc(len);
+        if (buf == NULL) {
+            error_report("vfio: Failed to allocate memory for migrate");
+            return -1;
+        }
+        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
+            error_report("vfio: Failed to read device config buffer");
+            g_free(buf);
+            return -1;
+        }
+        qemu_put_buffer(f, buf, len);
+        g_free(buf);
+    } else {
+        dest = region_config->mmaps[0].mmap;
+        qemu_put_buffer(f, dest, len);
+    }
+    return 0;
+}
+
+static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
+                            QEMUFile *f, uint64_t len)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_config =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
+    void *dest;
+    uint32_t sz;
+    uint8_t *buf = NULL;
+    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
+
+    vfio_set_device_config_size(vdev, len);
+
+    if (!vfio_device_state_region_mmaped(region_config)) {
+        buf = g_malloc(len);
+        if (buf == NULL) {
+            error_report("vfio: Failed to allocate memory for migrate");
+            return -1;
+        }
+        qemu_get_buffer(f, buf, len);
+        if (pwrite(vbasedev->fd, buf, len,
+                    region_config->fd_offset) != len) {
+            error_report("vfio: Failed to write device config buffer");
+            g_free(buf);
+            return -1;
+        }
+        g_free(buf);
+    } else {
+        dest = region_config->mmaps[0].mmap;
+        qemu_get_buffer(f, dest, len);
+    }
+
+    sz = sizeof(action);
+    if (pwrite(vbasedev->fd, &action, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_config.action))
+            != sz) {
+        error_report("vfio: action failure for device config set buffer");
+        return -1;
+    }
+
+    return 0;
+}
+
+static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
+        uint64_t start_addr, uint64_t page_nr)
+{
+
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_bitmap =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
+    unsigned long bitmap_size =
+                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
+    uint32_t sz;
+
+    struct {
+        __u64 start_addr;
+        __u64 page_nr;
+    } system_memory;
+    system_memory.start_addr = start_addr;
+    system_memory.page_nr = page_nr;
+    sz = sizeof(system_memory);
+    if (pwrite(vbasedev->fd, &system_memory, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, system_memory))
+            != sz) {
+        error_report("vfio: Failed to set system memory range for dirty pages");
+        return -1;
+    }
+
+    if (!vfio_device_state_region_mmaped(region_bitmap)) {
+        void *bitmap = g_malloc0(bitmap_size);
+
+        if (pread(vbasedev->fd, bitmap, bitmap_size,
+                    region_bitmap->fd_offset) != bitmap_size) {
+            error_report("vfio: Failed to read dirty bitmap data");
+            g_free(bitmap);
+            return -1;
+        }
+
+        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
+
+        g_free(bitmap);
+    } else {
+        cpu_physical_memory_set_dirty_lebitmap(
+                    region_bitmap->mmaps[0].mmap,
+                    start_addr, page_nr);
+    }
+    return 0;
+}
+
+int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
+        uint64_t start_addr, uint64_t page_nr)
+{
+    VFIORegion *region_bitmap =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
+    unsigned long chunk_size = region_bitmap->size;
+    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
+                                BITS_PER_LONG;
+
+    uint64_t cnt_left;
+    int rc = 0;
+
+    cnt_left = page_nr;
+
+    while (cnt_left >= chunk_pg_nr) {
+        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
+        if (rc) {
+            goto exit;
+        }
+        cnt_left -= chunk_pg_nr;
+        /* advance by the bytes one chunk of the bitmap covers */
+        start_addr += chunk_pg_nr * TARGET_PAGE_SIZE;
+    }
+    rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
+
+exit:
+    return rc;
+}
+
+static int vfio_set_device_state(VFIOPCIDevice *vdev,
+        uint32_t dev_state)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region;
+    uint32_t sz = sizeof(dev_state);
+
+    /* check before dereferencing vdev->migration for the region */
+    if (!vdev->migration) {
+        return -1;
+    }
+
+    region = &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+
+    if (pwrite(vbasedev->fd, &dev_state, sz,
+              region->fd_offset +
+              offsetof(struct vfio_device_state_ctl, device_state))
+            != sz) {
+        error_report("vfio: Failed to set device state %d", dev_state);
+        return -1;
+    }
+    vdev->migration->device_state = dev_state;
+    return 0;
+}
+
+static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+
+    uint32_t caps;
+    uint32_t size = sizeof(caps);
+
+    if (pread(vbasedev->fd, &caps, size,
+                region->fd_offset +
+                offsetof(struct vfio_device_state_ctl, caps))
+            != size) {
+        error_report("%s Failed to read data caps of device states",
+                vbasedev->name);
+        return -1;
+    }
+    vdev->migration->data_caps = caps;
+    return 0;
+}
+
+
+static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+
+    uint32_t version;
+    uint32_t size = sizeof(version);
+
+    if (pread(vbasedev->fd, &version, size,
+                region->fd_offset +
+                offsetof(struct vfio_device_state_ctl, version))
+            != size) {
+        error_report("%s Failed to read version of device state interfaces",
+                vbasedev->name);
+        return -1;
+    }
+
+    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
+        error_report("%s migration version mismatch, expected version is %d",
+                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);
+        return -1;
+    }
+
+    return 0;
+}
+
+static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
+{
+    VFIOPCIDevice *vdev = pv;
+    uint32_t dev_state = vdev->migration->device_state;
+
+    if (!running) {
+        dev_state |= VFIO_DEVICE_STATE_STOP;
+    } else {
+        dev_state &= ~VFIO_DEVICE_STATE_STOP;
+    }
+
+    vfio_set_device_state(vdev, dev_state);
+}
+
+static void vfio_save_live_pending(QEMUFile *f, void *opaque,
+                                   uint64_t max_size,
+                                   uint64_t *res_precopy_only,
+                                   uint64_t *res_compatible,
+                                   uint64_t *res_post_copy_only)
+{
+    VFIOPCIDevice *vdev = opaque;
+
+    if (!vfio_device_data_cap_device_memory(vdev)) {
+        return;
+    }
+
+    /* TODO: report pending device memory once that capability is supported */
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+
+    if (!vfio_device_data_cap_device_memory(vdev)) {
+        return 0;
+    }
+
+    return 0;
+}
+
+static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
+    bool msi_64bit;
+
+    /* restore pci bar configuration */
+    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+            ctl & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        bar_cfg = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
+    }
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
+
+    /* restore msi configuration */
+    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
+    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
+
+    vfio_pci_write_config(pdev,
+            pdev->msi_cap + PCI_MSI_FLAGS,
+            ctl & ~PCI_MSI_FLAGS_ENABLE, 2);
+
+    msi_lo = qemu_get_be32(f);
+    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
+
+    if (msi_64bit) {
+        msi_hi = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                msi_hi, 4);
+    }
+    msi_data = qemu_get_be32(f);
+    vfio_pci_write_config(pdev,
+            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+            msi_data, 2);
+
+    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+            ctl | PCI_MSI_FLAGS_ENABLE, 2);
+
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIOPCIDevice *vdev = opaque;
+    int flag;
+    uint64_t len;
+    int ret = 0;
+
+    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
+        return -EINVAL;
+    }
+
+    do {
+        flag = qemu_get_byte(f);
+
+        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
+        case VFIO_SAVE_FLAG_SETUP:
+            break;
+        case VFIO_SAVE_FLAG_PCI:
+            vfio_pci_load_config(vdev, f);
+            break;
+        case VFIO_SAVE_FLAG_DEVCONFIG:
+            len = qemu_get_be64(f);
+            vfio_load_data_device_config(vdev, f, len);
+            break;
+        default:
+            ret = -EINVAL;
+        }
+    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
+
+    return ret;
+}
+
+static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
+    bool msi_64bit;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+        qemu_put_be32(f, bar_cfg);
+    }
+
+    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
+    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
+
+    msi_lo = pci_default_read_config(pdev,
+            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+    qemu_put_be32(f, msi_lo);
+
+    if (msi_64bit) {
+        msi_hi = pci_default_read_config(pdev,
+                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                4);
+        qemu_put_be32(f, msi_hi);
+    }
+
+    msi_data = pci_default_read_config(pdev,
+            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+            2);
+    qemu_put_be32(f, msi_data);
+
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    int rc = 0;
+
+    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
+    vfio_pci_save_config(vdev, f);
+
+    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
+    rc = vfio_get_device_config_size(vdev);
+    if (!rc) {
+        rc = vfio_save_data_device_config(vdev, f);
+    }
+    return rc;
+}
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
+
+    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
+            VFIO_DEVICE_STATE_LOGGING);
+    return 0;
+}
+
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    uint32_t dev_state = vdev->migration->device_state;
+
+    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
+
+    vfio_set_device_state(vdev, dev_state);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_live_pending = vfio_save_live_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_cleanup = vfio_save_cleanup,
+    .load_setup = vfio_load_setup,
+    .load_state = vfio_load_state,
+};
+
+int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
+{
+    int ret;
+    Error *local_err = NULL;
+    vdev->migration = g_new0(VFIOMigration, 1);
+
+    if (vfio_device_state_region_setup(vdev,
+              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
+              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
+              "device-state-ctl")) {
+        goto error;
+    }
+
+    if (vfio_check_devstate_version(vdev)) {
+        goto error;
+    }
+
+    if (vfio_get_device_data_caps(vdev)) {
+        goto error;
+    }
+
+    if (vfio_device_state_region_setup(vdev,
+              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
+              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
+              "device-state-data-device-config")) {
+        goto error;
+    }
+
+    if (vfio_device_data_cap_device_memory(vdev)) {
+        error_report("vfio: No support for the device memory data cap yet");
+        goto error;
+    }
+
+    if (vfio_device_data_cap_system_memory(vdev) &&
+            vfio_device_state_region_setup(vdev,
+              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
+              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
+              "device-state-data-dirtybitmap")) {
+        goto error;
+    }
+
+    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+
+    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
+            VFIO_DEVICE_STATE_INTERFACE_VERSION,
+            &savevm_vfio_handlers,
+            vdev);
+
+    vdev->migration->vm_state =
+        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
+
+    return 0;
+error:
+    error_setg(&vdev->migration_blocker,
+            "VFIO device doesn't support migration");
+    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vdev->migration_blocker);
+    }
+
+    g_free(vdev->migration);
+    vdev->migration = NULL;
+
+    return ret;
+}
+
+void vfio_migration_finalize(VFIOPCIDevice *vdev)
+{
+    if (vdev->migration) {
+        int i;
+        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
+        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
+        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
+            vfio_region_finalize(&vdev->migration->region[i]);
+        }
+        g_free(vdev->migration);
+        vdev->migration = NULL;
+    } else if (vdev->migration_blocker) {
+        migrate_del_blocker(vdev->migration_blocker);
+        error_free(vdev->migration_blocker);
+    }
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c0cb1ec..b8e006b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -37,7 +37,6 @@
 
 #define MSIX_CAP_LENGTH 12
 
-#define TYPE_VFIO_PCI "vfio-pci"
 #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index b1ae4c0..4b7b1bb 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -19,6 +19,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/queue.h"
 #include "qemu/timer.h"
+#include "sysemu/sysemu.h"
 
 #define PCI_ANY_ID (~0)
 
@@ -56,6 +57,21 @@ typedef struct VFIOBAR {
     QLIST_HEAD(, VFIOQuirk) quirks;
 } VFIOBAR;
 
+enum {
+    VFIO_DEVSTATE_REGION_CTL = 0,
+    VFIO_DEVSTATE_REGION_DATA_CONFIG,
+    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
+    VFIO_DEVSTATE_REGION_DATA_BITMAP,
+    VFIO_DEVSTATE_REGION_NUM,
+};
+typedef struct VFIOMigration {
+    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
+    uint32_t data_caps;
+    uint32_t device_state;
+    uint64_t devconfig_size;
+    VMChangeStateEntry *vm_state;
+} VFIOMigration;
+
 typedef struct VFIOVGARegion {
     MemoryRegion mem;
     off_t offset;
@@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
     VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
     VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
     void *igd_opregion;
+    VFIOMigration *migration;
+    Error *migration_blocker;
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
@@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
 void vfio_display_reset(VFIOPCIDevice *vdev);
 int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
 void vfio_display_finalize(VFIOPCIDevice *vdev);
-
+bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
+bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
+int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
+         uint64_t start_addr, uint64_t page_nr);
+int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
+void vfio_migration_finalize(VFIOPCIDevice *vdev);
 #endif /* HW_VFIO_VFIO_PCI_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1b434d0..ed43613 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -32,6 +32,7 @@
 #endif
 
 #define VFIO_MSG_PREFIX "vfio %s: "
+#define TYPE_VFIO_PCI "vfio-pci"
 
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
-- 
2.7.4


* [Qemu-devel] [PATCH 2/5] vfio/migration: support device of device config capability
@ 2019-02-19  8:52   ` Yan Zhao
  0 siblings, 0 replies; 133+ messages in thread
From: Yan Zhao @ 2019-02-19  8:52 UTC (permalink / raw)
  To: alex.williamson, qemu-devel
  Cc: intel-gvt-dev, Zhengxiao.zx, yi.l.liu, eskultet, ziye.yang,
	cohuck, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk, pasic,
	aik, eauger, felipe, jonathan.davies, changpeng.liu, Ken.Xue,
	kwankhede, kevin.tian, cjia, arei.gonglei, kvm, Yan Zhao

Device config is the default data that every device is expected to have,
so the device config capability is on by default and need not be set.

- Currently two types of resources are saved/loaded for a device with
  the device config capability:
  general PCI config data and device config data.
  Both are copied as a whole when precopy stops.

Migration setup flow:
- Set up the device state regions and check their device state version
  and capabilities. Mmap the Device Config Region and Dirty Bitmap
  Region, if available.
- If the device state regions fail to be set up, a migration blocker is
  registered instead.
- Register SaveVMHandlers for device state save/load.
- Register a VM state change handler to set the device's running/stopped
  state.
- On migration startup on the source machine, set the device's state to
  VFIO_DEVICE_STATE_LOGGING.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 |   1 -
 hw/vfio/pci.h                 |  25 +-
 include/hw/vfio/vfio-common.h |   1 +
 5 files changed, 659 insertions(+), 3 deletions(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index 8b3f664..f32ff19 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
-obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
+obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 0000000..16d6395
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,633 @@
+#include "qemu/osdep.h"
+
+#include "hw/vfio/vfio-common.h"
+#include "migration/blocker.h"
+#include "migration/register.h"
+#include "qapi/error.h"
+#include "pci.h"
+#include "sysemu/kvm.h"
+#include "exec/ram_addr.h"
+
+#define VFIO_SAVE_FLAG_SETUP 0
+#define VFIO_SAVE_FLAG_PCI 1
+#define VFIO_SAVE_FLAG_DEVCONFIG 2
+#define VFIO_SAVE_FLAG_DEVMEMORY 4
+#define VFIO_SAVE_FLAG_CONTINUE 8
+
+static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
+        VFIORegion *region, uint32_t subtype, const char *name)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_region_info *info;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
+            subtype, &info);
+    if (ret) {
+        error_report("Failed to get info of region %s", name);
+        return ret;
+    }
+
+    if (vfio_region_setup(OBJECT(vdev), vbasedev,
+            region, info->index, name)) {
+        error_report("Failed to setup migration region %s", name);
+        return -1;
+    }
+
+    if (vfio_region_mmap(region)) {
+        error_report("Failed to mmap migration region %s", name);
+    }
+
+    return 0;
+}
+
+bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
+{
+    return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
+}
+
+bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
+{
+    return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
+}
+
+static bool vfio_device_state_region_mmaped(VFIORegion *region)
+{
+    bool mmaped = true;
+    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
+            (region->size != region->mmaps[0].size) ||
+            (region->mmaps[0].mmap == NULL)) {
+        mmaped = false;
+    }
+
+    return mmaped;
+}
+
+static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_config =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
+    uint64_t len;
+    int sz;
+
+    sz = sizeof(len);
+    if (pread(vbasedev->fd, &len, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_config.size))
+            != sz) {
+        error_report("vfio: Failed to get length of device config");
+        return -1;
+    }
+    if (len > region_config->size) {
+        error_report("vfio: Invalid device config length");
+        return -1;
+    }
+    vdev->migration->devconfig_size = len;
+
+    return 0;
+}
+
+static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_config =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
+    int sz;
+
+    if (size > region_config->size) {
+        return -1;
+    }
+
+    sz = sizeof(size);
+    if (pwrite(vbasedev->fd, &size, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_config.size))
+            != sz) {
+        error_report("vfio: Failed to set length of device config");
+        return -1;
+    }
+    vdev->migration->devconfig_size = size;
+    return 0;
+}
+
+static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_config =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
+    void *dest;
+    uint32_t sz;
+    uint8_t *buf = NULL;
+    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
+    uint64_t len = vdev->migration->devconfig_size;
+
+    qemu_put_be64(f, len);
+
+    sz = sizeof(action);
+    if (pwrite(vbasedev->fd, &action, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_config.action))
+            != sz) {
+        error_report("vfio: action failure for device config get buffer");
+        return -1;
+    }
+
+    if (!vfio_device_state_region_mmaped(region_config)) {
+        buf = g_malloc(len);
+        if (buf == NULL) {
+            error_report("vfio: Failed to allocate memory for migrate");
+            return -1;
+        }
+        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
+            error_report("vfio: Failed to read device config buffer");
+            g_free(buf);
+            return -1;
+        }
+        qemu_put_buffer(f, buf, len);
+        g_free(buf);
+    } else {
+        dest = region_config->mmaps[0].mmap;
+        qemu_put_buffer(f, dest, len);
+    }
+    return 0;
+}
+
+static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
+                            QEMUFile *f, uint64_t len)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_config =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
+    void *dest;
+    uint32_t sz;
+    uint8_t *buf = NULL;
+    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
+
+    vfio_set_device_config_size(vdev, len);
+
+    if (!vfio_device_state_region_mmaped(region_config)) {
+        buf = g_malloc(len);
+        if (buf == NULL) {
+            error_report("vfio: Failed to allocate memory for migrate");
+            return -1;
+        }
+        qemu_get_buffer(f, buf, len);
+        if (pwrite(vbasedev->fd, buf, len,
+                    region_config->fd_offset) != len) {
+            error_report("vfio: Failed to write device config buffer");
+            g_free(buf);
+            return -1;
+        }
+        g_free(buf);
+    } else {
+        dest = region_config->mmaps[0].mmap;
+        qemu_get_buffer(f, dest, len);
+    }
+
+    sz = sizeof(action);
+    if (pwrite(vbasedev->fd, &action, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_config.action))
+            != sz) {
+        error_report("vfio: action failure for device config set buffer");
+        return -1;
+    }
+
+    return 0;
+}
+
+static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
+        uint64_t start_addr, uint64_t page_nr)
+{
+
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_bitmap =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
+    unsigned long bitmap_size =
+                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
+    uint32_t sz;
+
+    struct {
+        __u64 start_addr;
+        __u64 page_nr;
+    } system_memory;
+    system_memory.start_addr = start_addr;
+    system_memory.page_nr = page_nr;
+    sz = sizeof(system_memory);
+    if (pwrite(vbasedev->fd, &system_memory, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, system_memory))
+            != sz) {
+        error_report("vfio: Failed to set system memory range for dirty pages");
+        return -1;
+    }
+
+    if (!vfio_device_state_region_mmaped(region_bitmap)) {
+        void *bitmap = g_malloc0(bitmap_size);
+
+        if (pread(vbasedev->fd, bitmap, bitmap_size,
+                    region_bitmap->fd_offset) != bitmap_size) {
+            error_report("vfio: Failed to read dirty bitmap data");
+            g_free(bitmap);
+            return -1;
+        }
+
+        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
+
+        g_free(bitmap);
+    } else {
+        cpu_physical_memory_set_dirty_lebitmap(
+                    region_bitmap->mmaps[0].mmap,
+                    start_addr, page_nr);
+    }
+    return 0;
+}
+
+int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
+        uint64_t start_addr, uint64_t page_nr)
+{
+    VFIORegion *region_bitmap =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
+    unsigned long chunk_size = region_bitmap->size;
+    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
+                                BITS_PER_LONG;
+
+    uint64_t cnt_left;
+    int rc = 0;
+
+    cnt_left = page_nr;
+
+    while (cnt_left >= chunk_pg_nr) {
+        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
+        if (rc) {
+            goto exit;
+        }
+        cnt_left -= chunk_pg_nr;
+        start_addr += chunk_pg_nr * TARGET_PAGE_SIZE;
+    }
+    if (cnt_left) {
+        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
+    }
+
+exit:
+    return rc;
+}
+
+static int vfio_set_device_state(VFIOPCIDevice *vdev,
+        uint32_t dev_state)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region;
+    uint32_t sz = sizeof(dev_state);
+
+    /* check for migration support before dereferencing its regions */
+    if (!vdev->migration) {
+        return -1;
+    }
+    region = &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+
+    if (pwrite(vbasedev->fd, &dev_state, sz,
+              region->fd_offset +
+              offsetof(struct vfio_device_state_ctl, device_state))
+            != sz) {
+        error_report("vfio: Failed to set device state %d", dev_state);
+        return -1;
+    }
+    vdev->migration->device_state = dev_state;
+    return 0;
+}
+
+static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+
+    uint32_t caps;
+    uint32_t size = sizeof(caps);
+
+    if (pread(vbasedev->fd, &caps, size,
+                region->fd_offset +
+                offsetof(struct vfio_device_state_ctl, caps))
+            != size) {
+        error_report("%s Failed to read data caps of device states",
+                vbasedev->name);
+        return -1;
+    }
+    vdev->migration->data_caps = caps;
+    return 0;
+}
+
+
+static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+
+    uint32_t version;
+    uint32_t size = sizeof(version);
+
+    if (pread(vbasedev->fd, &version, size,
+                region->fd_offset +
+                offsetof(struct vfio_device_state_ctl, version))
+            != size) {
+        error_report("%s Failed to read version of device state interfaces",
+                vbasedev->name);
+        return -1;
+    }
+
+    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
+        error_report("%s migration version mismatch, expected version is %d",
+                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);
+        return -1;
+    }
+
+    return 0;
+}
+
+static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
+{
+    VFIOPCIDevice *vdev = pv;
+    uint32_t dev_state = vdev->migration->device_state;
+
+    if (!running) {
+        dev_state |= VFIO_DEVICE_STATE_STOP;
+    } else {
+        dev_state &= ~VFIO_DEVICE_STATE_STOP;
+    }
+
+    vfio_set_device_state(vdev, dev_state);
+}
+
+static void vfio_save_live_pending(QEMUFile *f, void *opaque,
+                                   uint64_t max_size,
+                                   uint64_t *res_precopy_only,
+                                   uint64_t *res_compatible,
+                                   uint64_t *res_post_copy_only)
+{
+    VFIOPCIDevice *vdev = opaque;
+
+    if (!vfio_device_data_cap_device_memory(vdev)) {
+        return;
+    }
+
+    /* TODO: report pending device memory once that capability is supported */
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+
+    if (!vfio_device_data_cap_device_memory(vdev)) {
+        return 0;
+    }
+
+    return 0;
+}
+
+static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
+    bool msi_64bit;
+
+    /* restore pci bar configuration */
+    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+            ctl & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        bar_cfg = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
+    }
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
+
+    /* restore msi configuration */
+    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
+    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
+
+    vfio_pci_write_config(pdev,
+            pdev->msi_cap + PCI_MSI_FLAGS,
+            ctl & ~PCI_MSI_FLAGS_ENABLE, 2);
+
+    msi_lo = qemu_get_be32(f);
+    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
+
+    if (msi_64bit) {
+        msi_hi = qemu_get_be32(f);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                msi_hi, 4);
+    }
+    msi_data = qemu_get_be32(f);
+    vfio_pci_write_config(pdev,
+            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+            msi_data, 2);
+
+    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+            ctl | PCI_MSI_FLAGS_ENABLE, 2);
+
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIOPCIDevice *vdev = opaque;
+    int flag;
+    uint64_t len;
+    int ret = 0;
+
+    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
+        return -EINVAL;
+    }
+
+    do {
+        flag = qemu_get_byte(f);
+
+        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
+        case VFIO_SAVE_FLAG_SETUP:
+            break;
+        case VFIO_SAVE_FLAG_PCI:
+            vfio_pci_load_config(vdev, f);
+            break;
+        case VFIO_SAVE_FLAG_DEVCONFIG:
+            len = qemu_get_be64(f);
+            vfio_load_data_device_config(vdev, f, len);
+            break;
+        default:
+            ret = -EINVAL;
+        }
+    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
+
+    return ret;
+}
+
+static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
+    bool msi_64bit;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
+        qemu_put_be32(f, bar_cfg);
+    }
+
+    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
+    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
+
+    msi_lo = pci_default_read_config(pdev,
+            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+    qemu_put_be32(f, msi_lo);
+
+    if (msi_64bit) {
+        msi_hi = pci_default_read_config(pdev,
+                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                4);
+        qemu_put_be32(f, msi_hi);
+    }
+
+    msi_data = pci_default_read_config(pdev,
+            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+            2);
+    qemu_put_be32(f, msi_data);
+
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    int rc = 0;
+
+    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
+    vfio_pci_save_config(vdev, f);
+
+    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
+    rc = vfio_get_device_config_size(vdev);
+    if (!rc) {
+        rc = vfio_save_data_device_config(vdev, f);
+    }
+    return rc;
+}
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
+
+    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
+            VFIO_DEVICE_STATE_LOGGING);
+    return 0;
+}
+
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    uint32_t dev_state = vdev->migration->device_state;
+
+    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
+
+    vfio_set_device_state(vdev, dev_state);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_live_pending = vfio_save_live_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_cleanup = vfio_save_cleanup,
+    .load_setup = vfio_load_setup,
+    .load_state = vfio_load_state,
+};
+
+int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
+{
+    int ret;
+    Error *local_err = NULL;
+    vdev->migration = g_new0(VFIOMigration, 1);
+
+    if (vfio_device_state_region_setup(vdev,
+              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
+              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
+              "device-state-ctl")) {
+        goto error;
+    }
+
+    if (vfio_check_devstate_version(vdev)) {
+        goto error;
+    }
+
+    if (vfio_get_device_data_caps(vdev)) {
+        goto error;
+    }
+
+    if (vfio_device_state_region_setup(vdev,
+              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
+              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
+              "device-state-data-device-config")) {
+        goto error;
+    }
+
+    if (vfio_device_data_cap_device_memory(vdev)) {
+        error_report("vfio: No support for the device memory data cap yet");
+        goto error;
+    }
+
+    if (vfio_device_data_cap_system_memory(vdev) &&
+            vfio_device_state_region_setup(vdev,
+              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
+              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
+              "device-state-data-dirtybitmap")) {
+        goto error;
+    }
+
+    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+
+    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
+            VFIO_DEVICE_STATE_INTERFACE_VERSION,
+            &savevm_vfio_handlers,
+            vdev);
+
+    vdev->migration->vm_state =
+        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
+
+    return 0;
+error:
+    error_setg(&vdev->migration_blocker,
+            "VFIO device doesn't support migration");
+    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vdev->migration_blocker);
+        vdev->migration_blocker = NULL;
+    }
+
+    g_free(vdev->migration);
+    vdev->migration = NULL;
+
+    return ret;
+}
+
+void vfio_migration_finalize(VFIOPCIDevice *vdev)
+{
+    if (vdev->migration) {
+        int i;
+        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
+        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
+        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
+            vfio_region_finalize(&vdev->migration->region[i]);
+        }
+        g_free(vdev->migration);
+        vdev->migration = NULL;
+    } else if (vdev->migration_blocker) {
+        migrate_del_blocker(vdev->migration_blocker);
+        error_free(vdev->migration_blocker);
+    }
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c0cb1ec..b8e006b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -37,7 +37,6 @@
 
 #define MSIX_CAP_LENGTH 12
 
-#define TYPE_VFIO_PCI "vfio-pci"
 #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index b1ae4c0..4b7b1bb 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -19,6 +19,7 @@
 #include "qemu/event_notifier.h"
 #include "qemu/queue.h"
 #include "qemu/timer.h"
+#include "sysemu/sysemu.h"
 
 #define PCI_ANY_ID (~0)
 
@@ -56,6 +57,21 @@ typedef struct VFIOBAR {
     QLIST_HEAD(, VFIOQuirk) quirks;
 } VFIOBAR;
 
+enum {
+    VFIO_DEVSTATE_REGION_CTL = 0,
+    VFIO_DEVSTATE_REGION_DATA_CONFIG,
+    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
+    VFIO_DEVSTATE_REGION_DATA_BITMAP,
+    VFIO_DEVSTATE_REGION_NUM,
+};
+typedef struct VFIOMigration {
+    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
+    uint32_t data_caps;
+    uint32_t device_state;
+    uint64_t devconfig_size;
+    VMChangeStateEntry *vm_state;
+} VFIOMigration;
+
 typedef struct VFIOVGARegion {
     MemoryRegion mem;
     off_t offset;
@@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
     VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
     VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
     void *igd_opregion;
+    VFIOMigration *migration;
+    Error *migration_blocker;
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
@@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
 void vfio_display_reset(VFIOPCIDevice *vdev);
 int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
 void vfio_display_finalize(VFIOPCIDevice *vdev);
-
+bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
+bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
+int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
+         uint64_t start_addr, uint64_t page_nr);
+int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
+void vfio_migration_finalize(VFIOPCIDevice *vdev);
 #endif /* HW_VFIO_VFIO_PCI_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1b434d0..ed43613 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -32,6 +32,7 @@
 #endif
 
 #define VFIO_MSG_PREFIX "vfio %s: "
+#define TYPE_VFIO_PCI "vfio-pci"
 
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 3/5] vfio/migration: tracking of dirty page in system memory
  2019-02-19  8:50 ` [Qemu-devel] " Yan Zhao
@ 2019-02-19  8:52   ` Yan Zhao
  -1 siblings, 0 replies; 133+ messages in thread
From: Yan Zhao @ 2019-02-19  8:52 UTC (permalink / raw)
  To: alex.williamson, qemu-devel
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, kwankhede, eauger,
	yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic, arei.gonglei,
	felipe, Ken.Xue, kevin.tian, Yan Zhao, dgilbert, intel-gvt-dev,
	changpeng.liu, cohuck, zhi.a.wang, jonathan.davies

Register the log_sync interface to hook into RAM's live migration
callbacks.

ram_save_pending
   |->migration_bitmap_sync
       |->memory_global_dirty_log_sync
           |->memory_region_sync_dirty_bitmap
               |->listener->log_sync(listener, &mrs);

So, dirty pages produced by the vfio device in system memory are
saved/loaded iteratively by RAM's live migration code.

The bitmap of the device's dirty pages in system memory is retrieved
from the Dirty Bitmap Region.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
---
 hw/vfio/common.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7c185e5a..719e750 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -27,6 +27,7 @@
 
 #include "hw/vfio/vfio-common.h"
 #include "hw/vfio/vfio.h"
+#include "hw/vfio/pci.h"
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
 #include "hw/hw.h"
@@ -698,9 +699,34 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static void vfio_log_sync(MemoryListener *listener,
+                          MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOGroup *group = QLIST_FIRST(&container->group_list);
+    VFIODevice *vbasedev;
+    VFIOPCIDevice *vdev;
+
+    ram_addr_t size = int128_get64(section->size);
+    uint64_t page_nr = size >> TARGET_PAGE_BITS;
+    uint64_t start_addr = section->offset_within_address_space;
+
+    QLIST_FOREACH(vbasedev, &group->device_list, next) {
+        vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+        if (!vdev->migration ||
+                !vfio_device_data_cap_system_memory(vdev) ||
+                !(vdev->migration->device_state & VFIO_DEVICE_STATE_LOGGING)) {
+            continue;
+        }
+
+        vfio_set_dirty_page_bitmap(vdev, start_addr, page_nr);
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 4/5] vfio/migration: turn on migration
  2019-02-19  8:50 ` [Qemu-devel] " Yan Zhao
@ 2019-02-19  8:52   ` Yan Zhao
  -1 siblings, 0 replies; 133+ messages in thread
From: Yan Zhao @ 2019-02-19  8:52 UTC (permalink / raw)
  To: alex.williamson, qemu-devel
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, kwankhede, eauger,
	yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic, arei.gonglei,
	felipe, Ken.Xue, kevin.tian, Yan Zhao, dgilbert, intel-gvt-dev,
	changpeng.liu, cohuck, zhi.a.wang, jonathan.davies

Initialize vfio migration in vfio_realize() and register a migration
blocker if it fails.
Finalize all migration resources in vfio_instance_finalize().

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
---
 hw/vfio/pci.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b8e006b..8bf625e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3068,6 +3068,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto out_teardown;
     }
 
+    vfio_migration_init(vdev, errp);
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3089,6 +3091,7 @@ static void vfio_instance_finalize(Object *obj)
 
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
+    vfio_migration_finalize(vdev);
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
     /*
@@ -3221,11 +3224,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3233,7 +3231,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 5/5] vfio/migration: support device memory capability
  2019-02-19  8:50 ` [Qemu-devel] " Yan Zhao
@ 2019-02-19  8:53   ` Yan Zhao
  -1 siblings, 0 replies; 133+ messages in thread
From: Yan Zhao @ 2019-02-19  8:53 UTC (permalink / raw)
  To: alex.williamson, qemu-devel
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, kwankhede, eauger,
	yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic, arei.gonglei,
	felipe, Ken.Xue, kevin.tian, Yan Zhao, dgilbert, intel-gvt-dev,
	changpeng.liu, cohuck, zhi.a.wang, jonathan.davies

If a device has the device memory capability, save/load data from device
memory in the pre-copy and stop-and-copy phases.

The LOGGING state is set for device memory for dirty page logging:
in LOGGING state, getting device memory returns the whole device memory
snapshot; outside LOGGING state, it returns the dirty data since the last
get operation.

Device memory is usually very large, so qemu needs to chunk it into
several pieces, each the size of the device memory region.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/vfio/pci.h       |   1 +
 2 files changed, 231 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 16d6395..f1e9309 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
     return 0;
 }
 
+static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    uint64_t len;
+    int sz;
+
+    sz = sizeof(len);
+    if (pread(vbasedev->fd, &len, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_memory.size))
+            != sz) {
+        error_report("vfio: Failed to get length of device memory");
+        return -1;
+    }
+    vdev->migration->devmem_size = len;
+    return 0;
+}
+
+static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    int sz;
+
+    sz = sizeof(size);
+    if (pwrite(vbasedev->fd, &size, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_memory.size))
+            != sz) {
+        error_report("vfio: Failed to set length of device memory");
+        return -1;
+    }
+    vdev->migration->devmem_size = size;
+    return 0;
+}
+
+static
+int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
+                                    uint64_t pos, uint64_t len)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_devmem =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
+    void *dest;
+    uint32_t sz;
+    uint8_t *buf = NULL;
+    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
+
+    if (len > region_devmem->size) {
+        return -1;
+    }
+
+    sz = sizeof(pos);
+    if (pwrite(vbasedev->fd, &pos, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_memory.pos))
+            != sz) {
+        error_report("vfio: Failed to set save buffer pos");
+        return -1;
+    }
+    sz = sizeof(action);
+    if (pwrite(vbasedev->fd, &action, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_memory.action))
+            != sz) {
+        error_report("vfio: Failed to set save buffer action");
+        return -1;
+    }
+
+    if (!vfio_device_state_region_mmaped(region_devmem)) {
+        buf = g_malloc(len);
+        if (buf == NULL) {
+            error_report("vfio: Failed to allocate memory for migrate");
+            return -1;
+        }
+        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
+            error_report("vfio: Failed to read device memory buffer");
+            g_free(buf);
+            return -1;
+        }
+        qemu_put_be64(f, len);
+        qemu_put_be64(f, pos);
+        qemu_put_buffer(f, buf, len);
+        g_free(buf);
+    } else {
+        dest = region_devmem->mmaps[0].mmap;
+        qemu_put_be64(f, len);
+        qemu_put_be64(f, pos);
+        qemu_put_buffer(f, dest, len);
+    }
+    return 0;
+}
+
+static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
+{
+    VFIORegion *region_devmem =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
+    uint64_t total_len = vdev->migration->devmem_size;
+    uint64_t pos = 0;
+
+    qemu_put_be64(f, total_len);
+    while (pos < total_len) {
+        uint64_t len = region_devmem->size;
+
+        if (pos + len >= total_len) {
+            len = total_len - pos;
+        }
+        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
+            return -1;
+        }
+        pos += len;
+    }
+
+    return 0;
+}
+
+static
+int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
+                                uint64_t pos, uint64_t len)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    VFIORegion *region_ctl =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
+    VFIORegion *region_devmem =
+        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
+
+    void *dest;
+    uint32_t sz;
+    uint8_t *buf = NULL;
+    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
+
+    if (len > region_devmem->size) {
+        return -1;
+    }
+
+    sz = sizeof(pos);
+    if (pwrite(vbasedev->fd, &pos, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_memory.pos))
+            != sz) {
+        error_report("vfio: Failed to set device memory buffer pos");
+        return -1;
+    }
+    if (!vfio_device_state_region_mmaped(region_devmem)) {
+        buf = g_malloc(len);
+        if (buf == NULL) {
+            error_report("vfio: Failed to allocate memory for migrate");
+            return -1;
+        }
+        qemu_get_buffer(f, buf, len);
+        if (pwrite(vbasedev->fd, buf, len,
+                    region_devmem->fd_offset) != len) {
+            error_report("vfio: Failed to load device memory buffer");
+            g_free(buf);
+            return -1;
+        }
+        g_free(buf);
+    } else {
+        dest = region_devmem->mmaps[0].mmap;
+        qemu_get_buffer(f, dest, len);
+    }
+
+    sz = sizeof(action);
+    if (pwrite(vbasedev->fd, &action, sz,
+                region_ctl->fd_offset +
+                offsetof(struct vfio_device_state_ctl, device_memory.action))
+            != sz) {
+        error_report("vfio: Failed to set load device memory buffer action");
+        return -1;
+    }
+
+    return 0;
+}
+
+static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
+                        QEMUFile *f, uint64_t total_len)
+{
+    uint64_t pos = 0, len = 0;
+
+    vfio_set_device_memory_size(vdev, total_len);
+
+    while (pos + len < total_len) {
+        len = qemu_get_be64(f);
+        pos = qemu_get_be64(f);
+
+        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
+    }
+
+    return 0;
+}
+
+
 static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
         uint64_t start_addr, uint64_t page_nr)
 {
@@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
         return;
     }
 
+    /* get dirty data size of device memory */
+    vfio_get_device_memory_size(vdev);
+
+    *res_precopy_only += vdev->migration->devmem_size;
     return;
 }
 
@@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
         return 0;
     }
 
-    return 0;
+    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
+    /* get dirty data of device memory */
+    return vfio_save_data_device_memory(vdev, f);
 }
 
 static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
@@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
             len = qemu_get_be64(f);
             vfio_load_data_device_config(vdev, f, len);
             break;
+        case VFIO_SAVE_FLAG_DEVMEMORY:
+            len = qemu_get_be64(f);
+            vfio_load_data_device_memory(vdev, f, len);
+            break;
         default:
             ret = -EINVAL;
         }
@@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     VFIOPCIDevice *vdev = opaque;
     int rc = 0;
 
+    if (vfio_device_data_cap_device_memory(vdev)) {
+        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
+        /* get dirty data of device memory */
+        vfio_get_device_memory_size(vdev);
+        rc = vfio_save_data_device_memory(vdev, f);
+    }
+
     qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
     vfio_pci_save_config(vdev, f);
 
@@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
 {
+    int rc = 0;
     VFIOPCIDevice *vdev = opaque;
-    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
+
+    if (vfio_device_data_cap_device_memory(vdev)) {
+        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
+        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
+        /* get whole snapshot of device memory */
+        vfio_get_device_memory_size(vdev);
+        rc = vfio_save_data_device_memory(vdev, f);
+    } else {
+        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
+    }
 
     vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
             VFIO_DEVICE_STATE_LOGGING);
-    return 0;
+    return rc;
 }
 
 static int vfio_load_setup(QEMUFile *f, void *opaque)
@@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
         goto error;
     }
 
-    if (vfio_device_data_cap_device_memory(vdev)) {
-        error_report("No support for device memory data cap yet");
+    if (vfio_device_data_cap_device_memory(vdev) &&
+            vfio_device_state_region_setup(vdev,
+              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
+              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
+              "device-state-data-device-memory")) {
         goto error;
     }
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 4b7b1bb..a2cc64b 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -69,6 +69,7 @@ typedef struct VFIOMigration {
     uint32_t data_caps;
     uint32_t device_state;
     uint64_t devconfig_size;
+    uint64_t devmem_size;
     VMChangeStateEntry *vm_state;
 } VFIOMigration;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 133+ messages in thread

             VFIO_DEVICE_STATE_LOGGING);
-    return 0;
+    return rc;
 }
 
 static int vfio_load_setup(QEMUFile *f, void *opaque)
@@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
         goto error;
     }
 
-    if (vfio_device_data_cap_device_memory(vdev)) {
-        error_report("No suppport of data cap device memory Yet");
+    if (vfio_device_data_cap_device_memory(vdev) &&
+            vfio_device_state_region_setup(vdev,
+              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
+              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
+              "device-state-data-device-memory")) {
         goto error;
     }
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 4b7b1bb..a2cc64b 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -69,6 +69,7 @@ typedef struct VFIOMigration {
     uint32_t data_caps;
     uint32_t device_state;
     uint64_t devconfig_size;
+    uint64_t devmem_size;
     VMChangeStateEntry *vm_state;
 } VFIOMigration;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* Re: [PATCH 2/5] vfio/migration: support device of device config capability
  2019-02-19  8:52   ` [Qemu-devel] " Yan Zhao
@ 2019-02-19 11:01     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-19 11:01 UTC (permalink / raw)
  To: Yan Zhao
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck,
	zhi.a.wang, jonathan.davies

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> Device config is the default data that every device should have, so
> device config capability is on by default and need not be set.
> 
> - Currently two types of resources are saved/loaded for a device with
>   device config capability:
>   general PCI config data, and device config data.
>   They are copied as a whole when precopy is stopped.
> 
> Migration setup flow:
> - Setup device state regions, check its device state version and capabilities.
>   Mmap Device Config Region and Dirty Bitmap Region, if available.
> - If device state region setup fails, a migration blocker is
>   registered instead.
> - Added SaveVMHandlers to register device state save/load handlers.
> - Register VM state change handler to set device's running/stop states.
> - On migration startup on source machine, set device's state to
>   VFIO_DEVICE_STATE_LOGGING
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |   1 -
>  hw/vfio/pci.h                 |  25 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  5 files changed, 659 insertions(+), 3 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index 8b3f664..f32ff19 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,6 +1,6 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
> -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 0000000..16d6395
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,633 @@
> +#include "qemu/osdep.h"
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "migration/blocker.h"
> +#include "migration/register.h"
> +#include "qapi/error.h"
> +#include "pci.h"
> +#include "sysemu/kvm.h"
> +#include "exec/ram_addr.h"
> +
> +#define VFIO_SAVE_FLAG_SETUP 0
> +#define VFIO_SAVE_FLAG_PCI 1
> +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> +#define VFIO_SAVE_FLAG_CONTINUE 8
> +
> +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> +        VFIORegion *region, uint32_t subtype, const char *name)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    struct vfio_region_info *info;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> +            subtype, &info);
> +    if (ret) {
> +        error_report("Failed to get info of region %s", name);
> +        return ret;
> +    }
> +
> +    if (vfio_region_setup(OBJECT(vdev), vbasedev,
> +            region, info->index, name)) {
> +        error_report("Failed to setup migration region %s", name);
> +        return ret;
> +    }
> +
> +    if (vfio_region_mmap(region)) {
> +        error_report("Failed to mmap migration region %s", name);
> +    }
> +
> +    return 0;
> +}
> +
> +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> +{
> +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> +}
> +
> +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> +{
> +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> +}
> +
> +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> +{
> +    bool mmaped = true;
> +    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> +            (region->size != region->mmaps[0].size) ||
> +            (region->mmaps[0].mmap == NULL)) {
> +        mmaped = false;
> +    }
> +
> +    return mmaped;
> +}
> +
> +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    uint64_t len;
> +    int sz;
> +
> +    sz = sizeof(len);
> +    if (pread(vbasedev->fd, &len, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.size))
> +            != sz) {
> +        error_report("vfio: Failed to get length of device config");
> +        return -1;
> +    }
> +    if (len > region_config->size) {
> +        error_report("vfio: Error device config length");
> +        return -1;
> +    }
> +    vdev->migration->devconfig_size = len;
> +
> +    return 0;
> +}
> +
> +static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    int sz;
> +
> +    if (size > region_config->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(size);
> +    if (pwrite(vbasedev->fd, &size, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.size))
> +            != sz) {
> +        error_report("vfio: Failed to set length of device config");
> +        return -1;
> +    }
> +    vdev->migration->devconfig_size = size;
> +    return 0;
> +}
> +
> +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> +    uint64_t len = vdev->migration->devconfig_size;
> +
> +    qemu_put_be64(f, len);
> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.action))
> +            != sz) {
> +        error_report("vfio: action failure for device config get buffer");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_config)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }

g_malloc never returns NULL; it aborts on failure to allocate.
So you can either drop the check, or my preference is to use
g_try_malloc for large/unknown areas, and it can return NULL.

> +        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
> +            error_report("vfio: Failed to read device config buffer");
> +            return -1;
> +        }
> +        qemu_put_buffer(f, buf, len);
> +        g_free(buf);
> +    } else {
> +        dest = region_config->mmaps[0].mmap;
> +        qemu_put_buffer(f, dest, len);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> +                            QEMUFile *f, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> +
> +    vfio_set_device_config_size(vdev, len);
> +
> +    if (!vfio_device_state_region_mmaped(region_config)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        qemu_get_buffer(f, buf, len);
> +        if (pwrite(vbasedev->fd, buf, len,
> +                    region_config->fd_offset) != len) {
> +            error_report("vfio: Failed to write device config buffer");
> +            return -1;
> +        }
> +        g_free(buf);
> +    } else {
> +        dest = region_config->mmaps[0].mmap;
> +        qemu_get_buffer(f, dest, len);
> +    }
> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.action))
> +            != sz) {
> +        error_report("vfio: action failure for device config set buffer");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> +        uint64_t start_addr, uint64_t page_nr)
> +{
> +
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_bitmap =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> +    unsigned long bitmap_size =
> +                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
> +    uint32_t sz;
> +
> +    struct {
> +        __u64 start_addr;
> +        __u64 page_nr;
> +    } system_memory;
> +    system_memory.start_addr = start_addr;
> +    system_memory.page_nr = page_nr;
> +    sz = sizeof(system_memory);
> +    if (pwrite(vbasedev->fd, &system_memory, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, system_memory))
> +            != sz) {
> +        error_report("vfio: Failed to set system memory range for dirty pages");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_bitmap)) {
> +        void *bitmap = g_malloc0(bitmap_size);
> +
> +        if (pread(vbasedev->fd, bitmap, bitmap_size,
> +                    region_bitmap->fd_offset) != bitmap_size) {
> +            error_report("vfio: Failed to read dirty bitmap data");
> +            return -1;
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
> +
> +        g_free(bitmap);
> +    } else {
> +        cpu_physical_memory_set_dirty_lebitmap(
> +                    region_bitmap->mmaps[0].mmap,
> +                    start_addr, page_nr);
> +    }
> +    return 0;
> +}
> +
> +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> +        uint64_t start_addr, uint64_t page_nr)
> +{
> +    VFIORegion *region_bitmap =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> +    unsigned long chunk_size = region_bitmap->size;
> +    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
> +                                BITS_PER_LONG;
> +
> +    uint64_t cnt_left;
> +    int rc = 0;
> +
> +    cnt_left = page_nr;
> +
> +    while (cnt_left >= chunk_pg_nr) {
> +        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
> +        if (rc) {
> +            goto exit;
> +        }
> +        cnt_left -= chunk_pg_nr;
> +        /* advance by the number of bytes covered by this chunk */
> +        start_addr += chunk_pg_nr * TARGET_PAGE_SIZE;
> +    }
> +    rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
> +
> +exit:
> +    return rc;
> +}
> +
> +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> +        uint32_t dev_state)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    uint32_t sz = sizeof(dev_state);
> +
> +    if (!vdev->migration) {
> +        return -1;
> +    }
> +
> +    if (pwrite(vbasedev->fd, &dev_state, sz,
> +              region->fd_offset +
> +              offsetof(struct vfio_device_state_ctl, device_state))
> +            != sz) {
> +        error_report("vfio: Failed to set device state %d", dev_state);
> +        return -1;
> +    }
> +    vdev->migration->device_state = dev_state;
> +    return 0;
> +}
> +
> +static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +
> +    uint32_t caps;
> +    uint32_t size = sizeof(caps);
> +
> +    if (pread(vbasedev->fd, &caps, size,
> +                region->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, caps))
> +            != size) {
> +        error_report("%s Failed to read data caps of device states",
> +                vbasedev->name);
> +        return -1;
> +    }
> +    vdev->migration->data_caps = caps;
> +    return 0;
> +}
> +
> +
> +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +
> +    uint32_t version;
> +    uint32_t size = sizeof(version);
> +
> +    if (pread(vbasedev->fd, &version, size,
> +                region->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, version))
> +            != size) {
> +        error_report("%s Failed to read version of device state interfaces",
> +                vbasedev->name);
> +        return -1;
> +    }
> +
> +    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> +        error_report("%s migration version mismatch, right version is %d",
> +                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
> +{
> +    VFIOPCIDevice *vdev = pv;
> +    uint32_t dev_state = vdev->migration->device_state;
> +
> +    if (!running) {
> +        dev_state |= VFIO_DEVICE_STATE_STOP;
> +    } else {
> +        dev_state &= ~VFIO_DEVICE_STATE_STOP;
> +    }
> +
> +    vfio_set_device_state(vdev, dev_state);
> +}
> +
> +static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> +                                   uint64_t max_size,
> +                                   uint64_t *res_precopy_only,
> +                                   uint64_t *res_compatible,
> +                                   uint64_t *res_post_copy_only)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +
> +    if (!vfio_device_data_cap_device_memory(vdev)) {
> +        return;
> +    }
> +
> +    return;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +
> +    if (!vfio_device_data_cap_device_memory(vdev)) {
> +        return 0;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +    bool msi_64bit;
> +
> +    /* restore pci bar configuration */
> +    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +            ctl & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        bar_cfg = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> +    }
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> +
> +    /* restore msi configuration */
> +    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> +
> +    vfio_pci_write_config(&vdev->pdev,
> +            pdev->msi_cap + PCI_MSI_FLAGS,
> +            ctl & ~PCI_MSI_FLAGS_ENABLE, 2);
> +
> +    msi_lo = qemu_get_be32(f);
> +    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> +
> +    if (msi_64bit) {
> +        msi_hi = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                msi_hi, 4);
> +    }
> +    msi_data = qemu_get_be32(f);
> +    vfio_pci_write_config(pdev,
> +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +            msi_data, 2);
> +
> +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +            ctl | PCI_MSI_FLAGS_ENABLE, 2);

It would probably be best to use a VMStateDescription and the macros
for this if possible; I bet you'll want to add more fields in the future
for example.
Also what happens if the data read from the migration stream is bad or
doesn't agree with this device's hardware? How does this fail?
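As an illustration of the suggestion above, a VMStateDescription version of the saved BAR/MSI fields might look roughly like the following. This is a hypothetical, non-buildable sketch (it relies on QEMU's migration/vmstate.h macros, and the staging struct and field names are invented for the example); the `post_load` hook is the natural place to reject stream data that disagrees with the hardware:

```
/* Hypothetical sketch, not actual QEMU code. */
typedef struct VFIOPCIConfigState {
    uint32_t bar_cfg[6];          /* one slot per BAR before PCI_ROM_SLOT */
    uint32_t msi_lo, msi_hi, msi_data;
} VFIOPCIConfigState;

static int vfio_pci_config_post_load(void *opaque, int version_id)
{
    /* Validate incoming values against the real device here and
     * return a negative errno to fail the migration cleanly. */
    return 0;
}

static const VMStateDescription vmstate_vfio_pci_config = {
    .name = "vfio-pci/config",
    .version_id = 1,
    .minimum_version_id = 1,
    .post_load = vfio_pci_config_post_load,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32_ARRAY(bar_cfg, VFIOPCIConfigState, 6),
        VMSTATE_UINT32(msi_lo, VFIOPCIConfigState),
        VMSTATE_UINT32(msi_hi, VFIOPCIConfigState),
        VMSTATE_UINT32(msi_data, VFIOPCIConfigState),
        VMSTATE_END_OF_LIST()
    }
};
```

New fields can then be appended behind a version bump or a `.needed` subsection instead of hand-rolled qemu_put/get calls.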

> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    int flag;
> +    uint64_t len;
> +    int ret = 0;
> +
> +    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> +        return -EINVAL;
> +    }
> +
> +    do {
> +        flag = qemu_get_byte(f);
> +
> +        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
> +        case VFIO_SAVE_FLAG_SETUP:
> +            break;
> +        case VFIO_SAVE_FLAG_PCI:
> +            vfio_pci_load_config(vdev, f);
> +            break;
> +        case VFIO_SAVE_FLAG_DEVCONFIG:
> +            len = qemu_get_be64(f);
> +            vfio_load_data_device_config(vdev, f, len);
> +            break;
> +        default:
> +            ret = -EINVAL;
> +        }
> +    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
> +
> +    return ret;
> +}
> +
> +static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +    bool msi_64bit;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar_cfg);
> +    }
> +
> +    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
> +
> +    msi_lo = pci_default_read_config(pdev,
> +            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +    qemu_put_be32(f, msi_lo);
> +
> +    if (msi_64bit) {
> +        msi_hi = pci_default_read_config(pdev,
> +                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                4);
> +        qemu_put_be32(f, msi_hi);
> +    }
> +
> +    msi_data = pci_default_read_config(pdev,
> +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +            2);
> +    qemu_put_be32(f, msi_data);
> +
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    int rc = 0;
> +
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> +    vfio_pci_save_config(vdev, f);
> +
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
> +    rc += vfio_get_device_config_size(vdev);
> +    rc += vfio_save_data_device_config(vdev, f);
> +
> +    return rc;
> +}
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +
> +    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> +            VFIO_DEVICE_STATE_LOGGING);
> +    return 0;
> +}
> +
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    uint32_t dev_state = vdev->migration->device_state;
> +
> +    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
> +
> +    vfio_set_device_state(vdev, dev_state);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_live_pending = vfio_save_live_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_cleanup = vfio_save_cleanup,
> +    .load_setup = vfio_load_setup,
> +    .load_state = vfio_load_state,
> +};
> +
> +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> +{
> +    int ret;
> +    Error *local_err = NULL;
> +    vdev->migration = g_new0(VFIOMigration, 1);
> +
> +    if (vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
> +              "device-state-ctl")) {
> +        goto error;
> +    }
> +
> +    if (vfio_check_devstate_version(vdev)) {
> +        goto error;
> +    }
> +
> +    if (vfio_get_device_data_caps(vdev)) {
> +        goto error;
> +    }
> +
> +    if (vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
> +              "device-state-data-device-config")) {
> +        goto error;
> +    }
> +
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        error_report("No suppport of data cap device memory Yet");
> +        goto error;
> +    }
> +
> +    if (vfio_device_data_cap_system_memory(vdev) &&
> +            vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
> +              "device-state-data-dirtybitmap")) {
> +        goto error;
> +    }
> +
> +    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> +
> +    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
> +            VFIO_DEVICE_STATE_INTERFACE_VERSION,
> +            &savevm_vfio_handlers,
> +            vdev);
> +
> +    vdev->migration->vm_state =
> +        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
> +
> +    return 0;
> +error:
> +    error_setg(&vdev->migration_blocker,
> +            "VFIO device doesn't support migration");
> +    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        error_free(vdev->migration_blocker);
> +    }
> +
> +    g_free(vdev->migration);
> +    vdev->migration = NULL;
> +
> +    return ret;
> +}
> +
> +void vfio_migration_finalize(VFIOPCIDevice *vdev)
> +{
> +    if (vdev->migration) {
> +        int i;
> +        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
> +        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
> +        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
> +            vfio_region_finalize(&vdev->migration->region[i]);
> +        }
> +        g_free(vdev->migration);
> +        vdev->migration = NULL;
> +    } else if (vdev->migration_blocker) {
> +        migrate_del_blocker(vdev->migration_blocker);
> +        error_free(vdev->migration_blocker);
> +    }
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c0cb1ec..b8e006b 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -37,7 +37,6 @@
>  
>  #define MSIX_CAP_LENGTH 12
>  
> -#define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>  
>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index b1ae4c0..4b7b1bb 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -19,6 +19,7 @@
>  #include "qemu/event_notifier.h"
>  #include "qemu/queue.h"
>  #include "qemu/timer.h"
> +#include "sysemu/sysemu.h"
>  
>  #define PCI_ANY_ID (~0)
>  
> @@ -56,6 +57,21 @@ typedef struct VFIOBAR {
>      QLIST_HEAD(, VFIOQuirk) quirks;
>  } VFIOBAR;
>  
> +enum {
> +    VFIO_DEVSTATE_REGION_CTL = 0,
> +    VFIO_DEVSTATE_REGION_DATA_CONFIG,
> +    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
> +    VFIO_DEVSTATE_REGION_DATA_BITMAP,
> +    VFIO_DEVSTATE_REGION_NUM,
> +};
> +typedef struct VFIOMigration {
> +    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
> +    uint32_t data_caps;
> +    uint32_t device_state;
> +    uint64_t devconfig_size;
> +    VMChangeStateEntry *vm_state;
> +} VFIOMigration;
> +
>  typedef struct VFIOVGARegion {
>      MemoryRegion mem;
>      off_t offset;
> @@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
>      VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
>      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
>      void *igd_opregion;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>      PCIHostDeviceAddress host;
>      EventNotifier err_notifier;
>      EventNotifier req_notifier;
> @@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>  void vfio_display_reset(VFIOPCIDevice *vdev);
>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>  void vfio_display_finalize(VFIOPCIDevice *vdev);
> -
> +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
> +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
> +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> +         uint64_t start_addr, uint64_t page_nr);
> +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
> +void vfio_migration_finalize(VFIOPCIDevice *vdev);
>  #endif /* HW_VFIO_VFIO_PCI_H */
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1b434d0..ed43613 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -32,6 +32,7 @@
>  #endif
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
> +#define TYPE_VFIO_PCI "vfio-pci"
>  
>  enum {
>      VFIO_DEVICE_TYPE_PCI = 0,
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] vfio/migration: support device of device config capability
@ 2019-02-19 11:01     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-19 11:01 UTC (permalink / raw)
  To: Yan Zhao
  Cc: alex.williamson, qemu-devel, intel-gvt-dev, Zhengxiao.zx,
	yi.l.liu, eskultet, ziye.yang, cohuck, shuangtai.tst, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	changpeng.liu, Ken.Xue, kwankhede, kevin.tian, cjia,
	arei.gonglei, kvm

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> Device config is the default data that every device should have, so
> device config capability is on by default and need not be set.
> 
> - Currently two types of resources are saved/loaded for a device with
>   device config capability:
>   general PCI config data, and device config data.
>   They are copied as a whole when precopy is stopped.
> 
> Migration setup flow:
> - Setup device state regions, check its device state version and capabilities.
>   Mmap Device Config Region and Dirty Bitmap Region, if available.
> - If device state region setup fails, a migration blocker is
>   registered instead.
> - Added SaveVMHandlers to register device state save/load handlers.
> - Register VM state change handler to set device's running/stop states.
> - On migration startup on source machine, set device's state to
>   VFIO_DEVICE_STATE_LOGGING
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |   1 -
>  hw/vfio/pci.h                 |  25 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  5 files changed, 659 insertions(+), 3 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index 8b3f664..f32ff19 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,6 +1,6 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
> -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 0000000..16d6395
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,633 @@
> +#include "qemu/osdep.h"
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "migration/blocker.h"
> +#include "migration/register.h"
> +#include "qapi/error.h"
> +#include "pci.h"
> +#include "sysemu/kvm.h"
> +#include "exec/ram_addr.h"
> +
> +#define VFIO_SAVE_FLAG_SETUP 0
> +#define VFIO_SAVE_FLAG_PCI 1
> +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> +#define VFIO_SAVE_FLAG_CONTINUE 8
> +
> +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> +        VFIORegion *region, uint32_t subtype, const char *name)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    struct vfio_region_info *info;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> +            subtype, &info);
> +    if (ret) {
> +        error_report("Failed to get info of region %s", name);
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(OBJECT(vdev), vbasedev,
> +            region, info->index, name);
> +    if (ret) {
> +        error_report("Failed to set up migration region %s", name);
> +        return ret;
> +    }
> +
> +    if (vfio_region_mmap(region)) {
> +        error_report("Failed to mmap migration region %s", name);
> +    }
> +
> +    return 0;
> +}
> +
> +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> +{
> +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> +}
> +
> +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> +{
> +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> +}
> +
> +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> +{
> +    bool mmaped = true;
> +    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> +            (region->size != region->mmaps[0].size) ||
> +            (region->mmaps[0].mmap == NULL)) {
> +        mmaped = false;
> +    }
> +
> +    return mmaped;
> +}
> +
> +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    uint64_t len;
> +    int sz;
> +
> +    sz = sizeof(len);
> +    if (pread(vbasedev->fd, &len, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.size))
> +            != sz) {
> +        error_report("vfio: Failed to get length of device config");
> +        return -1;
> +    }
> +    if (len > region_config->size) {
> +        error_report("vfio: Invalid device config length");
> +        return -1;
> +    }
> +    vdev->migration->devconfig_size = len;
> +
> +    return 0;
> +}
> +
> +static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    int sz;
> +
> +    if (size > region_config->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(size);
> +    if (pwrite(vbasedev->fd, &size, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.size))
> +            != sz) {
> +        error_report("vfio: Failed to set length of device config");
> +        return -1;
> +    }
> +    vdev->migration->devconfig_size = size;
> +    return 0;
> +}
> +
> +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> +    uint64_t len = vdev->migration->devconfig_size;
> +
> +    qemu_put_be64(f, len);
> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.action))
> +            != sz) {
> +        error_report("vfio: action failure for device config get buffer");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_config)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }

g_malloc never returns NULL; it aborts on failure to allocate.
So you can either drop the check or, my preference, use g_try_malloc
for large/unknown-size allocations, which can return NULL.
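For illustration, a self-contained sketch of that pattern; plain malloc() stands in here for g_try_malloc(), which fails the same way (returning NULL) instead of aborting:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: fallible allocation for a large, device-controlled size.
 * malloc() is a stand-in for g_try_malloc(); unlike g_malloc(), both
 * can return NULL, so the caller can report the error and back out. */
static int copy_with_fallible_alloc(const unsigned char *src, size_t len,
                                    unsigned char **out)
{
    unsigned char *buf = malloc(len);   /* g_try_malloc(len) in QEMU */

    if (buf == NULL) {
        return -1;                      /* report failure, don't abort */
    }
    memcpy(buf, src, len);
    *out = buf;                         /* caller owns and frees buf */
    return 0;
}
```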

> +        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
> +            error_report("vfio: Failed to read device config buffer");
> +            g_free(buf);
> +            return -1;
> +        }
> +        qemu_put_buffer(f, buf, len);
> +        g_free(buf);
> +    } else {
> +        dest = region_config->mmaps[0].mmap;
> +        qemu_put_buffer(f, dest, len);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> +                            QEMUFile *f, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> +
> +    vfio_set_device_config_size(vdev, len);
> +
> +    if (!vfio_device_state_region_mmaped(region_config)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        qemu_get_buffer(f, buf, len);
> +        if (pwrite(vbasedev->fd, buf, len,
> +                    region_config->fd_offset) != len) {
> +            error_report("vfio: Failed to write device config buffer");
> +            g_free(buf);
> +            return -1;
> +        }
> +        g_free(buf);
> +    } else {
> +        dest = region_config->mmaps[0].mmap;
> +        qemu_get_buffer(f, dest, len);
> +    }
> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.action))
> +            != sz) {
> +        error_report("vfio: action failure for device config set buffer");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> +        uint64_t start_addr, uint64_t page_nr)
> +{
> +
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_bitmap =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> +    unsigned long bitmap_size =
> +                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
> +    uint32_t sz;
> +
> +    struct {
> +        __u64 start_addr;
> +        __u64 page_nr;
> +    } system_memory;
> +    system_memory.start_addr = start_addr;
> +    system_memory.page_nr = page_nr;
> +    sz = sizeof(system_memory);
> +    if (pwrite(vbasedev->fd, &system_memory, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, system_memory))
> +            != sz) {
> +        error_report("vfio: Failed to set system memory range for dirty pages");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_bitmap)) {
> +        void *bitmap = g_malloc0(bitmap_size);
> +
> +        if (pread(vbasedev->fd, bitmap, bitmap_size,
> +                    region_bitmap->fd_offset) != bitmap_size) {
> +            error_report("vfio: Failed to read dirty bitmap data");
> +            g_free(bitmap);
> +            return -1;
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
> +
> +        g_free(bitmap);
> +    } else {
> +        cpu_physical_memory_set_dirty_lebitmap(
> +                    region_bitmap->mmaps[0].mmap,
> +                    start_addr, page_nr);
> +    }
> +    return 0;
> +}
> +
> +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> +        uint64_t start_addr, uint64_t page_nr)
> +{
> +    VFIORegion *region_bitmap =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> +    unsigned long chunk_size = region_bitmap->size;
> +    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
> +                                BITS_PER_LONG;
> +
> +    uint64_t cnt_left;
> +    int rc = 0;
> +
> +    cnt_left = page_nr;
> +
> +    while (cnt_left >= chunk_pg_nr) {
> +        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
> +        if (rc) {
> +            goto exit;
> +        }
> +        cnt_left -= chunk_pg_nr;
> +        start_addr += chunk_pg_nr * TARGET_PAGE_SIZE;
> +    }
> +    rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
> +
> +exit:
> +    return rc;
> +}
> +
> +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> +        uint32_t dev_state)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region;
> +    uint32_t sz = sizeof(dev_state);
> +
> +    if (!vdev->migration) {
> +        return -1;
> +    }
> +    region = &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +
> +    if (pwrite(vbasedev->fd, &dev_state, sz,
> +              region->fd_offset +
> +              offsetof(struct vfio_device_state_ctl, device_state))
> +            != sz) {
> +        error_report("vfio: Failed to set device state %d", dev_state);
> +        return -1;
> +    }
> +    vdev->migration->device_state = dev_state;
> +    return 0;
> +}
> +
> +static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +
> +    uint32_t caps;
> +    uint32_t size = sizeof(caps);
> +
> +    if (pread(vbasedev->fd, &caps, size,
> +                region->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, caps))
> +            != size) {
> +        error_report("%s Failed to read data caps of device states",
> +                vbasedev->name);
> +        return -1;
> +    }
> +    vdev->migration->data_caps = caps;
> +    return 0;
> +}
> +
> +
> +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +
> +    uint32_t version;
> +    uint32_t size = sizeof(version);
> +
> +    if (pread(vbasedev->fd, &version, size,
> +                region->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, version))
> +            != size) {
> +        error_report("%s Failed to read version of device state interfaces",
> +                vbasedev->name);
> +        return -1;
> +    }
> +
> +    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> +        error_report("%s migration version mismatch, expected version %d",
> +                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
> +{
> +    VFIOPCIDevice *vdev = pv;
> +    uint32_t dev_state = vdev->migration->device_state;
> +
> +    if (!running) {
> +        dev_state |= VFIO_DEVICE_STATE_STOP;
> +    } else {
> +        dev_state &= ~VFIO_DEVICE_STATE_STOP;
> +    }
> +
> +    vfio_set_device_state(vdev, dev_state);
> +}
> +
> +static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> +                                   uint64_t max_size,
> +                                   uint64_t *res_precopy_only,
> +                                   uint64_t *res_compatible,
> +                                   uint64_t *res_post_copy_only)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +
> +    if (!vfio_device_data_cap_device_memory(vdev)) {
> +        return;
> +    }
> +
> +    return;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +
> +    if (!vfio_device_data_cap_device_memory(vdev)) {
> +        return 0;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +    bool msi_64bit;
> +
> +    /* restore PCI BAR configuration */
> +    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +            ctl & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        bar_cfg = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> +    }
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> +
> +    /* restore msi configuration */
> +    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> +
> +    vfio_pci_write_config(&vdev->pdev,
> +            pdev->msi_cap + PCI_MSI_FLAGS,
> +            ctl & ~PCI_MSI_FLAGS_ENABLE, 2);
> +
> +    msi_lo = qemu_get_be32(f);
> +    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> +
> +    if (msi_64bit) {
> +        msi_hi = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                msi_hi, 4);
> +    }
> +    msi_data = qemu_get_be32(f);
> +    vfio_pci_write_config(pdev,
> +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +            msi_data, 2);
> +
> +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +            ctl | PCI_MSI_FLAGS_ENABLE, 2);

It would probably be best to use a VMStateDescription and the macros
for this if possible; I bet you'll want to add more fields in the
future, for example.
Also, what happens if the data read from the migration stream is bad or
doesn't agree with this device's hardware? How does this fail?
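Roughly what I have in mind, as an untested sketch: the config fields
would move into a saved-state struct (VFIOPCIConfigState and the field
names below are invented for illustration), and versioning plus stream
sanity checking then come largely for free from the VMState machinery:

```c
/* Untested sketch; VFIOPCIConfigState and its fields are invented here. */
static const VMStateDescription vmstate_vfio_pci_config = {
    .name = "vfio-pci/config",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32_ARRAY(bar_cfg, VFIOPCIConfigState, PCI_ROM_SLOT),
        VMSTATE_UINT32(msi_lo, VFIOPCIConfigState),
        VMSTATE_UINT32(msi_hi, VFIOPCIConfigState),
        VMSTATE_UINT32(msi_data, VFIOPCIConfigState),
        VMSTATE_END_OF_LIST()
    }
};
```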

> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    int flag;
> +    uint64_t len;
> +    int ret = 0;
> +
> +    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> +        return -EINVAL;
> +    }
> +
> +    do {
> +        flag = qemu_get_byte(f);
> +
> +        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
> +        case VFIO_SAVE_FLAG_SETUP:
> +            break;
> +        case VFIO_SAVE_FLAG_PCI:
> +            vfio_pci_load_config(vdev, f);
> +            break;
> +        case VFIO_SAVE_FLAG_DEVCONFIG:
> +            len = qemu_get_be64(f);
> +            vfio_load_data_device_config(vdev, f, len);
> +            break;
> +        default:
> +            ret = -EINVAL;
> +        }
> +    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
> +
> +    return ret;
> +}
> +
> +static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +    bool msi_64bit;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar_cfg);
> +    }
> +
> +    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
> +
> +    msi_lo = pci_default_read_config(pdev,
> +            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +    qemu_put_be32(f, msi_lo);
> +
> +    if (msi_64bit) {
> +        msi_hi = pci_default_read_config(pdev,
> +                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                4);
> +        qemu_put_be32(f, msi_hi);
> +    }
> +
> +    msi_data = pci_default_read_config(pdev,
> +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +            2);
> +    qemu_put_be32(f, msi_data);
> +
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    int rc = 0;
> +
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> +    vfio_pci_save_config(vdev, f);
> +
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
> +    rc += vfio_get_device_config_size(vdev);
> +    rc += vfio_save_data_device_config(vdev, f);
> +
> +    return rc;
> +}
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +
> +    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> +            VFIO_DEVICE_STATE_LOGGING);
> +    return 0;
> +}
> +
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    uint32_t dev_state = vdev->migration->device_state;
> +
> +    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
> +
> +    vfio_set_device_state(vdev, dev_state);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_live_pending = vfio_save_live_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_cleanup = vfio_save_cleanup,
> +    .load_setup = vfio_load_setup,
> +    .load_state = vfio_load_state,
> +};
> +
> +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> +{
> +    int ret;
> +    Error *local_err = NULL;
> +    vdev->migration = g_new0(VFIOMigration, 1);
> +
> +    if (vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
> +              "device-state-ctl")) {
> +        goto error;
> +    }
> +
> +    if (vfio_check_devstate_version(vdev)) {
> +        goto error;
> +    }
> +
> +    if (vfio_get_device_data_caps(vdev)) {
> +        goto error;
> +    }
> +
> +    if (vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
> +              "device-state-data-device-config")) {
> +        goto error;
> +    }
> +
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        error_report("No support for data cap device memory yet");
> +        goto error;
> +    }
> +
> +    if (vfio_device_data_cap_system_memory(vdev) &&
> +            vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
> +              "device-state-data-dirtybitmap")) {
> +        goto error;
> +    }
> +
> +    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> +
> +    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
> +            VFIO_DEVICE_STATE_INTERFACE_VERSION,
> +            &savevm_vfio_handlers,
> +            vdev);
> +
> +    vdev->migration->vm_state =
> +        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
> +
> +    return 0;
> +error:
> +    error_setg(&vdev->migration_blocker,
> +            "VFIO device doesn't support migration");
> +    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        error_free(vdev->migration_blocker);
> +    }
> +
> +    g_free(vdev->migration);
> +    vdev->migration = NULL;
> +
> +    return ret;
> +}
> +
> +void vfio_migration_finalize(VFIOPCIDevice *vdev)
> +{
> +    if (vdev->migration) {
> +        int i;
> +        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
> +        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
> +        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
> +            vfio_region_finalize(&vdev->migration->region[i]);
> +        }
> +        g_free(vdev->migration);
> +        vdev->migration = NULL;
> +    } else if (vdev->migration_blocker) {
> +        migrate_del_blocker(vdev->migration_blocker);
> +        error_free(vdev->migration_blocker);
> +    }
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c0cb1ec..b8e006b 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -37,7 +37,6 @@
>  
>  #define MSIX_CAP_LENGTH 12
>  
> -#define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>  
>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index b1ae4c0..4b7b1bb 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -19,6 +19,7 @@
>  #include "qemu/event_notifier.h"
>  #include "qemu/queue.h"
>  #include "qemu/timer.h"
> +#include "sysemu/sysemu.h"
>  
>  #define PCI_ANY_ID (~0)
>  
> @@ -56,6 +57,21 @@ typedef struct VFIOBAR {
>      QLIST_HEAD(, VFIOQuirk) quirks;
>  } VFIOBAR;
>  
> +enum {
> +    VFIO_DEVSTATE_REGION_CTL = 0,
> +    VFIO_DEVSTATE_REGION_DATA_CONFIG,
> +    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
> +    VFIO_DEVSTATE_REGION_DATA_BITMAP,
> +    VFIO_DEVSTATE_REGION_NUM,
> +};
> +typedef struct VFIOMigration {
> +    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
> +    uint32_t data_caps;
> +    uint32_t device_state;
> +    uint64_t devconfig_size;
> +    VMChangeStateEntry *vm_state;
> +} VFIOMigration;
> +
>  typedef struct VFIOVGARegion {
>      MemoryRegion mem;
>      off_t offset;
> @@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
>      VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
>      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
>      void *igd_opregion;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>      PCIHostDeviceAddress host;
>      EventNotifier err_notifier;
>      EventNotifier req_notifier;
> @@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>  void vfio_display_reset(VFIOPCIDevice *vdev);
>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>  void vfio_display_finalize(VFIOPCIDevice *vdev);
> -
> +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
> +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
> +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> +         uint64_t start_addr, uint64_t page_nr);
> +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
> +void vfio_migration_finalize(VFIOPCIDevice *vdev);
>  #endif /* HW_VFIO_VFIO_PCI_H */
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1b434d0..ed43613 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -32,6 +32,7 @@
>  #endif
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
> +#define TYPE_VFIO_PCI "vfio-pci"
>  
>  enum {
>      VFIO_DEVICE_TYPE_PCI = 0,
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 5/5] vfio/migration: support device memory capability
  2019-02-19  8:53   ` [Qemu-devel] " Yan Zhao
@ 2019-02-19 11:25     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-19 11:25 UTC (permalink / raw)
  To: Yan Zhao
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck,
	zhi.a.wang, jonathan.davies

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> If a device has device memory capability, save/load data from device memory
> in pre-copy and stop-and-copy phases.
> 
> LOGGING state is set for device memory for dirty page logging:
> in LOGGING state, getting device memory returns a snapshot of the whole
> device memory; outside LOGGING state, it returns only the data dirtied
> since the last get operation.
> 
> Usually device memory is very big, so QEMU needs to chunk it into
> several pieces, each the size of the device memory region.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio/pci.h       |   1 +
>  2 files changed, 231 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 16d6395..f1e9309 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
>      return 0;
>  }
>  
> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    uint64_t len;
> +    int sz;
> +
> +    sz = sizeof(len);
> +    if (pread(vbasedev->fd, &len, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to get length of device memory");
> +        return -1;
> +    }
> +    vdev->migration->devmem_size = len;
> +    return 0;
> +}
> +
> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    int sz;
> +
> +    sz = sizeof(size);
> +    if (pwrite(vbasedev->fd, &size, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to set length of device memory");
> +        return -1;
> +    }
> +    vdev->migration->devmem_size = size;
> +    return 0;
> +}
> +
> +static
> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                    uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> +
> +    if (len > region_devmem->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer pos");
> +        return -1;
> +    }
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer action");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> +            error_report("vfio: Failed to read device memory buffer");

That error path forgets to g_free(buf).
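A single-exit shape along these lines avoids that; this is a
self-contained sketch (simulate_read_error stands in for the failed
pread()) rather than the actual patch code:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of single-exit cleanup: every failure after the allocation
 * funnels through the same free() (g_free() in the patch), so no error
 * path can leak buf. */
static int save_chunk_sketch(size_t len, int simulate_read_error)
{
    int ret = -1;
    unsigned char *buf = malloc(len);

    if (buf == NULL) {
        return -1;                  /* nothing allocated yet */
    }
    if (simulate_read_error) {      /* stands in for a failed pread() */
        goto out;                   /* buf is still freed below */
    }
    ret = 0;                        /* success path */
out:
    free(buf);                      /* g_free(buf) in the patch */
    return ret;
}
```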

> +            return -1;
> +        }
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, buf, len);
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, dest, len);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    uint64_t total_len = vdev->migration->devmem_size;
> +    uint64_t pos = 0;
> +
> +    qemu_put_be64(f, total_len);
> +    while (pos < total_len) {
> +        uint64_t len = region_devmem->size;
> +
> +        if (pos + len >= total_len) {
> +            len = total_len - pos;
> +        }
> +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> +            return -1;
> +        }
> +        pos += len;
> +    }
> +
> +    return 0;
> +}
> +
> +static
> +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> +
> +    if (len > region_devmem->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set device memory buffer pos");
> +        return -1;
> +    }
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        qemu_get_buffer(f, buf, len);
> +        if (pwrite(vbasedev->fd, buf, len,
> +                    region_devmem->fd_offset) != len) {
> +            error_report("vfio: Failed to load device memory buffer");

Again, failed to free buf

> +            return -1;
> +        }
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_get_buffer(f, dest, len);
> +    }

You might want to use qemu_file_get_error(f)  before writing the data
to the device, to check for the case of a read error on the migration
stream that happened somewhere in the pevious qemu_get's

> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set load device memory buffer action");
> +        return -1;
> +    }
> +
> +    return 0;
> +
> +}
> +
> +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> +                        QEMUFile *f, uint64_t total_len)
> +{
> +    uint64_t pos = 0, len = 0;
> +
> +    vfio_set_device_memory_size(vdev, total_len);
> +
> +    while (pos + len < total_len) {
> +        len = qemu_get_be64(f);
> +        pos = qemu_get_be64(f);

Please check len/pos - always assume that the migration stream could
be (maliciously or accidentally) corrupt.
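For instance, a validity check along these lines before using the chunk
header (a sketch; devmem_chunk_valid is an invented helper name):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Validate a (pos, len) chunk header read from the migration stream
 * before trusting it; region_size and total_len are known locally on
 * the destination, the header values are attacker-controlled. */
static bool devmem_chunk_valid(uint64_t pos, uint64_t len,
                               uint64_t region_size, uint64_t total_len)
{
    if (len == 0 || len > region_size) {
        return false;   /* chunk must fit in the device memory region */
    }
    if (pos >= total_len || len > total_len - pos) {
        return false;   /* chunk must lie inside the declared total;
                           written this way to avoid pos + len overflow */
    }
    return true;
}
```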

> +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> +    }
> +
> +    return 0;
> +}
> +
> +
>  static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
>          uint64_t start_addr, uint64_t page_nr)
>  {
> @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
>          return;
>      }
>  
> +    /* get dirty data size of device memory */
> +    vfio_get_device_memory_size(vdev);
> +
> +    *res_precopy_only += vdev->migration->devmem_size;
>      return;
>  }
>  
> @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>          return 0;
>      }
>  
> -    return 0;
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +    /* get dirty data of device memory */
> +    return vfio_save_data_device_memory(vdev, f);
>  }
>  
>  static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>              len = qemu_get_be64(f);
>              vfio_load_data_device_config(vdev, f, len);
>              break;
> +        case VFIO_SAVE_FLAG_DEVMEMORY:
> +            len = qemu_get_be64(f);
> +            vfio_load_data_device_memory(vdev, f, len);
> +            break;
>          default:
>              ret = -EINVAL;
>          }
> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      VFIOPCIDevice *vdev = opaque;
>      int rc = 0;
>  
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> +        /* get dirty data of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    }
> +
>      qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
>      vfio_pci_save_config(vdev, f);
>  
> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>  {
> +    int rc = 0;
>      VFIOPCIDevice *vdev = opaque;
> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +        /* get whole snapshot of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    } else {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +    }
>  
>      vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
>              VFIO_DEVICE_STATE_LOGGING);
> -    return 0;
> +    return rc;
>  }
>  
>  static int vfio_load_setup(QEMUFile *f, void *opaque)
> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
>          goto error;
>      }
>  
> -    if (vfio_device_data_cap_device_memory(vdev)) {
> -        error_report("No suppport of data cap device memory Yet");
> +    if (vfio_device_data_cap_device_memory(vdev) &&
> +            vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> +              "device-state-data-device-memory")) {
>          goto error;
>      }
>  
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 4b7b1bb..a2cc64b 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
>      uint32_t data_caps;
>      uint32_t device_state;
>      uint64_t devconfig_size;
> +    uint64_t devmem_size;
>      VMChangeStateEntry *vm_state;
>  } VFIOMigration;
>  
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 5/5] vfio/migration: support device memory capability
@ 2019-02-19 11:25     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-19 11:25 UTC (permalink / raw)
  To: Yan Zhao
  Cc: alex.williamson, qemu-devel, intel-gvt-dev, Zhengxiao.zx,
	yi.l.liu, eskultet, ziye.yang, cohuck, shuangtai.tst, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	changpeng.liu, Ken.Xue, kwankhede, kevin.tian, cjia,
	arei.gonglei, kvm

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> If a device has device memory capability, save/load data from device memory
> in pre-copy and stop-and-copy phases.
> 
> LOGGING state is set for device memory for dirty page logging:
> in LOGGING state, get device memory returns whole device memory snapshot;
> outside LOGGING state, get device memory returns dirty data since last get
> operation.
> 
> Usually, device memory is very big, qemu needs to chunk it into several
> pieces each with size of device memory region.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio/pci.h       |   1 +
>  2 files changed, 231 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 16d6395..f1e9309 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
>      return 0;
>  }
>  
> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    uint64_t len;
> +    int sz;
> +
> +    sz = sizeof(len);
> +    if (pread(vbasedev->fd, &len, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to get length of device memory");
> +        return -1;
> +    }
> +    vdev->migration->devmem_size = len;
> +    return 0;
> +}
> +
> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    int sz;
> +
> +    sz = sizeof(size);
> +    if (pwrite(vbasedev->fd, &size, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to set length of device comemory");
> +        return -1;
> +    }
> +    vdev->migration->devmem_size = size;
> +    return 0;
> +}
> +
> +static
> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                    uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> +
> +    if (len > region_devmem->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer pos");
> +        return -1;
> +    }
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer action");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> +            error_report("vfio: error load device memory buffer");

That forgets to g_free(buf)
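
A minimal, self-contained sketch of the fix on that error path (a toy model, not the patch itself; read_fn is a hypothetical stand-in for the pread() on vbasedev->fd):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Minimal model of the quoted save path: on a short read, the buffer
 * must be freed before bailing out; that free is what the quoted
 * code misses. */
static int save_chunk(size_t len, ssize_t (*read_fn)(void *dst, size_t n))
{
    unsigned char *buf = malloc(len);

    if (buf == NULL) {
        return -1;
    }
    if (read_fn(buf, len) != (ssize_t)len) {
        fprintf(stderr, "vfio: error reading device memory buffer\n");
        free(buf);              /* the missing cleanup */
        return -1;
    }
    /* ... qemu_put_be64()/qemu_put_buffer() would go here ... */
    free(buf);
    return 0;
}
```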

> +            return -1;
> +        }
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, buf, len);
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, dest, len);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    uint64_t total_len = vdev->migration->devmem_size;
> +    uint64_t pos = 0;
> +
> +    qemu_put_be64(f, total_len);
> +    while (pos < total_len) {
> +        uint64_t len = region_devmem->size;
> +
> +        if (pos + len >= total_len) {
> +            len = total_len - pos;
> +        }
> +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> +            return -1;
> +        }
> +    }
> +
> +    return 0;
> +}
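
Note: as quoted, the loop above never advances pos, so it would save the first chunk forever. A self-contained sketch of the intended chunking (save_fn is a hypothetical stand-in for vfio_save_data_device_memory_chunk()):

```c
#include <assert.h>
#include <stdint.h>

/* Walk total_len in region-sized chunks, clamping the final chunk,
 * and advance pos after each successful save. */
static int save_all_chunks(uint64_t total_len, uint64_t region_size,
                           int (*save_fn)(uint64_t pos, uint64_t len))
{
    uint64_t pos = 0;

    while (pos < total_len) {
        uint64_t len = region_size;

        if (total_len - pos < len) {
            len = total_len - pos;
        }
        if (save_fn(pos, len)) {
            return -1;
        }
        pos += len;             /* the advance missing from the quoted loop */
    }
    return 0;
}
```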
> +
> +static
> +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> +
> +    if (len > region_devmem->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set device memory buffer pos");
> +        return -1;
> +    }
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        qemu_get_buffer(f, buf, len);
> +        if (pwrite(vbasedev->fd, buf, len,
> +                    region_devmem->fd_offset) != len) {
> +            error_report("vfio: Failed to load devie memory buffer");

Again, this error path fails to free buf

> +            return -1;
> +        }
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_get_buffer(f, dest, len);
> +    }

You might want to use qemu_file_get_error(f) before writing the data
to the device, to check for the case of a read error on the migration
stream that happened somewhere in the previous qemu_get's
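
A toy model of that suggestion (Stream and stream_get() are hypothetical stand-ins for QEMUFile and qemu_get_buffer(); the sticky error flag plays the role of qemu_file_get_error()):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A stream with a sticky error flag; the chunk is only committed to
 * device memory once the preceding reads are known good. */
typedef struct {
    const unsigned char *data;
    size_t size;
    size_t off;
    int err;
} Stream;

static void stream_get(Stream *s, unsigned char *dst, size_t len)
{
    if (s->err || len > s->size - s->off) {
        s->err = -1;            /* short read marks the stream bad */
        memset(dst, 0, len);
        return;
    }
    memcpy(dst, s->data + s->off, len);
    s->off += len;
}

static int load_chunk(Stream *s, unsigned char *dev, size_t len)
{
    unsigned char buf[64];

    if (len > sizeof(buf)) {
        return -1;
    }
    stream_get(s, buf, len);
    if (s->err) {               /* check before touching the device */
        return -1;
    }
    memcpy(dev, buf, len);
    return 0;
}
```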

> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set load device memory buffer action");
> +        return -1;
> +    }
> +
> +    return 0;
> +
> +}
> +
> +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> +                        QEMUFile *f, uint64_t total_len)
> +{
> +    uint64_t pos = 0, len = 0;
> +
> +    vfio_set_device_memory_size(vdev, total_len);
> +
> +    while (pos + len < total_len) {
> +        len = qemu_get_be64(f);
> +        pos = qemu_get_be64(f);

Please check len/pos - always assume that the migration stream could
be (maliciously or accidentally) corrupt.
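
One overflow-safe way to do that check (a sketch; region_size and total_len are the trusted local values, while len and pos come from the untrusted stream):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Validate a chunk descriptor read from the migration stream before
 * using it: the chunk must fit in the device memory region and must
 * lie entirely within total_len, without pos + len overflowing. */
static bool chunk_is_valid(uint64_t pos, uint64_t len,
                           uint64_t region_size, uint64_t total_len)
{
    if (len == 0 || len > region_size) {
        return false;
    }
    if (pos > total_len || len > total_len - pos) {
        return false;
    }
    return true;
}
```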

> +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> +    }
> +
> +    return 0;
> +}
> +
> +
>  static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
>          uint64_t start_addr, uint64_t page_nr)
>  {
> @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
>          return;
>      }
>  
> +    /* get dirty data size of device memory */
> +    vfio_get_device_memory_size(vdev);
> +
> +    *res_precopy_only += vdev->migration->devmem_size;
>      return;
>  }
>  
> @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>          return 0;
>      }
>  
> -    return 0;
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +    /* get dirty data of device memory */
> +    return vfio_save_data_device_memory(vdev, f);
>  }
>  
>  static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>              len = qemu_get_be64(f);
>              vfio_load_data_device_config(vdev, f, len);
>              break;
> +        case VFIO_SAVE_FLAG_DEVMEMORY:
> +            len = qemu_get_be64(f);
> +            vfio_load_data_device_memory(vdev, f, len);
> +            break;
>          default:
>              ret = -EINVAL;
>          }
> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      VFIOPCIDevice *vdev = opaque;
>      int rc = 0;
>  
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> +        /* get dirty data of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    }
> +
>      qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
>      vfio_pci_save_config(vdev, f);
>  
> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
>  {
> +    int rc = 0;
>      VFIOPCIDevice *vdev = opaque;
> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +        /* get whole snapshot of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    } else {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +    }
>  
>      vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
>              VFIO_DEVICE_STATE_LOGGING);
> -    return 0;
> +    return rc;
>  }
>  
>  static int vfio_load_setup(QEMUFile *f, void *opaque)
> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
>          goto error;
>      }
>  
> -    if (vfio_device_data_cap_device_memory(vdev)) {
> -        error_report("No suppport of data cap device memory Yet");
> +    if (vfio_device_data_cap_device_memory(vdev) &&
> +            vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> +              "device-state-data-device-memory")) {
>          goto error;
>      }
>  
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 4b7b1bb..a2cc64b 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
>      uint32_t data_caps;
>      uint32_t device_state;
>      uint64_t devconfig_size;
> +    uint64_t devmem_size;
>      VMChangeStateEntry *vm_state;
>  } VFIOMigration;
>  
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-19  8:50 ` [Qemu-devel] " Yan Zhao
@ 2019-02-19 11:32   ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-19 11:32 UTC (permalink / raw)
  To: Yan Zhao
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck,
	zhi.a.wang, jonathan.davies

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.

Hi,
  I've sent minor comments on the later patches; but here are some
general comments:

  a) Never trust the incoming migrations stream - it might be corrupt,
    so check when you can.
  b) How do we detect if we're migrating from/to the wrong device or
version of device?  Or say to a device with older firmware or perhaps
a device that has less device memory ?
  c) Consider using the trace_ mechanism - it's really useful to
add to loops writing/reading data so that you can see when it fails.
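
For (c), QEMU trace events are declared in a per-directory trace-events file and emitted via generated trace_* helpers; a hypothetical sketch for the device memory chunk loops (event names are illustrative, not from the patch):

```
# hw/vfio/trace-events (hypothetical entries)
vfio_save_data_device_memory_chunk(uint64_t pos, uint64_t len) "pos 0x%"PRIx64" len 0x%"PRIx64
vfio_load_data_device_memory_chunk(uint64_t pos, uint64_t len) "pos 0x%"PRIx64" len 0x%"PRIx64
```

Each save/load loop would then call, e.g., trace_vfio_save_data_device_memory_chunk(pos, len) per chunk, so a stuck or failing transfer shows up in the trace log.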

Dave

(P.S. You have a few typos; grep your code for 'devcie', 'devie' and
'migrtion'.)

> Device Data
> -----------
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
>         Every device is supposed to possess device config data.
>     	Usually device config's size is small (no big than 10M), and it
>         needs to be loaded in certain strict order.
>         Therefore, device config only needs to be saved/loaded in
>         stop-and-copy phase.
>         The data of device config is held in device config region.
>         Size of device config data is smaller than or equal to that of
>         device config region.
> 
> Device Memory: device's internal memory, standalone and outside system
>         memory. It is usually very big.
>         This kind of data needs to be saved / loaded in pre-copy and
>         stop-and-copy phase.
>         The data of device memory is held in device memory region.
>         Size of devie memory is usually larger than that of device
>         memory region. qemu needs to save/load it in chunks of size of
>         device memory region.
>         Not all device has device memory. Like IGD only uses system memory.
> 
> System memory dirty pages: If a device produces dirty pages in system
>         memory, it is able to get dirty bitmap for certain range of system
>         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
>         phase in .log_sync callback. By setting dirty bitmap in .log_sync
>         callback, dirty pages in system memory will be save/loaded by ram's
>         live migration code.
>         The dirty bitmap of system memory is held in dirty bitmap region.
>         If system memory range is larger than that dirty bitmap region can
>         hold, qemu will cut it into several chunks and get dirty bitmap in
>         succession.
> 
> 
> Device State Regions
> --------------------
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
>         Get access via read/write system call.
>         Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.
>         device config region: mandatory, holding data of device config
>         device memory region: optional, holding data of device memory
>         dirty bitmap region: optional, holding bitmap of system memory
>                             dirty pages
> 
> (The reason why four seperate regions are defined is that the unit of mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped).
> 
> 
> kernel device state interface [1]
> --------------------------------------
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> #define VFIO_DEVICE_STATE_RUNNING 0 
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2
> 
> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
> 	__u32 version;		  /* ro */
> 	__u32 device_state;       /* VFIO device state, wo */
> 	__u32 caps;		 /* ro */
>         struct {
> 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;    /*rw*/
> 	} device_config;
> 	struct {
> 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;     /* rw */  
>                 __u64 pos; /*the offset in total buffer of device memory*/
> 	} device_memory;
> 	struct {
> 		__u64 start_addr; /* wo */
> 		__u64 page_nr;   /* wo */
> 	} system_memory;
> };
> 
> Devcie States
> ------------- 
> After migration is initialzed, it will set device state via writing to
> device_state field of control region.
> 
> Four states are defined for a VFIO device:
>         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> 
> RUNNING: In this state, a VFIO device is in active state ready to receive
>         commands from device driver.
>         It is the default state that a VFIO device enters initially.
> 
> STOP:  In this state, a VFIO device is deactivated to interact with
>        device driver.
> 
> LOGGING: a special state that it CANNOT exist independently. It must be
>        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
>        STOP & LOGGING).
>        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
>        driver can start dirty data logging for device memory and system
>        memory.
>        LOGGING only impacts device/system memory. They return whole
>        snapshot outside LOGGING and dirty data since last get operation
>        inside LOGGING.
>        Device config should be always accessible and return whole config
>        snapshot regardless of LOGGING state.
>        
> Note:
> The reason why RUNNING is the default state is that device's active state
> must not depend on device state interface.
> It is possible that region vfio_device_state_ctl fails to get registered.
> In that condition, a device needs be in active state by default. 
> 
> Get Version & Get Caps
> ----------------------
> On migration init phase, qemu will probe the existence of device state
> regions of vendor driver, then get version of the device state interface
> from the r/w control region.
> 
> Then it will probe VFIO device's data capability by reading caps field of
> control region.
>         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
>         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
>         device memory in pre-copy and stop-and-copy phase. The data of
>         device memory is held in device memory region.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query of dirty pages
>         produced by VFIO device during pre-copy and stop-and-copy phase.
>         The dirty bitmap of system memory is held in dirty bitmap region.
> 
> If failing to find two mandatory regions and optional data regions
> corresponding to data caps or version mismatching, it will setup a
> migration blocker and disable live migration for VFIO device.
> 
> 
> Flows to call device state interface for VFIO live migration
> ------------------------------------------------------------
> 
> Live migration save path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_SAVE_SETUP
>  |
>  .save_setup callback -->
>  get device memory size (whole snapshot size)
>  get device memory buffer (whole snapshot data)
>  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
>  |
> MIGRATION_STATUS_ACTIVE
>  |
>  .save_live_pending callback --> get device memory size (dirty data)
>  .save_live_iteration callback --> get device memory buffer (dirty data)
>  .log_sync callback --> get system memory dirty bitmap
>  |
> (vcpu stops) --> set device state -->
>  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
>  |
> .save_live_complete_precopy callback -->
>  get device memory size (dirty data)
>  get device memory buffer (dirty data)
>  get device config size (whole snapshot size)
>  get device config buffer (whole snapshot data)
>  |
> .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> MIGRATION_STATUS_COMPLETED
> 
> MIGRATION_STATUS_CANCELLED or
> MIGRATION_STATUS_FAILED
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> 
> 
> Live migration load path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
>  |
> MIGRATION_STATUS_ACTIVE
>  |
> .load state callback -->
>  set device memory size, set device memory buffer, set device config size,
>  set device config buffer
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_COMPLETED
> 
> 
> 
> In source VM side,
> In precopy phase,
> if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> qemu will first get whole snapshot of device memory in .save_setup
> callback, and then it will get total size of dirty data in device memory in
> .save_live_pending callback by reading device_memory.size field of control
> region.
> Then in .save_live_iteration callback, it will get buffer of device memory's
> dirty data chunk by chunk from device memory region by writing pos &
> action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> control region. (size of each chunk is the size of device memory data
> region).
> .save_live_pending and .save_live_iteration may be called several times in
> precopy phase to get dirty data in device memory.
> 
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> like .save_setup, .save_live_pending, .save_live_iteration will not call
> vendor driver's device state interface to get data from devcie memory.
> 
> In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> .log_sync callback will get system memory dirty bitmap from dirty bitmap
> region by writing system memory's start address, page count and action 
> (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> "system_memory.action" fields of control region.
> If page count passed in .log_sync callback is larger than the bitmap size
> the dirty bitmap region supports, Qemu will cut it into chunks and call
> vendor driver's get system memory dirty bitmap interface.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> returns without call to vendor driver.
> 
> In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> in save_live_complete_precopy callback,
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on,
> get device memory size and get device memory buffer will be called again.
> After that,
> device config data is get from device config region by reading
> devcie_config.size of control region and writing action (GET_BITMAP) to
> device_config.action of control region.
> Then after migration completes, in cleanup handler, LOGGING state will be
> cleared (i.e. deivce state is set to STOP).
> Clearing LOGGING state in cleanup handler is in consideration of the case
> of "migration failed" and "migration cancelled". They can also leverage
> the cleanup handler to unset LOGGING state.
> 
> 
> References
> ----------
> 1. kernel side implementation of Device state interfaces:
> https://patchwork.freedesktop.org/series/56876/
> 
> 
> Yan Zhao (5):
>   vfio/migration: define kernel interfaces
>   vfio/migration: support device of device config capability
>   vfio/migration: tracking of dirty page in system memory
>   vfio/migration: turn on migration
>   vfio/migration: support device memory capability
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  26 ++
>  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/pci.h                 |  26 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  linux-headers/linux/vfio.h    | 260 +++++++++++++
>  7 files changed, 1174 insertions(+), 9 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-02-19 11:32   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-19 11:32 UTC (permalink / raw)
  To: Yan Zhao
  Cc: alex.williamson, qemu-devel, intel-gvt-dev, Zhengxiao.zx,
	yi.l.liu, eskultet, ziye.yang, cohuck, shuangtai.tst, zhi.a.wang,
	mlevitsk, pasic, aik, eauger, felipe, jonathan.davies,
	changpeng.liu, Ken.Xue, kwankhede, kevin.tian, cjia,
	arei.gonglei, kvm

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.

Hi,
  I've sent minor comments to later patches; but some minor general
comments:

  a) Never trust the incoming migrations stream - it might be corrupt,
    so check when you can.
  b) How do we detect if we're migrating from/to the wrong device or
version of device?  Or say to a device with older firmware or perhaps
a device that has less device memory ?
  c) Consider using the trace_ mechanism - it's really useful to
add to loops writing/reading data so that you can see when it fails.

Dave

(P.S. You have a few typo's grep your code for 'devcie', 'devie' and
'migrtion'

> Device Data
> -----------
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
>         Every device is supposed to possess device config data.
>     	Usually device config's size is small (no big than 10M), and it
>         needs to be loaded in certain strict order.
>         Therefore, device config only needs to be saved/loaded in
>         stop-and-copy phase.
>         The data of device config is held in device config region.
>         Size of device config data is smaller than or equal to that of
>         device config region.
> 
> Device Memory: device's internal memory, standalone and outside system
>         memory. It is usually very big.
>         This kind of data needs to be saved / loaded in pre-copy and
>         stop-and-copy phase.
>         The data of device memory is held in device memory region.
>         Size of devie memory is usually larger than that of device
>         memory region. qemu needs to save/load it in chunks of size of
>         device memory region.
>         Not all device has device memory. Like IGD only uses system memory.
> 
> System memory dirty pages: If a device produces dirty pages in system
>         memory, it is able to get dirty bitmap for certain range of system
>         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
>         phase in .log_sync callback. By setting dirty bitmap in .log_sync
>         callback, dirty pages in system memory will be save/loaded by ram's
>         live migration code.
>         The dirty bitmap of system memory is held in dirty bitmap region.
>         If system memory range is larger than that dirty bitmap region can
>         hold, qemu will cut it into several chunks and get dirty bitmap in
>         succession.
> 
> 
> Device State Regions
> --------------------
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
>         Get access via read/write system call.
>         Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.
>         device config region: mandatory, holding data of device config
>         device memory region: optional, holding data of device memory
>         dirty bitmap region: optional, holding bitmap of system memory
>                             dirty pages
> 
> (The reason why four separate regions are defined is that the unit of the
> mmap system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seem better than one big region
> that is padded and sparsely mmaped.)
> 
> 
> kernel device state interface [1]
> --------------------------------------
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> #define VFIO_DEVICE_STATE_RUNNING 0 
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2
> 
> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
> 	__u32 version;		  /* ro */
> 	__u32 device_state;       /* VFIO device state, wo */
> 	__u32 caps;		 /* ro */
> 	struct {
> 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> 		__u64 size;    /* rw */
> 	} device_config;
> 	struct {
> 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> 		__u64 size;    /* rw */
> 		__u64 pos;     /* rw, offset in total buffer of device memory */
> 	} device_memory;
> 	struct {
> 		__u64 start_addr; /* wo */
> 		__u64 page_nr;    /* wo */
> 	} system_memory;
> };
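For illustration (not part of the patch set), the handshake against this layout can be sketched in self-contained C; a plain heap buffer stands in for the mmapped control region, the macro and field names mirror the definitions above, and everything else (`open_fake_ctl`, `set_device_state`) is hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP    1
#define VFIO_DEVICE_STATE_LOGGING 2

/* Mirror of the control-region layout above (uint32_t/uint64_t for __u32/__u64). */
struct vfio_device_state_ctl {
	uint32_t version;      /* ro */
	uint32_t device_state; /* wo */
	uint32_t caps;         /* ro */
	struct { uint32_t action; uint64_t size; } device_config;
	struct { uint32_t action; uint64_t size; uint64_t pos; } device_memory;
	struct { uint64_t start_addr; uint64_t page_nr; } system_memory;
};

/* In real code the pointer would come from read/write or mmap of the
 * control region; here a zeroed heap buffer stands in for it. */
static struct vfio_device_state_ctl *open_fake_ctl(void)
{
	struct vfio_device_state_ctl *ctl = calloc(1, sizeof(*ctl));
	ctl->version = 1; /* pretend the vendor driver reported version 1 */
	return ctl;
}

static void set_device_state(struct vfio_device_state_ctl *ctl, uint32_t state)
{
	/* with a real region, this store is what the vendor driver observes */
	ctl->device_state = state;
}
```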
> 
> Device States
> -------------
> After migration is initialized, qemu will set device state via writing to
> the device_state field of control region.
> 
> Four states are defined for a VFIO device:
>         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> 
> RUNNING: In this state, a VFIO device is in active state ready to receive
>         commands from device driver.
>         It is the default state that a VFIO device enters initially.
> 
> STOP:  In this state, a VFIO device is deactivated and stops interacting
>        with the device driver.
> 
> LOGGING: a special state that CANNOT exist independently. It must be
>        set alongside state RUNNING or STOP (i.e. RUNNING & LOGGING,
>        STOP & LOGGING).
>        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
>        driver can start dirty data logging for device memory and system
>        memory.
>        LOGGING only impacts device/system memory: outside LOGGING they
>        return a whole snapshot, while inside LOGGING they return dirty
>        data since the last get operation.
>        Device config should be always accessible and return whole config
>        snapshot regardless of LOGGING state.
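The snapshot-vs-dirty-delta semantics of LOGGING can be modeled with a toy vendor driver (purely illustrative; `toy_dev` and `get_buffer_page_count` are invented here, only the state macros come from the interface above):

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP    1
#define VFIO_DEVICE_STATE_LOGGING 2

/* Toy vendor-driver model: 16 "pages" of device memory plus a dirty mask. */
struct toy_dev {
	uint32_t state;
	uint16_t dirty; /* one bit per page, set on guest writes */
};

/* Pages a GET_BUFFER returns in the current state: outside LOGGING, the
 * whole snapshot; inside LOGGING, only pages dirtied since the last get
 * (and the mask is cleared, giving dirty-since-last-get semantics). */
static int get_buffer_page_count(struct toy_dev *d)
{
	if (!(d->state & VFIO_DEVICE_STATE_LOGGING))
		return 16;
	int n = __builtin_popcount(d->dirty);
	d->dirty = 0;
	return n;
}
```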
>        
> Note:
> The reason why RUNNING is the default state is that device's active state
> must not depend on device state interface.
> It is possible that region vfio_device_state_ctl fails to get registered.
> In that case, a device needs to be in active state by default.
> 
> Get Version & Get Caps
> ----------------------
> On migration init phase, qemu will probe the existence of device state
> regions of vendor driver, then get version of the device state interface
> from the r/w control region.
> 
> Then it will probe VFIO device's data capability by reading caps field of
> control region.
>         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
>         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
>         device memory in pre-copy and stop-and-copy phase. The data of
>         device memory is held in device memory region.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
>         produced by VFIO device during pre-copy and stop-and-copy phase.
>         The dirty bitmap of system memory is held in dirty bitmap region.
> 
> If qemu fails to find the two mandatory regions or the optional data
> regions corresponding to the data caps, or if the version mismatches, it
> will set up a migration blocker and disable live migration for the VFIO
> device.
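The probe decision described above can be sketched as a small pure function (not from the patches; `migration_plan` and `probe_caps` are illustrative names, the macros are the ones defined by the interface):

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

/* What migration init decides from the version and caps it reads out of
 * the control region, plus which optional regions were actually found. */
struct migration_plan {
	int blocked;             /* set up a migration blocker */
	int save_device_memory;  /* use pre-copy device memory path */
	int track_system_memory; /* query dirty bitmap in .log_sync */
};

static struct migration_plan probe_caps(uint32_t version, uint32_t caps,
					int have_memory_region,
					int have_bitmap_region)
{
	struct migration_plan p = {0};

	if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
		p.blocked = 1;
		return p;
	}
	p.save_device_memory = !!(caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
	p.track_system_memory = !!(caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
	/* a cap without its corresponding optional region blocks migration */
	if ((p.save_device_memory && !have_memory_region) ||
	    (p.track_system_memory && !have_bitmap_region))
		p.blocked = 1;
	return p;
}
```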
> 
> 
> Flows to call device state interface for VFIO live migration
> ------------------------------------------------------------
> 
> Live migration save path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_SAVE_SETUP
>  |
>  .save_setup callback -->
>  get device memory size (whole snapshot size)
>  get device memory buffer (whole snapshot data)
>  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
>  |
> MIGRATION_STATUS_ACTIVE
>  |
>  .save_live_pending callback --> get device memory size (dirty data)
>  .save_live_iteration callback --> get device memory buffer (dirty data)
>  .log_sync callback --> get system memory dirty bitmap
>  |
> (vcpu stops) --> set device state -->
>  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
>  |
> .save_live_complete_precopy callback -->
>  get device memory size (dirty data)
>  get device memory buffer (dirty data)
>  get device config size (whole snapshot size)
>  get device config buffer (whole snapshot data)
>  |
> .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> MIGRATION_STATUS_COMPLETED
> 
> MIGRATION_STATUS_CANCELLED or
> MIGRATION_STATUS_FAILED
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> 
> 
> Live migration load path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
>  |
> MIGRATION_STATUS_ACTIVE
>  |
> .load_state callback -->
>  set device memory size, set device memory buffer, set device config size,
>  set device config buffer
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_COMPLETED
> 
> 
> 
> On the source VM side, in precopy phase,
> if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> qemu will first get whole snapshot of device memory in .save_setup
> callback, and then it will get total size of dirty data in device memory in
> .save_live_pending callback by reading device_memory.size field of control
> region.
> Then in .save_live_iteration callback, it will get buffer of device memory's
> dirty data chunk by chunk from device memory region by writing pos &
> action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> control region. (size of each chunk is the size of device memory data
> region).
> .save_live_pending and .save_live_iteration may be called several times in
> precopy phase to get dirty data in device memory.
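The chunked GET_BUFFER loop above can be sketched like this (illustrative only: a local array stands in for the mmapped DEVICE_MEMORY region, and a memcpy from a source buffer stands in for the vendor driver filling it after the pos/GET_BUFFER write):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REGION_LEN 8 /* stand-in for the DEVICE_MEMORY region length */

/* Save "total" bytes of device memory into "out", REGION_LEN bytes at a
 * time.  pos is always 0 or a multiple of the data-region length, as the
 * interface requires. */
static size_t save_device_memory(const char *dev_mem, size_t total, char *out)
{
	char region[REGION_LEN]; /* stand-in for the mmapped data region */
	size_t pos = 0;

	while (pos < total) {
		size_t chunk = total - pos < REGION_LEN ? total - pos
							: REGION_LEN;
		/* qemu: write pos and GET_BUFFER to the control region ... */
		memcpy(region, dev_mem + pos, chunk); /* ... driver fills region */
		memcpy(out + pos, region, chunk);     /* qemu copies to stream */
		pos += chunk;
	}
	return pos;
}
```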
> 
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> like .save_setup, .save_live_pending, .save_live_iteration will not call
> vendor driver's device state interface to get data from device memory.
> 
> In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> .log_sync callback will get system memory dirty bitmap from dirty bitmap
> region by writing system memory's start address, page count and action 
> (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> "system_memory.action" fields of control region.
> If page count passed in .log_sync callback is larger than the bitmap size
> the dirty bitmap region supports, Qemu will cut it into chunks and call
> vendor driver's get system memory dirty bitmap interface.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> returns without calling the vendor driver.
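Since the dirty bitmap region holds one bit per page, one query can cover at most region_size * 8 pages; the chunk count for a larger range works out as below (a sketch, `bitmap_query_chunks` is an invented helper):

```c
#include <assert.h>
#include <stdint.h>

/* Number of .log_sync sub-queries needed to cover page_nr pages when the
 * dirty bitmap region is region_size bytes (one bit per page). */
static int bitmap_query_chunks(uint64_t page_nr, uint64_t region_size)
{
	uint64_t max_pages = region_size * 8;
	return (int)((page_nr + max_pages - 1) / max_pages);
}
```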
> 
> In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> In the .save_live_complete_precopy callback,
> if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> get device memory size and get device memory buffer will be called again.
> After that,
> device config data is got from device config region by reading
> device_config.size of control region and writing action (GET_BUFFER) to
> device_config.action of control region.
> Then after migration completes, in the cleanup handler, LOGGING state will
> be cleared (i.e. device state is set to STOP).
> Clearing LOGGING state in the cleanup handler also covers the "migration
> failed" and "migration cancelled" cases, which leverage the same handler
> to unset LOGGING state.
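Because device config is transferred as a whole (no chunking), the only size rule is that the reported snapshot must fit in the config region; a minimal check (illustrative, `config_size_ok` is an invented name):

```c
#include <assert.h>
#include <stdint.h>

/* Device config is got/set as a whole, so the size the vendor driver
 * reports must be non-zero and no larger than the config region. */
static int config_size_ok(uint64_t config_size, uint64_t region_size)
{
	return config_size > 0 && config_size <= region_size;
}
```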
> 
> 
> References
> ----------
> 1. kernel side implementation of Device state interfaces:
> https://patchwork.freedesktop.org/series/56876/
> 
> 
> Yan Zhao (5):
>   vfio/migration: define kernel interfaces
>   vfio/migration: support device of device config capability
>   vfio/migration: tracking of dirty page in system memory
>   vfio/migration: turn on migration
>   vfio/migration: support device memory capability
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  26 ++
>  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/pci.h                 |  26 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  linux-headers/linux/vfio.h    | 260 +++++++++++++
>  7 files changed, 1174 insertions(+), 9 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [PATCH 1/5] vfio/migration: define kernel interfaces
  2019-02-19  8:52   ` [Qemu-devel] " Yan Zhao
@ 2019-02-19 13:09     ` Cornelia Huck
  -1 siblings, 0 replies; 133+ messages in thread
From: Cornelia Huck @ 2019-02-19 13:09 UTC (permalink / raw)
  To: Yan Zhao
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian, dgilbert,
	alex.williamson, intel-gvt-dev, changpeng.liu, zhi.a.wang,
	jonathan.davies

On Tue, 19 Feb 2019 16:52:14 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> - defined 4 device states regions: one control region and 3 data regions
> - defined layout of control region in struct vfio_device_state_ctl
> - defined 4 device states: running, stop, running&logging, stop&logging
> - define 3 device data categories: device config, device memory, system
>   memory
> - defined 2 device data capabilities: device memory and system memory
> - defined device state interfaces' version and 12 device state interfaces
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> ---
>  linux-headers/linux/vfio.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 260 insertions(+)

[commenting here for convenience; changes obviously need to be done in
the Linux patch]

> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index ceb6453..a124fc1 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -303,6 +303,56 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* Device State region type and sub-type
> + *
> + * A VFIO device driver needs to register up to four device state regions in
> + * total: two mandatory and another two optional, if it plans to support device
> + * state management.

Suggest to rephrase:

"A VFIO device driver that plans to support device state management
needs to register..."

> + *
> + * 1. region CTL :
> + *          Mandatory.
> + *          This is a control region.
> + *          Its layout is defined in struct vfio_device_state_ctl.
> + *          Reading from this region can get version, capabilities and data
> + *          size of device state interfaces.
> + *          Writing to this region can set device state, data size and
> + *          choose which interface to use.
> + * 2. region DEVICE_CONFIG
> + *          Mandatory.
> + *          This is a data region that holds device config data.
> + *          Device config is such kind of data like MMIOs, page tables...

"Device config is data such as..."

> + *          Every device is supposed to possess device config data.
> + *          Usually the size of device config data is small (no big

s/no big/no bigger/

> + *          than 10M), and it needs to be loaded in certain strict
> + *          order.
> + *          Therefore no dirty data logging is enabled for device
> + *          config and it must be got/set as a whole.
> + *          Size of device config data is smaller than or equal to that of
> + *          device config region.

Not sure if I understand that sentence correctly... but what if a
device has more config state than fits into this region? Is that
supposed to be covered by the device memory region? Or is this assumed
to be something so exotic that we don't need to plan for it?

> + *          It is able to be mmaped into user space.
> + * 3. region DEVICE_MEMORY
> + *          Optional.
> + *          This is a data region that holds device memory data.
> + *          Device memory is device's internal memory, standalone and outside

s/outside/distinct from/ ?

> + *          system memory.  It is usually very big.
> + *          Not all device has device memory. Like IGD only uses system

s/all device has/all devices have/

s/Like/E.g./

> + *          memory and has no device memory.
> + *          Size of devie memory is usually larger than that of device

s/devie/device/

> + *          memory region. qemu needs to save/load it in chunks of size of
> + *          device memory region.

I'd rather not explicitly mention QEMU in this header. Maybe
"Userspace"?

> + *          It is able to be mmaped into user space.
> + * 4. region DIRTY_BITMAP
> + *          Optional.
> + *          This is a data region that holds bitmap of dirty pages in system
> + *          memory that a VFIO devices produces.
> + *          It is able to be mmaped into user space.
> + */
> +#define VFIO_REGION_TYPE_DEVICE_STATE           (1 << 1)

Can you make this an explicit number instead?

(FWIW, I plan to add a CCW region as type 2, whatever comes first.)

> +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL       (1)
> +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG      (2)
> +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY      (3)
> +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP (4)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> @@ -816,6 +866,216 @@ struct vfio_iommu_spapr_tce_remove {
>  };
>  #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
>  
> +/* version number of the device state interface */
> +#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1

Hm. Is this supposed to be backwards-compatible, should we need to bump
this?

> +
> +/*
> + * For devices that have devcie memory, it is required to expose

s/devcie/device/

> + * DEVICE_MEMORY capability.
> + *
> + * For devices producing dirty pages in system memory, it is required to
> + * expose cap SYSTEM_MEMORY in order to get dirty bitmap in certain range
> + * of system memory.
> + */
> +#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> +#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> +
> +/*
> + * DEVICE STATES
> + *
> + * Four states are defined for a VFIO device:
> + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> + * They can be set by writing to device_state field of
> + * vfio_device_state_ctl region.

Who controls this? Userspace?

> + *
> + * RUNNING: In this state, a VFIO device is in active state ready to
> + * receive commands from device driver.
> + * It is the default state that a VFIO device enters initially.
> + *
> + * STOP: In this state, a VFIO device is deactivated to interact with
> + * device driver.

I think 'STOPPED' would read nicer.

> + *
> + * LOGGING state is a special state that it CANNOT exist
> + * independently.

So it's not a state, but rather a modifier?

> + * It must be set alongside with state RUNNING or STOP, i.e,
> + * RUNNING & LOGGING, STOP & LOGGING.
> + * It is used for dirty data logging both for device memory
> + * and system memory.
> + *
> + * LOGGING only impacts device/system memory. In LOGGING state, get buffer
> + * of device memory returns dirty pages since last call; outside LOGGING
> + * state, get buffer of device memory returns whole snapshot of device
> + * memory. system memory's dirty page is only available in LOGGING state.
> + *
> + * Device config should be always accessible and return whole config snapshot
> + * regardless of LOGGING state.
> + * */
> +#define VFIO_DEVICE_STATE_RUNNING 0
> +#define VFIO_DEVICE_STATE_STOP 1
> +#define VFIO_DEVICE_STATE_LOGGING 2
> +
> +/* action to get data from device memory or device config
> + * the action is write to device state's control region, and data is read
> + * from device memory region or device config region.
> + * Each time before read device memory region or device config region,
> + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> + * field in control region. That is because device memory and device config
> + * regions are mmaped into user space. vendor driver has to be notified of
> + * the GET_BUFFER action in advance.
> + */
> +#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> +
> +/* action to set data to device memory or device config
> + * the action is write to device state's control region, and data is
> + * written to device memory region or device config region.
> + * Each time after write to device memory region or device config region,
> + * action VFIO_DEVICE_DATA_ACTION_SET_BUFFER is required to write to action
> + * field in control region. That is because device memory and device config
> + * regions are mmaped into user space. vendor driver has to be notified of
> + * the SET_BUFFER action after data is written.
> + */
> +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2

Let me describe this in my own words to make sure that I understand
this correctly.

- The actions are set by userspace to notify the kernel that it is
  going to get data or that it has just written data.
- This is needed as a notification that the mmapped data should not be
  changed resp. just has changed.

So, how does the kernel know whether the read action has finished resp.
whether the write action has started? Even if userspace reads/writes it
as a whole.

> +
> +/* layout of device state interfaces' control region
> + * By reading to control region and reading/writing data from device config
> + * region, device memory region, system memory regions, below interface can
> + * be implemented:
> + *
> + * 1. get version
> + *   (1) user space calls read system call on "version" field of control
> + *   region.
> + *   (2) vendor driver writes version number of device state interfaces
> + *   to the "version" field of control region.
> + *
> + * 2. get caps
> + *   (1) user space calls read system call on "caps" field of control region.
> + *   (2) if a VFIO device has huge device memory, vendor driver reports
> + *      VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY in "caps" field of control region.
> + *      if a VFIO device produces dirty pages in system memory, vendor driver
> + *      reports VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY in "caps" field of
> + *      control region.
> + *
> + * 3. set device state
> + *    (1) user space calls write system call on "device_state" field of
> + *    control region.
> + *    (2) device state transitions as:
> + *
> + *    RUNNING -- start dirty data logging --> RUNNING & LOGGING
> + *    RUNNING -- deactivate --> STOP
> + *    RUNNING -- deactivate & start dirty data longging --> STOP & LOGGING
> + *    RUNNING & LOGGING -- stop dirty data logging --> RUNNING
> + *    RUNNING & LOGGING -- deactivate --> STOP & LOGGING
> + *    RUNNING & LOGGING -- deactivate & stop dirty data logging --> STOP
> + *    STOP -- activate --> RUNNING
> + *    STOP -- start dirty data logging --> STOP & LOGGING
> + *    STOP -- activate & start dirty data logging --> RUNNING & LOGGING
> + *    STOP & LOGGING -- stop dirty data logging --> STOP
> + *    STOP & LOGGING -- activate --> RUNNING & LOGGING
> + *    STOP & LOGGING -- activate & stop dirty data logging --> RUNNING
> + *
> + * 4. get device config size
> + *   (1) user space calls read system call on "device_config.size" field of
> + *       control region for the total size of device config snapshot.
> + *   (2) vendor driver writes device config data's total size in
> + *       "device_config.size" field of control region.
> + *
> + * 5. set device config size
> + *   (1) user space calls write system call.
> + *       total size of device config snapshot --> "device_config.size" field
> + *       of control region.
> + *   (2) vendor driver reads device config data's total size from
> + *       "device_config.size" field of control region.
> + *
> + * 6 get device config buffer
> + *   (1) user space calls write system call.
> + *       "GET_BUFFER" --> "device_config.action" field of control region.
> + *   (2) vendor driver
> + *       a. gets whole snapshot for device config
> + *       b. writes whole device config snapshot to region
> + *       DEVICE_CONFIG.
> + *   (3) user space reads the whole of device config snapshot from region
> + *       DEVICE_CONFIG.
> + *
> + * 7. set device config buffer
> + *   (1) user space writes whole of device config data to region
> + *       DEVICE_CONFIG.
> + *   (2) user space calls write system call.
> + *       "SET_BUFFER" --> "device_config.action" field of control region.
> + *   (3) vendor driver loads whole of device config from region DEVICE_CONFIG.
> + *
> + * 8. get device memory size
> + *   (1) user space calls read system call on "device_memory.size" field of
> + *       control region for device memory size.
> + *   (2) vendor driver
> + *       a. gets device memory snapshot (in state RUNNING or STOP), or
> + *          gets device memory dirty data (in state RUNNING & LOGGING or
> + *          state STOP & LOGGING)
> + *       b. writes size in "device_memory.size" field of control region
> + *
> + * 9. set device memory size
> + *   (1) user space calls write system call on "device_memory.size" field of
> + *       control region to set total size of device memory snapshot.
> + *   (2) vendor driver reads device memory's size from "device_memory.size"
> + *       field of control region.
> + *
> + *
> + * 10. get device memory buffer
> + *   (1) user space calls write system.
> + *       pos --> "device_memory.pos" field of control region,
> + *       "GET_BUFFER" --> "device_memory.action" field of control region.
> + *       (pos must be 0 or multiples of length of region DEVICE_MEMORY).
> + *   (2) vendor driver writes N'th chunk of device memory snapshot/dirty data
> + *       to region DEVICE_MEMORY.
> + *       (N equals to pos/(region length of DEVICE_MEMORY))
> + *   (3) user space reads the N'th chunk of device memory snapshot/dirty data
> + *       from region DEVICE_MEMORY.
> + *
> + * 11. set device memory buffer
> + *   (1) user space writes N'th chunk of device memory snapshot/dirty data to
> + *       region DEVICE_MEMORY.
> + *       (N equals to pos/(region length of DEVICE_MEMORY))
> + *   (2) user space writes pos to "device_memory.pos" field and writes
> + *       "SET_BUFFER" to "device_memory.action" field of control region.
> + *   (3) vendor driver loads N'th chunk of device memory snapshot/dirty data
> + *       from region DEVICE_MEMORY.
> + *
> + * 12. get system memory dirty bitmap
> + *   (1) user space calls write system call to specify a range of system
> + *       memory that querying dirty pages.
> + *       system memory's start address --> "system_memory.start_addr" field
> + *       of control region,
> + *       system memory's page count --> "system_memory.page_nr" field of
> + *       control region.
> + *   (2) if device state is not in RUNNING or STOP & LOGGING,
> + *       vendor driver returns empty bitmap; otherwise,
> + *       vendor driver checks the page_nr,
> + *       if it's larger than the size that region DIRTY_BITMAP can support,
> + *       error returns; if not,
> + *       vendor driver returns as bitmap to specify dirty pages that
> + *       device produces since last query in this range of system memory .
> + *   (3) usespace reads back the dirty bitmap from region DIRTY_BITMAP.
> + *
> + */

It might make sense to extract the explanations above into a separate
design document in the kernel Documentation/ directory. You could also
add ASCII art there :)

> +
> +struct vfio_device_state_ctl {
> +	__u32 version;		  /* ro versio of devcie state interfaces*/

s/versio/version/
s/devcie/device/

> +	__u32 device_state;       /* VFIO device state, wo */
> +	__u32 caps;		 /* ro */
> +        struct {

Indentation looks a bit off.

> +		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> +		__u64 size;    /*rw, total size of device config*/
> +	} device_config;
> +	struct {
> +		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> +		__u64 size;     /* rw, total size of device memory*/
> +        __u64 pos;/*chunk offset in total buffer of device memory*/

Here as well.

> +	} device_memory;
> +	struct {
> +		__u64 start_addr; /* wo */
> +		__u64 page_nr;   /* wo */
> +	} system_memory;
> +}__attribute__((packed));

For an interface definition, it's probably better to avoid packed and
instead add padding if needed.

> +
>  /* ***************************************************************** */
>  
>  #endif /* VFIO_H */

On the whole, I think this is moving into the right direction.


> + * the action is write to device state's control region, and data is
> + * written to device memory region or device config region.
> + * Each time after write to device memory region or device config region,
> + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> + * field in control region. That is because device memory and devie config
> + * region is mmaped into user space. vendor driver has to be notified of
> + * the the SET_BUFFER action after data written.
> + */
> +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2

Let me describe this in my own words to make sure that I understand
this correctly.

- The actions are set by userspace to notify the kernel that it is
  going to get data or that it has just written data.
- This is needed as a notification that the mmapped data must not be
  changed (before a get) or has just been changed (after a write).

So, how does the kernel know when the read action has finished, or when
the write action has started? Even if userspace reads/writes the buffer
as a whole.

> +
> +/* layout of device state interfaces' control region
> + * By reading to control region and reading/writing data from device config
> + * region, device memory region, system memory regions, below interface can
> + * be implemented:
> + *
> + * 1. get version
> + *   (1) user space calls read system call on "version" field of control
> + *   region.
> + *   (2) vendor driver writes version number of device state interfaces
> + *   to the "version" field of control region.
> + *
> + * 2. get caps
> + *   (1) user space calls read system call on "caps" field of control region.
> + *   (2) if a VFIO device has huge device memory, vendor driver reports
> + *      VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY in "caps" field of control region.
> + *      if a VFIO device produces dirty pages in system memory, vendor driver
> + *      reports VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY in "caps" field of
> + *      control region.
> + *
> + * 3. set device state
> + *    (1) user space calls write system call on "device_state" field of
> + *    control region.
> + *    (2) device state transitions as:
> + *
> + *    RUNNING -- start dirty data logging --> RUNNING & LOGGING
> + *    RUNNING -- deactivate --> STOP
> + *    RUNNING -- deactivate & start dirty data longging --> STOP & LOGGING
> + *    RUNNING & LOGGING -- stop dirty data logging --> RUNNING
> + *    RUNNING & LOGGING -- deactivate --> STOP & LOGGING
> + *    RUNNING & LOGGING -- deactivate & stop dirty data logging --> STOP
> + *    STOP -- activate --> RUNNING
> + *    STOP -- start dirty data logging --> STOP & LOGGING
> + *    STOP -- activate & start dirty data logging --> RUNNING & LOGGING
> + *    STOP & LOGGING -- stop dirty data logging --> STOP
> + *    STOP & LOGGING -- activate --> RUNNING & LOGGING
> + *    STOP & LOGGING -- activate & stop dirty data logging --> RUNNING
> + *
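If I read the table correctly, STOP and LOGGING are independent bits, so
every listed transition is just setting or clearing one or both of them.
A sketch of that interpretation (the `demo_` names are mine; the bit
values mirror the #defines in this patch):

```c
#include <stdint.h>

#define DEMO_STATE_RUNNING 0u
#define DEMO_STATE_STOP    (1u << 0)
#define DEMO_STATE_LOGGING (1u << 1)

/* stop/logging: 1 = set the bit, 0 = clear it, negative = leave alone.
 * Every row of the transition table above reduces to one such call. */
static uint32_t demo_transition(uint32_t state, int stop, int logging)
{
    if (stop >= 0)
        state = stop ? (state | DEMO_STATE_STOP)
                     : (state & ~DEMO_STATE_STOP);
    if (logging >= 0)
        state = logging ? (state | DEMO_STATE_LOGGING)
                        : (state & ~DEMO_STATE_LOGGING);
    return state;
}
```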
> + * 4. get device config size
> + *   (1) user space calls read system call on "device_config.size" field of
> + *       control region for the total size of device config snapshot.
> + *   (2) vendor driver writes device config data's total size in
> + *       "device_config.size" field of control region.
> + *
> + * 5. set device config size
> + *   (1) user space calls write system call.
> + *       total size of device config snapshot --> "device_config.size" field
> + *       of control region.
> + *   (2) vendor driver reads device config data's total size from
> + *       "device_config.size" field of control region.
> + *
> + * 6 get device config buffer
> + *   (1) user space calls write system call.
> + *       "GET_BUFFER" --> "device_config.action" field of control region.
> + *   (2) vendor driver
> + *       a. gets whole snapshot for device config
> + *       b. writes whole device config snapshot to region
> + *       DEVICE_CONFIG.
> + *   (3) user space reads the whole of device config snapshot from region
> + *       DEVICE_CONFIG.
> + *
> + * 7. set device config buffer
> + *   (1) user space writes whole of device config data to region
> + *       DEVICE_CONFIG.
> + *   (2) user space calls write system call.
> + *       "SET_BUFFER" --> "device_config.action" field of control region.
> + *   (3) vendor driver loads whole of device config from region DEVICE_CONFIG.
> + *
> + * 8. get device memory size
> + *   (1) user space calls read system call on "device_memory.size" field of
> + *       control region for device memory size.
> + *   (2) vendor driver
> + *       a. gets device memory snapshot (in state RUNNING or STOP), or
> + *          gets device memory dirty data (in state RUNNING & LOGGING or
> + *          state STOP & LOGGING)
> + *       b. writes size in "device_memory.size" field of control region
> + *
> + * 9. set device memory size
> + *   (1) user space calls write system call on "device_memory.size" field of
> + *       control region to set total size of device memory snapshot.
> + *   (2) vendor driver reads device memory's size from "device_memory.size"
> + *       field of control region.
> + *
> + *
> + * 10. get device memory buffer
> + *   (1) user space calls write system.
> + *       pos --> "device_memory.pos" field of control region,
> + *       "GET_BUFFER" --> "device_memory.action" field of control region.
> + *       (pos must be 0 or multiples of length of region DEVICE_MEMORY).
> + *   (2) vendor driver writes N'th chunk of device memory snapshot/dirty data
> + *       to region DEVICE_MEMORY.
> + *       (N equals to pos/(region length of DEVICE_MEMORY))
> + *   (3) user space reads the N'th chunk of device memory snapshot/dirty data
> + *       from region DEVICE_MEMORY.
> + *
> + * 11. set device memory buffer
> + *   (1) user space writes N'th chunk of device memory snapshot/dirty data to
> + *       region DEVICE_MEMORY.
> + *       (N equals to pos/(region length of DEVICE_MEMORY))
> + *   (2) user space writes pos to "device_memory.pos" field and writes
> + *       "SET_BUFFER" to "device_memory.action" field of control region.
> + *   (3) vendor driver loads N'th chunk of device memory snapshot/dirty data
> + *       from region DEVICE_MEMORY.
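Just to restate the chunking rule from steps 10/11 in code: pos must be 0
or a multiple of the DEVICE_MEMORY region length, and N = pos / region
length selects the chunk. Sketch only; the names are mine:

```c
#include <stdint.h>

/* Validate pos and derive the chunk index N per the spec text above. */
static int demo_chunk_index(uint64_t pos, uint64_t region_len, uint64_t *n)
{
    if (region_len == 0 || pos % region_len != 0)
        return -1;      /* pos is not 0 or a multiple of the region length */
    *n = pos / region_len;
    return 0;
}
```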
> + *
> + * 12. get system memory dirty bitmap
> + *   (1) user space calls write system call to specify a range of system
> + *       memory that querying dirty pages.
> + *       system memory's start address --> "system_memory.start_addr" field
> + *       of control region,
> + *       system memory's page count --> "system_memory.page_nr" field of
> + *       control region.
> + *   (2) if device state is not in RUNNING or STOP & LOGGING,
> + *       vendor driver returns empty bitmap; otherwise,
> + *       vendor driver checks the page_nr,
> + *       if it's larger than the size that region DIRTY_BITMAP can support,
> + *       error returns; if not,
> + *       vendor driver returns as bitmap to specify dirty pages that
> + *       device produces since last query in this range of system memory .
> + *   (3) usespace reads back the dirty bitmap from region DIRTY_BITMAP.
> + *
> + */

It might make sense to extract the explanations above into a separate
design document in the kernel Documentation/ directory. You could also
add ASCII art there :)

> +
> +struct vfio_device_state_ctl {
> +	__u32 version;		  /* ro versio of devcie state interfaces*/

s/versio/version/
s/devcie/device/

> +	__u32 device_state;       /* VFIO device state, wo */
> +	__u32 caps;		 /* ro */
> +        struct {

Indentation looks a bit off.

> +		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> +		__u64 size;    /*rw, total size of device config*/
> +	} device_config;
> +	struct {
> +		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> +		__u64 size;     /* rw, total size of device memory*/
> +        __u64 pos;/*chunk offset in total buffer of device memory*/

Here as well.

> +	} device_memory;
> +	struct {
> +		__u64 start_addr; /* wo */
> +		__u64 page_nr;   /* wo */
> +	} system_memory;
> +}__attribute__((packed));

For an interface definition, it's probably better to avoid packed and
instead add padding if needed.

> +
>  /* ***************************************************************** */
>  
>  #endif /* VFIO_H */

On the whole, I think this is moving in the right direction.


* Re: [PATCH 2/5] vfio/migration: support device of device config capability
  2019-02-19  8:52   ` [Qemu-devel] " Yan Zhao
@ 2019-02-19 14:37     ` Cornelia Huck
  -1 siblings, 0 replies; 133+ messages in thread
From: Cornelia Huck @ 2019-02-19 14:37 UTC (permalink / raw)
  To: Yan Zhao
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian, dgilbert,
	alex.williamson, intel-gvt-dev, changpeng.liu, zhi.a.wang,
	jonathan.davies

On Tue, 19 Feb 2019 16:52:27 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> Device config is the default data that every device should have. so
> device config capability is by default on, no need to set.
> 
> - Currently two type of resources are saved/loaded for device of device
>   config capability:
>   General PCI config data, and Device config data.
>   They are copies as a whole when precopy is stopped.
> 
> Migration setup flow:
> - Setup device state regions, check its device state version and capabilities.
>   Mmap Device Config Region and Dirty Bitmap Region, if available.
> - If device state regions are failed to get setup, a migration blocker is
>   registered instead.
> - Added SaveVMHandlers to register device state save/load handlers.
> - Register VM state change handler to set device's running/stop states.
> - On migration startup on source machine, set device's state to
>   VFIO_DEVICE_STATE_LOGGING
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |   1 -
>  hw/vfio/pci.h                 |  25 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  5 files changed, 659 insertions(+), 3 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index 8b3f664..f32ff19 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,6 +1,6 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
> -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o

I think you want to split the migration code: The type-independent
code, and the pci-specific code.

>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 0000000..16d6395
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,633 @@
> +#include "qemu/osdep.h"
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "migration/blocker.h"
> +#include "migration/register.h"
> +#include "qapi/error.h"
> +#include "pci.h"
> +#include "sysemu/kvm.h"
> +#include "exec/ram_addr.h"
> +
> +#define VFIO_SAVE_FLAG_SETUP 0
> +#define VFIO_SAVE_FLAG_PCI 1
> +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> +#define VFIO_SAVE_FLAG_CONTINUE 8
> +
> +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> +        VFIORegion *region, uint32_t subtype, const char *name)

This function looks like it should be more generic and e.g. take a
VFIODevice instead of a VFIOPCIDevice as argument.

> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    struct vfio_region_info *info;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> +            subtype, &info);
> +    if (ret) {
> +        error_report("Failed to get info of region %s", name);
> +        return ret;
> +    }
> +
> +    if (vfio_region_setup(OBJECT(vdev), vbasedev,
> +            region, info->index, name)) {
> +        error_report("Failed to setup migrtion region %s", name);
> +        return ret;
> +    }
> +
> +    if (vfio_region_mmap(region)) {
> +        error_report("Failed to mmap migrtion region %s", name);
> +    }
> +
> +    return 0;
> +}
> +
> +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> +{
> +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> +}
> +
> +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> +{
> +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> +}

These two as well. The migration structure should probably hang off the
VFIODevice instead.

> +
> +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> +{
> +    bool mmaped = true;
> +    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> +            (region->size != region->mmaps[0].size) ||
> +            (region->mmaps[0].mmap == NULL)) {
> +        mmaped = false;
> +    }
> +
> +    return mmaped;
> +}

s/mmaped/mmapped/ ?

> +
> +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];

This looks like it should not depend on pci, either.

> +    uint64_t len;
> +    int sz;
> +
> +    sz = sizeof(len);
> +    if (pread(vbasedev->fd, &len, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.size))
> +            != sz) {
> +        error_report("vfio: Failed to get length of device config");
> +        return -1;
> +    }
> +    if (len > region_config->size) {
> +        error_report("vfio: Error device config length");
> +        return -1;
> +    }
> +    vdev->migration->devconfig_size = len;
> +
> +    return 0;
> +}
> +
> +static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];

Ditto. Also for the functions below.

> +    int sz;
> +
> +    if (size > region_config->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(size);
> +    if (pwrite(vbasedev->fd, &size, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.size))
> +            != sz) {
> +        error_report("vfio: Failed to set length of device config");
> +        return -1;
> +    }
> +    vdev->migration->devconfig_size = size;
> +    return 0;
> +}
> +
> +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> +    uint64_t len = vdev->migration->devconfig_size;
> +
> +    qemu_put_be64(f, len);

Why big endian? (Generally, do we need any endianness considerations?)

> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.action))
> +            != sz) {
> +        error_report("vfio: action failure for device config get buffer");
> +        return -1;
> +    }

Might make sense to wrap this into a set_action() helper that takes a
SET_BUFFER/GET_BUFFER argument.

> +
> +    if (!vfio_device_state_region_mmaped(region_config)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
> +            error_report("vfio: Failed read device config buffer");
> +            return -1;
> +        }
> +        qemu_put_buffer(f, buf, len);
> +        g_free(buf);
> +    } else {
> +        dest = region_config->mmaps[0].mmap;
> +        qemu_put_buffer(f, dest, len);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> +                            QEMUFile *f, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> +
> +    vfio_set_device_config_size(vdev, len);
> +
> +    if (!vfio_device_state_region_mmaped(region_config)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        qemu_get_buffer(f, buf, len);
> +        if (pwrite(vbasedev->fd, buf, len,
> +                    region_config->fd_offset) != len) {
> +            error_report("vfio: Failed to write devie config buffer");
> +            return -1;
> +        }
> +        g_free(buf);
> +    } else {
> +        dest = region_config->mmaps[0].mmap;
> +        qemu_get_buffer(f, dest, len);
> +    }
> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.action))
> +            != sz) {
> +        error_report("vfio: action failure for device config set buffer");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> +        uint64_t start_addr, uint64_t page_nr)
> +{
> +
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_bitmap =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> +    unsigned long bitmap_size =
> +                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
> +    uint32_t sz;
> +
> +    struct {
> +        __u64 start_addr;
> +        __u64 page_nr;
> +    } system_memory;
> +    system_memory.start_addr = start_addr;
> +    system_memory.page_nr = page_nr;
> +    sz = sizeof(system_memory);
> +    if (pwrite(vbasedev->fd, &system_memory, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, system_memory))
> +            != sz) {
> +        error_report("vfio: Failed to set system memory range for dirty pages");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_bitmap)) {
> +        void *bitmap = g_malloc0(bitmap_size);
> +
> +        if (pread(vbasedev->fd, bitmap, bitmap_size,
> +                    region_bitmap->fd_offset) != bitmap_size) {
> +            error_report("vfio: Failed to read dirty bitmap data");
> +            return -1;
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
> +
> +        g_free(bitmap);
> +    } else {
> +        cpu_physical_memory_set_dirty_lebitmap(
> +                    region_bitmap->mmaps[0].mmap,
> +                    start_addr, page_nr);
> +    }
> +   return 0;
> +}
> +
> +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> +        uint64_t start_addr, uint64_t page_nr)
> +{
> +    VFIORegion *region_bitmap =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> +    unsigned long chunk_size = region_bitmap->size;
> +    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
> +                                BITS_PER_LONG;
> +
> +    uint64_t cnt_left;
> +    int rc = 0;
> +
> +    cnt_left = page_nr;
> +
> +    while (cnt_left >= chunk_pg_nr) {
> +        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
> +        if (rc) {
> +            goto exit;
> +        }
> +        cnt_left -= chunk_pg_nr;
> +        start_addr += start_addr;
> +   }
> +   rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
> +
> +exit:
> +   return rc;
> +}
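The loop above advances with "start_addr += start_addr", which doubles the
address each iteration instead of stepping past the pages just handled.
A sketch of the walk I believe was intended; the page-size multiplier is
my assumption (start_addr in bytes, chunk_pg_nr in pages), and the
`demo_` names are invented:

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL   /* assumed page size, for illustration */

/* Walk [start_addr, start_addr + page_nr pages) in chunks of chunk_pg_nr
 * pages, calling chunk_fn for each chunk and once for the remainder. */
static int demo_walk_chunks(uint64_t start_addr, uint64_t page_nr,
                            uint64_t chunk_pg_nr,
                            int (*chunk_fn)(uint64_t addr, uint64_t pages,
                                            void *opaque),
                            void *opaque)
{
    while (page_nr >= chunk_pg_nr) {
        int rc = chunk_fn(start_addr, chunk_pg_nr, opaque);
        if (rc)
            return rc;
        page_nr -= chunk_pg_nr;
        start_addr += chunk_pg_nr * DEMO_PAGE_SIZE;  /* not += start_addr */
    }
    return page_nr ? chunk_fn(start_addr, page_nr, opaque) : 0;
}
```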
> +
> +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> +        uint32_t dev_state)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    uint32_t sz = sizeof(dev_state);
> +
> +    if (!vdev->migration) {
> +        return -1;
> +    }
> +
> +    if (pwrite(vbasedev->fd, &dev_state, sz,
> +              region->fd_offset +
> +              offsetof(struct vfio_device_state_ctl, device_state))
> +            != sz) {
> +        error_report("vfio: Failed to set device state %d", dev_state);

Can the kernel reject this if a state transition is not allowed (or are
all transitions allowed?)

> +        return -1;
> +    }
> +    vdev->migration->device_state = dev_state;
> +    return 0;
> +}
> +
> +static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +
> +    uint32_t caps;
> +    uint32_t size = sizeof(caps);
> +
> +    if (pread(vbasedev->fd, &caps, size,
> +                region->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, caps))
> +            != size) {
> +        error_report("%s Failed to read data caps of device states",
> +                vbasedev->name);
> +        return -1;
> +    }
> +    vdev->migration->data_caps = caps;
> +    return 0;
> +}
> +
> +
> +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +
> +    uint32_t version;
> +    uint32_t size = sizeof(version);
> +
> +    if (pread(vbasedev->fd, &version, size,
> +                region->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, version))
> +            != size) {
> +        error_report("%s Failed to read version of device state interfaces",
> +                vbasedev->name);
> +        return -1;
> +    }
> +
> +    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> +        error_report("%s migration version mismatch, right version is %d",
> +                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);

So, we require an exact match... or should we allow the interface to be
extended in a backwards-compatible way, in which case we'd require
(QEMU interface version) <= (kernel interface version)?
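I.e., something along these lines on the userspace side; purely a sketch
of the relaxed policy, with made-up names:

```c
#include <stdint.h>

/* Interface version this userspace was built against (illustrative). */
#define DEMO_USERSPACE_IF_VERSION 1u

/* Accept any kernel interface version at least as new as ours, instead
 * of insisting on an exact match. */
static int demo_check_version(uint32_t kernel_if_version)
{
    return kernel_if_version >= DEMO_USERSPACE_IF_VERSION ? 0 : -1;
}
```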

> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
> +{
> +    VFIOPCIDevice *vdev = pv;
> +    uint32_t dev_state = vdev->migration->device_state;
> +
> +    if (!running) {
> +        dev_state |= VFIO_DEVICE_STATE_STOP;
> +    } else {
> +        dev_state &= ~VFIO_DEVICE_STATE_STOP;
> +    }
> +
> +    vfio_set_device_state(vdev, dev_state);
> +}
> +
> +static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> +                                   uint64_t max_size,
> +                                   uint64_t *res_precopy_only,
> +                                   uint64_t *res_compatible,
> +                                   uint64_t *res_post_copy_only)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +
> +    if (!vfio_device_data_cap_device_memory(vdev)) {
> +        return;
> +    }
> +
> +    return;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +
> +    if (!vfio_device_data_cap_device_memory(vdev)) {
> +        return 0;
> +    }
> +
> +    return 0;
> +}

These look a bit weird...

> +
> +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +    bool msi_64bit;
> +
> +    /* retore pci bar configuration */
> +    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +            ctl & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        bar_cfg = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> +    }
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> +
> +    /* restore msi configuration */
> +    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> +
> +    vfio_pci_write_config(&vdev->pdev,
> +            pdev->msi_cap + PCI_MSI_FLAGS,
> +            ctl & (!PCI_MSI_FLAGS_ENABLE), 2);
> +
> +    msi_lo = qemu_get_be32(f);
> +    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> +
> +    if (msi_64bit) {
> +        msi_hi = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                msi_hi, 4);
> +    }
> +    msi_data = qemu_get_be32(f);
> +    vfio_pci_write_config(pdev,
> +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +            msi_data, 2);
> +
> +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +            ctl | PCI_MSI_FLAGS_ENABLE, 2);
> +

Ok, this function is indeed pci-specific and probably should be moved
to the vfio-pci code (other types could hook themselves up in the same
place, then).

> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    int flag;
> +    uint64_t len;
> +    int ret = 0;
> +
> +    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> +        return -EINVAL;
> +    }
> +
> +    do {
> +        flag = qemu_get_byte(f);
> +
> +        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
> +        case VFIO_SAVE_FLAG_SETUP:
> +            break;
> +        case VFIO_SAVE_FLAG_PCI:
> +            vfio_pci_load_config(vdev, f);
> +            break;
> +        case VFIO_SAVE_FLAG_DEVCONFIG:
> +            len = qemu_get_be64(f);
> +            vfio_load_data_device_config(vdev, f, len);
> +            break;
> +        default:
> +            ret = -EINVAL;
> +        }
> +    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
> +
> +    return ret;
> +}
> +
> +static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +    bool msi_64bit;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar_cfg);
> +    }
> +
> +    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
> +
> +    msi_lo = pci_default_read_config(pdev,
> +            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +    qemu_put_be32(f, msi_lo);
> +
> +    if (msi_64bit) {
> +        msi_hi = pci_default_read_config(pdev,
> +                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                4);
> +        qemu_put_be32(f, msi_hi);
> +    }
> +
* Re: [Qemu-devel] [PATCH 2/5] vfio/migration: support device of device config capability
@ 2019-02-19 14:37     ` Cornelia Huck
  0 siblings, 0 replies; 133+ messages in thread
From: Cornelia Huck @ 2019-02-19 14:37 UTC (permalink / raw)
  To: Yan Zhao
  Cc: alex.williamson, qemu-devel, intel-gvt-dev, Zhengxiao.zx,
	yi.l.liu, eskultet, ziye.yang, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue, kwankhede, kevin.tian,
	cjia, arei.gonglei, kvm

On Tue, 19 Feb 2019 16:52:27 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> Device config is the default data that every device should have. so
> device config capability is by default on, no need to set.
> 
> - Currently two type of resources are saved/loaded for device of device
>   config capability:
>   General PCI config data, and Device config data.
>   They are copies as a whole when precopy is stopped.
> 
> Migration setup flow:
> - Setup device state regions, check its device state version and capabilities.
>   Mmap Device Config Region and Dirty Bitmap Region, if available.
> - If device state regions are failed to get setup, a migration blocker is
>   registered instead.
> - Added SaveVMHandlers to register device state save/load handlers.
> - Register VM state change handler to set device's running/stop states.
> - On migration startup on source machine, set device's state to
>   VFIO_DEVICE_STATE_LOGGING
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |   1 -
>  hw/vfio/pci.h                 |  25 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  5 files changed, 659 insertions(+), 3 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index 8b3f664..f32ff19 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,6 +1,6 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
> -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o

I think you want to split the migration code: The type-independent
code, and the pci-specific code.
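Concretely, the build split might look like this (the file name `migration-pci.o` is a placeholder, just to illustrate the layering):

```makefile
# Type-independent migration core, built whenever softmmu is:
obj-$(CONFIG_SOFTMMU) += common.o migration.o
# PCI-specific migration hooks live with the rest of the PCI code:
obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration-pci.o
```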

>  obj-$(CONFIG_VFIO_CCW) += ccw.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 0000000..16d6395
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,633 @@
> +#include "qemu/osdep.h"
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "migration/blocker.h"
> +#include "migration/register.h"
> +#include "qapi/error.h"
> +#include "pci.h"
> +#include "sysemu/kvm.h"
> +#include "exec/ram_addr.h"
> +
> +#define VFIO_SAVE_FLAG_SETUP 0
> +#define VFIO_SAVE_FLAG_PCI 1
> +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> +#define VFIO_SAVE_FLAG_CONTINUE 8
> +
> +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> +        VFIORegion *region, uint32_t subtype, const char *name)

This function looks like it should be more generic and e.g. take a
VFIODevice instead of a VFIOPCIDevice as argument.

> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    struct vfio_region_info *info;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> +            subtype, &info);
> +    if (ret) {
> +        error_report("Failed to get info of region %s", name);
> +        return ret;
> +    }
> +
> +    if (vfio_region_setup(OBJECT(vdev), vbasedev,
> +            region, info->index, name)) {
> +        error_report("Failed to setup migrtion region %s", name);
> +        return ret;
> +    }
> +
> +    if (vfio_region_mmap(region)) {
> +        error_report("Failed to mmap migrtion region %s", name);
> +    }
> +
> +    return 0;
> +}
> +
> +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> +{
> +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> +}
> +
> +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> +{
> +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> +}

These two as well. The migration structure should probably hang off the
VFIODevice instead.

> +
> +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> +{
> +    bool mmaped = true;
> +    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> +            (region->size != region->mmaps[0].size) ||
> +            (region->mmaps[0].mmap == NULL)) {
> +        mmaped = false;
> +    }
> +
> +    return mmaped;
> +}

s/mmaped/mmapped/ ?

> +
> +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];

This looks like it should not depend on pci, either.

> +    uint64_t len;
> +    int sz;
> +
> +    sz = sizeof(len);
> +    if (pread(vbasedev->fd, &len, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.size))
> +            != sz) {
> +        error_report("vfio: Failed to get length of device config");
> +        return -1;
> +    }
> +    if (len > region_config->size) {
> +        error_report("vfio: Error device config length");
> +        return -1;
> +    }
> +    vdev->migration->devconfig_size = len;
> +
> +    return 0;
> +}
> +
> +static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];

Ditto. Also for the functions below.

> +    int sz;
> +
> +    if (size > region_config->size) {
> +        return -1;
> +    }
> +
> +    sz = sizeof(size);
> +    if (pwrite(vbasedev->fd, &size, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.size))
> +            != sz) {
> +        error_report("vfio: Failed to set length of device config");
> +        return -1;
> +    }
> +    vdev->migration->devconfig_size = size;
> +    return 0;
> +}
> +
> +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> +    uint64_t len = vdev->migration->devconfig_size;
> +
> +    qemu_put_be64(f, len);

Why big endian? (Generally, do we need any endianness considerations?)

> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.action))
> +            != sz) {
> +        error_report("vfio: action failure for device config get buffer");
> +        return -1;
> +    }

Might make sense to wrap this into a set_action() helper that takes a
SET_BUFFER/GET_BUFFER argument.

> +
> +    if (!vfio_device_state_region_mmaped(region_config)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
> +            error_report("vfio: Failed read device config buffer");
> +            return -1;
> +        }
> +        qemu_put_buffer(f, buf, len);
> +        g_free(buf);
> +    } else {
> +        dest = region_config->mmaps[0].mmap;
> +        qemu_put_buffer(f, dest, len);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> +                            QEMUFile *f, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_config =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> +
> +    vfio_set_device_config_size(vdev, len);
> +
> +    if (!vfio_device_state_region_mmaped(region_config)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        qemu_get_buffer(f, buf, len);
> +        if (pwrite(vbasedev->fd, buf, len,
> +                    region_config->fd_offset) != len) {
> +            error_report("vfio: Failed to write devie config buffer");
> +            return -1;
> +        }
> +        g_free(buf);
> +    } else {
> +        dest = region_config->mmaps[0].mmap;
> +        qemu_get_buffer(f, dest, len);
> +    }
> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_config.action))
> +            != sz) {
> +        error_report("vfio: action failure for device config set buffer");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> +        uint64_t start_addr, uint64_t page_nr)
> +{
> +
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_bitmap =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> +    unsigned long bitmap_size =
> +                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
> +    uint32_t sz;
> +
> +    struct {
> +        __u64 start_addr;
> +        __u64 page_nr;
> +    } system_memory;
> +    system_memory.start_addr = start_addr;
> +    system_memory.page_nr = page_nr;
> +    sz = sizeof(system_memory);
> +    if (pwrite(vbasedev->fd, &system_memory, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, system_memory))
> +            != sz) {
> +        error_report("vfio: Failed to set system memory range for dirty pages");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_bitmap)) {
> +        void *bitmap = g_malloc0(bitmap_size);
> +
> +        if (pread(vbasedev->fd, bitmap, bitmap_size,
> +                    region_bitmap->fd_offset) != bitmap_size) {
> +            error_report("vfio: Failed to read dirty bitmap data");
> +            return -1;
> +        }
> +
> +        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
> +
> +        g_free(bitmap);
> +    } else {
> +        cpu_physical_memory_set_dirty_lebitmap(
> +                    region_bitmap->mmaps[0].mmap,
> +                    start_addr, page_nr);
> +    }
> +   return 0;
> +}
> +
> +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> +        uint64_t start_addr, uint64_t page_nr)
> +{
> +    VFIORegion *region_bitmap =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> +    unsigned long chunk_size = region_bitmap->size;
> +    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
> +                                BITS_PER_LONG;
> +
> +    uint64_t cnt_left;
> +    int rc = 0;
> +
> +    cnt_left = page_nr;
> +
> +    while (cnt_left >= chunk_pg_nr) {
> +        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
> +        if (rc) {
> +            goto exit;
> +        }
> +        cnt_left -= chunk_pg_nr;
> +        start_addr += start_addr;
> +   }
> +   rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
> +
> +exit:
> +   return rc;
> +}
> +
> +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> +        uint32_t dev_state)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    uint32_t sz = sizeof(dev_state);
> +
> +    if (!vdev->migration) {
> +        return -1;
> +    }
> +
> +    if (pwrite(vbasedev->fd, &dev_state, sz,
> +              region->fd_offset +
> +              offsetof(struct vfio_device_state_ctl, device_state))
> +            != sz) {
> +        error_report("vfio: Failed to set device state %d", dev_state);

Can the kernel reject this if a state transition is not allowed (or are
all transitions allowed?)

> +        return -1;
> +    }
> +    vdev->migration->device_state = dev_state;
> +    return 0;
> +}
> +
> +static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +
> +    uint32_t caps;
> +    uint32_t size = sizeof(caps);
> +
> +    if (pread(vbasedev->fd, &caps, size,
> +                region->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, caps))
> +            != size) {
> +        error_report("%s Failed to read data caps of device states",
> +                vbasedev->name);
> +        return -1;
> +    }
> +    vdev->migration->data_caps = caps;
> +    return 0;
> +}
> +
> +
> +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +
> +    uint32_t version;
> +    uint32_t size = sizeof(version);
> +
> +    if (pread(vbasedev->fd, &version, size,
> +                region->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, version))
> +            != size) {
> +        error_report("%s Failed to read version of device state interfaces",
> +                vbasedev->name);
> +        return -1;
> +    }
> +
> +    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> +        error_report("%s migration version mismatch, right version is %d",
> +                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);

So, we require an exact match... or should we allow extending the
interface in a backwards-compatible way, in which case we'd require
(QEMU interface version) <= (kernel interface version)?
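A tolerant check along those lines could be as simple as the following sketch (the posted patch requires an exact match; this variant assumes newer kernel interface versions remain backwards-compatible):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Accept any kernel that implements at least the interface version QEMU was
 * built against; reject only kernels that are too old.
 */
static int devstate_version_compatible(uint32_t qemu_ver, uint32_t kernel_ver)
{
    return kernel_ver >= qemu_ver ? 0 : -1;
}
```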

> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
> +{
> +    VFIOPCIDevice *vdev = pv;
> +    uint32_t dev_state = vdev->migration->device_state;
> +
> +    if (!running) {
> +        dev_state |= VFIO_DEVICE_STATE_STOP;
> +    } else {
> +        dev_state &= ~VFIO_DEVICE_STATE_STOP;
> +    }
> +
> +    vfio_set_device_state(vdev, dev_state);
> +}
> +
> +static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> +                                   uint64_t max_size,
> +                                   uint64_t *res_precopy_only,
> +                                   uint64_t *res_compatible,
> +                                   uint64_t *res_post_copy_only)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +
> +    if (!vfio_device_data_cap_device_memory(vdev)) {
> +        return;
> +    }
> +
> +    return;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +
> +    if (!vfio_device_data_cap_device_memory(vdev)) {
> +        return 0;
> +    }
> +
> +    return 0;
> +}

These look a bit weird...

> +
> +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +    bool msi_64bit;
> +
> +    /* retore pci bar configuration */
> +    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +            ctl & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        bar_cfg = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> +    }
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> +
> +    /* restore msi configuration */
> +    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> +
> +    vfio_pci_write_config(&vdev->pdev,
> +            pdev->msi_cap + PCI_MSI_FLAGS,
> +            ctl & (!PCI_MSI_FLAGS_ENABLE), 2);
> +
> +    msi_lo = qemu_get_be32(f);
> +    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> +
> +    if (msi_64bit) {
> +        msi_hi = qemu_get_be32(f);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                msi_hi, 4);
> +    }
> +    msi_data = qemu_get_be32(f);
> +    vfio_pci_write_config(pdev,
> +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +            msi_data, 2);
> +
> +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +            ctl | PCI_MSI_FLAGS_ENABLE, 2);
> +

Ok, this function is indeed pci-specific and probably should be moved
to the vfio-pci code (other types could hook themselves up in the same
place, then).
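One way to structure that hook-up, sketched with stand-in types (the ops-table shape and names are hypothetical, not something from the posted series):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in so the sketch is self-contained (QEMUFile stays opaque here). */
typedef struct QEMUFile QEMUFile;

/* Per-type hooks that e.g. vfio-pci (or ccw) would register with the
 * common migration code. */
typedef struct VFIOMigrationOps {
    void (*save_config)(void *dev, QEMUFile *f);
    void (*load_config)(void *dev, QEMUFile *f);
} VFIOMigrationOps;

/* The type-independent migration core only ever calls through the table. */
static void migration_save_config(const VFIOMigrationOps *ops,
                                  void *dev, QEMUFile *f)
{
    if (ops && ops->save_config) {
        ops->save_config(dev, f);
    }
}

/* Toy PCI hook: a real one would do what vfio_pci_save_config() does. */
static int pci_saves;
static void pci_save_config(void *dev, QEMUFile *f)
{
    (void)dev;
    (void)f;
    pci_saves++;
}
```

A device type without config hooks simply registers a NULL entry and the common code skips it.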

> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    int flag;
> +    uint64_t len;
> +    int ret = 0;
> +
> +    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> +        return -EINVAL;
> +    }
> +
> +    do {
> +        flag = qemu_get_byte(f);
> +
> +        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
> +        case VFIO_SAVE_FLAG_SETUP:
> +            break;
> +        case VFIO_SAVE_FLAG_PCI:
> +            vfio_pci_load_config(vdev, f);
> +            break;
> +        case VFIO_SAVE_FLAG_DEVCONFIG:
> +            len = qemu_get_be64(f);
> +            vfio_load_data_device_config(vdev, f, len);
> +            break;
> +        default:
> +            ret = -EINVAL;
> +        }
> +    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
> +
> +    return ret;
> +}
> +
> +static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
> +    bool msi_64bit;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> +        qemu_put_be32(f, bar_cfg);
> +    }
> +
> +    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
> +
> +    msi_lo = pci_default_read_config(pdev,
> +            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +    qemu_put_be32(f, msi_lo);
> +
> +    if (msi_64bit) {
> +        msi_hi = pci_default_read_config(pdev,
> +                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                4);
> +        qemu_put_be32(f, msi_hi);
> +    }
> +
> +    msi_data = pci_default_read_config(pdev,
> +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +            2);
> +    qemu_put_be32(f, msi_data);
> +
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    int rc = 0;
> +
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> +    vfio_pci_save_config(vdev, f);
> +
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
> +    rc += vfio_get_device_config_size(vdev);
> +    rc += vfio_save_data_device_config(vdev, f);
> +
> +    return rc;
> +}
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +
> +    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> +            VFIO_DEVICE_STATE_LOGGING);
> +    return 0;
> +}
> +
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    uint32_t dev_state = vdev->migration->device_state;
> +
> +    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
> +
> +    vfio_set_device_state(vdev, dev_state);
> +}

These look like they should be type-independent, again.
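For instance, the running/stopped bookkeeping is pure bit manipulation on the device-state word and could live in common code; a sketch with assumed flag values standing in for the patch's VFIO_DEVICE_STATE_* bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag values, standing in for VFIO_DEVICE_STATE_*. */
#define DEV_STATE_RUNNING  (1u << 0)
#define DEV_STATE_STOP     (1u << 1)
#define DEV_STATE_LOGGING  (1u << 2)

/* Mirrors vfio_vm_change_state_handler: toggle STOP with the VM run state. */
static uint32_t devstate_on_vm_change(uint32_t dev_state, bool running)
{
    return running ? (dev_state & ~DEV_STATE_STOP)
                   : (dev_state | DEV_STATE_STOP);
}

/* Mirrors vfio_save_cleanup: dirty logging ends when migration finishes. */
static uint32_t devstate_on_save_cleanup(uint32_t dev_state)
{
    return dev_state & ~DEV_STATE_LOGGING;
}
```

Neither helper touches anything PCI-specific, which is the point: only the pwrite() of the resulting word needs the device's control region.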

> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_live_pending = vfio_save_live_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_cleanup = vfio_save_cleanup,
> +    .load_setup = vfio_load_setup,
> +    .load_state = vfio_load_state,
> +};
> +
> +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> +{
> +    int ret;
> +    Error *local_err = NULL;
> +    vdev->migration = g_new0(VFIOMigration, 1);
> +
> +    if (vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
> +              "device-state-ctl")) {
> +        goto error;
> +    }
> +
> +    if (vfio_check_devstate_version(vdev)) {
> +        goto error;
> +    }
> +
> +    if (vfio_get_device_data_caps(vdev)) {
> +        goto error;
> +    }
> +
> +    if (vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
> +              "device-state-data-device-config")) {
> +        goto error;
> +    }
> +
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        error_report("No suppport of data cap device memory Yet");
> +        goto error;
> +    }
> +
> +    if (vfio_device_data_cap_system_memory(vdev) &&
> +            vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
> +              "device-state-data-dirtybitmap")) {
> +        goto error;
> +    }
> +
> +    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> +
> +    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
> +            VFIO_DEVICE_STATE_INTERFACE_VERSION,
> +            &savevm_vfio_handlers,
> +            vdev);
> +
> +    vdev->migration->vm_state =
> +        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
> +
> +    return 0;
> +error:
> +    error_setg(&vdev->migration_blocker,
> +            "VFIO device doesn't support migration");
> +    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        error_free(vdev->migration_blocker);
> +    }
> +
> +    g_free(vdev->migration);
> +    vdev->migration = NULL;
> +
> +    return ret;
> +}
> +
> +void vfio_migration_finalize(VFIOPCIDevice *vdev)
> +{
> +    if (vdev->migration) {
> +        int i;
> +        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
> +        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
> +        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
> +            vfio_region_finalize(&vdev->migration->region[i]);
> +        }
> +        g_free(vdev->migration);
> +        vdev->migration = NULL;
> +    } else if (vdev->migration_blocker) {
> +        migrate_del_blocker(vdev->migration_blocker);
> +        error_free(vdev->migration_blocker);
> +    }
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c0cb1ec..b8e006b 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -37,7 +37,6 @@
>  
>  #define MSIX_CAP_LENGTH 12
>  
> -#define TYPE_VFIO_PCI "vfio-pci"

Why do you need to move this? That looks like a sign that the layering
needs work.

>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>  
>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index b1ae4c0..4b7b1bb 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -19,6 +19,7 @@
>  #include "qemu/event_notifier.h"
>  #include "qemu/queue.h"
>  #include "qemu/timer.h"
> +#include "sysemu/sysemu.h"
>  
>  #define PCI_ANY_ID (~0)
>  
> @@ -56,6 +57,21 @@ typedef struct VFIOBAR {
>      QLIST_HEAD(, VFIOQuirk) quirks;
>  } VFIOBAR;
>  
> +enum {
> +    VFIO_DEVSTATE_REGION_CTL = 0,
> +    VFIO_DEVSTATE_REGION_DATA_CONFIG,
> +    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
> +    VFIO_DEVSTATE_REGION_DATA_BITMAP,
> +    VFIO_DEVSTATE_REGION_NUM,
> +};
> +typedef struct VFIOMigration {
> +    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
> +    uint32_t data_caps;
> +    uint32_t device_state;
> +    uint64_t devconfig_size;
> +    VMChangeStateEntry *vm_state;
> +} VFIOMigration;
> +
>  typedef struct VFIOVGARegion {
>      MemoryRegion mem;
>      off_t offset;
> @@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
>      VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
>      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
>      void *igd_opregion;
> +    VFIOMigration *migration;

As said, it would probably be better to hang this off VFIODevice.

> +    Error *migration_blocker;
>      PCIHostDeviceAddress host;
>      EventNotifier err_notifier;
>      EventNotifier req_notifier;
> @@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>  void vfio_display_reset(VFIOPCIDevice *vdev);
>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>  void vfio_display_finalize(VFIOPCIDevice *vdev);
> -
> +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
> +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
> +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> +         uint64_t start_addr, uint64_t page_nr);
> +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
> +void vfio_migration_finalize(VFIOPCIDevice *vdev);

And the interfaces should be in vfio-common.

>  #endif /* HW_VFIO_VFIO_PCI_H */
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1b434d0..ed43613 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -32,6 +32,7 @@
>  #endif
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
> +#define TYPE_VFIO_PCI "vfio-pci"
>  
>  enum {
>      VFIO_DEVICE_TYPE_PCI = 0,

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 5/5] vfio/migration: support device memory capability
  2019-02-19  8:53   ` [Qemu-devel] " Yan Zhao
@ 2019-02-19 14:42     ` Christophe de Dinechin
  -1 siblings, 0 replies; 133+ messages in thread
From: Christophe de Dinechin @ 2019-02-19 14:42 UTC (permalink / raw)
  To: Yan Zhao
  Cc: cjia, KVM list, Alexey Kardashevskiy, Zhengxiao.zx,
	shuangtai.tst, qemu-devel, Kirti Wankhede, eauger, yi.l.liu,
	Erik Skultety, ziye.yang, mlevitsk, Halil Pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	Alex Williamson, intel-gvt-dev, changpeng.liu, Cornelia Huck,
	Zhi Wang, jonathan.davies



> On 19 Feb 2019, at 09:53, Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> If a device has device memory capability, save/load data from device memory
> in pre-copy and stop-and-copy phases.
> 
> LOGGING state is set for device memory for dirty page logging:
> in LOGGING state, get device memory returns whole device memory snapshot;
> outside LOGGING state, get device memory returns dirty data since last get
> operation.
> 
> Usually, device memory is very big, qemu needs to chunk it into several
> pieces each with size of device memory region.
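The chunking rule described here amounts to ceiling division of the snapshot size by the data-region size; a stand-alone sketch (illustrative values, not the patch's code):

```c
#include <stdint.h>

/* How many region-sized chunks a device-memory snapshot needs. */
static uint64_t devmem_chunk_count(uint64_t total_len, uint64_t region_size)
{
    if (region_size == 0) {
        return 0; /* guard: a zero-sized data region cannot carry chunks */
    }
    /* ceiling division: the last chunk may be shorter than region_size */
    return (total_len + region_size - 1) / region_size;
}
```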
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
> hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> hw/vfio/pci.h       |   1 +
> 2 files changed, 231 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 16d6395..f1e9309 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
>     return 0;
> }
> 
> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    uint64_t len;
> +    int sz;
> +
> +    sz = sizeof(len);
> +    if (pread(vbasedev->fd, &len, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to get length of device memory");

s/length/size/ ? (to be consistent with function name)
> +        return -1;
> +    }
> +    vdev->migration->devmem_size = len;
> +    return 0;
> +}
> +
> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    int sz;
> +
> +    sz = sizeof(size);
> +    if (pwrite(vbasedev->fd, &size, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> +            != sz) {
> +        error_report("vfio: Failed to set length of device comemory");

What is comemory? Typo?

Same comment about length vs size

> +        return -1;
> +    }
> +    vdev->migration->devmem_size = size;
> +    return 0;
> +}
> +
> +static
> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                    uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> +
> +    if (len > region_devmem->size) {

Is it intentional that there is no error_report here?
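A possible shape for that check with a diagnostic added; `error_report()` is stubbed with `fprintf` here purely so the sketch builds outside QEMU:

```c
#include <stdint.h>
#include <stdio.h>

/* Reject chunks larger than the device memory data region, and say why. */
static int check_chunk_len(uint64_t len, uint64_t region_size)
{
    if (len > region_size) {
        /* in QEMU code this would be error_report() */
        fprintf(stderr,
                "vfio: chunk length 0x%llx exceeds device memory region size 0x%llx\n",
                (unsigned long long)len, (unsigned long long)region_size);
        return -1;
    }
    return 0;
}
```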

> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer pos");
> +        return -1;
> +    }
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set save buffer action");
> +        return -1;
> +    }
> +
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
s/migrate/migration/ ?
> +            return -1;
> +        }
> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> +            error_report("vfio: error load device memory buffer");
s/load/loading/ ?
> +            return -1;
> +        }
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, buf, len);
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_put_be64(f, len);
> +        qemu_put_be64(f, pos);
> +        qemu_put_buffer(f, dest, len);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> +{
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +    uint64_t total_len = vdev->migration->devmem_size;
> +    uint64_t pos = 0;
> +
> +    qemu_put_be64(f, total_len);
> +    while (pos < total_len) {
> +        uint64_t len = region_devmem->size;
> +
> +        if (pos + len >= total_len) {
> +            len = total_len - pos;
> +        }
> +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> +            return -1;
> +        }

I don’t see where pos is incremented in this loop
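With the missing advance added, the loop's intended shape might look like the stand-alone sketch below (the callback type and names are hypothetical; without `pos += len` the loop as posted never terminates once `total_len > 0`):

```c
#include <stdint.h>

/* Hypothetical chunk writer, standing in for
 * vfio_save_data_device_memory_chunk() for illustration. */
typedef int (*save_chunk_fn)(uint64_t pos, uint64_t len, void *opaque);

static int save_device_memory_sketch(uint64_t total_len, uint64_t region_size,
                                     save_chunk_fn save_chunk, void *opaque)
{
    uint64_t pos = 0;

    while (pos < total_len) {
        uint64_t len = region_size;

        if (pos + len >= total_len) {
            len = total_len - pos;  /* final, possibly short, chunk */
        }
        if (save_chunk(pos, len, opaque)) {
            return -1;
        }
        pos += len;  /* the advance the review notes is missing */
    }
    return 0;
}

/* Trivial callback for the usage example: counts chunks written. */
static int counting_chunk(uint64_t pos, uint64_t len, void *opaque)
{
    (void)pos; (void)len;
    ++*(int *)opaque;
    return 0;
}
```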

> +    }
> +
> +    return 0;
> +}
> +
> +static
> +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> +                                uint64_t pos, uint64_t len)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    VFIORegion *region_ctl =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> +    VFIORegion *region_devmem =
> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> +
> +    void *dest;
> +    uint32_t sz;
> +    uint8_t *buf = NULL;
> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> +
> +    if (len > region_devmem->size) {

error_report?
> +        return -1;
> +    }
> +
> +    sz = sizeof(pos);
> +    if (pwrite(vbasedev->fd, &pos, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> +            != sz) {
> +        error_report("vfio: Failed to set device memory buffer pos");
> +        return -1;
> +    }
> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> +        buf = g_malloc(len);
> +        if (buf == NULL) {
> +            error_report("vfio: Failed to allocate memory for migrate");
> +            return -1;
> +        }
> +        qemu_get_buffer(f, buf, len);
> +        if (pwrite(vbasedev->fd, buf, len,
> +                    region_devmem->fd_offset) != len) {
> +            error_report("vfio: Failed to load device memory buffer");
> +            return -1;
> +        }
> +        g_free(buf);
> +    } else {
> +        dest = region_devmem->mmaps[0].mmap;
> +        qemu_get_buffer(f, dest, len);
> +    }
> +
> +    sz = sizeof(action);
> +    if (pwrite(vbasedev->fd, &action, sz,
> +                region_ctl->fd_offset +
> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> +            != sz) {
> +        error_report("vfio: Failed to set load device memory buffer action");
> +        return -1;
> +    }
> +
> +    return 0;
> +
> +}
> +
> +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> +                        QEMUFile *f, uint64_t total_len)
> +{
> +    uint64_t pos = 0, len = 0;
> +
> +    vfio_set_device_memory_size(vdev, total_len);
> +
> +    while (pos + len < total_len) {
> +        len = qemu_get_be64(f);
> +        pos = qemu_get_be64(f);

Nit: load reads len/pos in the loop, whereas save does it in the
inner function (vfio_save_data_device_memory_chunk)
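One way to restore the symmetry would be to pull the `(len, pos)` header read into the chunk routine on the load side too; a stand-alone sketch with a memory cursor standing in for QEMUFile (all names here are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-in for QEMUFile: a cursor over an in-memory buffer. */
typedef struct {
    const uint8_t *buf;
    size_t off, size;
} MemStream;

/* Big-endian 64-bit read, mirroring qemu_get_be64(). */
static uint64_t stream_get_be64(MemStream *s)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | s->buf[s->off++];
    }
    return v;
}

/* Read the chunk header inside the chunk routine, so load mirrors
 * vfio_save_data_device_memory_chunk(), which writes the header itself. */
static int load_chunk_header(MemStream *f, uint64_t *len, uint64_t *pos)
{
    *len = stream_get_be64(f);
    *pos = stream_get_be64(f);
    return 0;
}
```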

> +
> +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> +    }
> +
> +    return 0;
> +}
> +
> +
> static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
>         uint64_t start_addr, uint64_t page_nr)
> {
> @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
>         return;
>     }
> 
> +    /* get dirty data size of device memory */
> +    vfio_get_device_memory_size(vdev);
> +
> +    *res_precopy_only += vdev->migration->devmem_size;
>     return;
> }
> 
> @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>         return 0;
>     }
> 
> -    return 0;
> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +    /* get dirty data of device memory */
> +    return vfio_save_data_device_memory(vdev, f);
> }
> 
> static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>             len = qemu_get_be64(f);
>             vfio_load_data_device_config(vdev, f, len);
>             break;
> +        case VFIO_SAVE_FLAG_DEVMEMORY:
> +            len = qemu_get_be64(f);
> +            vfio_load_data_device_memory(vdev, f, len);
> +            break;
>         default:
>             ret = -EINVAL;
>         }
> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>     VFIOPCIDevice *vdev = opaque;
>     int rc = 0;
> 
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> +        /* get dirty data of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    }
> +
>     qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
>     vfio_pci_save_config(vdev, f);
> 
> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> 
> static int vfio_save_setup(QEMUFile *f, void *opaque)
> {
> +    int rc = 0;
>     VFIOPCIDevice *vdev = opaque;
> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +
> +    if (vfio_device_data_cap_device_memory(vdev)) {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> +        /* get whole snapshot of device memory */
> +        vfio_get_device_memory_size(vdev);
> +        rc = vfio_save_data_device_memory(vdev, f);
> +    } else {
> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> +    }
> 
>     vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
>             VFIO_DEVICE_STATE_LOGGING);
> -    return 0;
> +    return rc;
> }
> 
> static int vfio_load_setup(QEMUFile *f, void *opaque)
> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
>         goto error;
>     }
> 
> -    if (vfio_device_data_cap_device_memory(vdev)) {
> -        error_report("No suppport of data cap device memory Yet");
> +    if (vfio_device_data_cap_device_memory(vdev) &&
> +            vfio_device_state_region_setup(vdev,
> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> +              "device-state-data-device-memory")) {
>         goto error;
>     }
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 4b7b1bb..a2cc64b 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
>     uint32_t data_caps;
>     uint32_t device_state;
>     uint64_t devconfig_size;
> +    uint64_t devmem_size;
>     VMChangeStateEntry *vm_state;
> } VFIOMigration;
> 
> -- 
> 2.7.4
> 
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 2/5] vfio/migration: support device of device config capability
  2019-02-19 11:01     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2019-02-20  5:12       ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20  5:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, zhi.a.wang, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck, Ken.Xue,
	jonathan.davies

On Tue, Feb 19, 2019 at 11:01:45AM +0000, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > Device config is the default data that every device should have. so
> > device config capability is by default on, no need to set.
> > 
> > - Currently two type of resources are saved/loaded for device of device
> >   config capability:
> >   General PCI config data, and Device config data.
> >   They are copies as a whole when precopy is stopped.
> > 
> > Migration setup flow:
> > - Setup device state regions, check its device state version and capabilities.
> >   Mmap Device Config Region and Dirty Bitmap Region, if available.
> > - If device state regions are failed to get setup, a migration blocker is
> >   registered instead.
> > - Added SaveVMHandlers to register device state save/load handlers.
> > - Register VM state change handler to set device's running/stop states.
> > - On migration startup on source machine, set device's state to
> >   VFIO_DEVICE_STATE_LOGGING
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> > ---
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |   1 -
> >  hw/vfio/pci.h                 |  25 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  5 files changed, 659 insertions(+), 3 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > index 8b3f664..f32ff19 100644
> > --- a/hw/vfio/Makefile.objs
> > +++ b/hw/vfio/Makefile.objs
> > @@ -1,6 +1,6 @@
> >  ifeq ($(CONFIG_LINUX), y)
> >  obj-$(CONFIG_SOFTMMU) += common.o
> > -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> > +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
> >  obj-$(CONFIG_VFIO_CCW) += ccw.o
> >  obj-$(CONFIG_SOFTMMU) += platform.o
> >  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > new file mode 100644
> > index 0000000..16d6395
> > --- /dev/null
> > +++ b/hw/vfio/migration.c
> > @@ -0,0 +1,633 @@
> > +#include "qemu/osdep.h"
> > +
> > +#include "hw/vfio/vfio-common.h"
> > +#include "migration/blocker.h"
> > +#include "migration/register.h"
> > +#include "qapi/error.h"
> > +#include "pci.h"
> > +#include "sysemu/kvm.h"
> > +#include "exec/ram_addr.h"
> > +
> > +#define VFIO_SAVE_FLAG_SETUP 0
> > +#define VFIO_SAVE_FLAG_PCI 1
> > +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> > +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> > +#define VFIO_SAVE_FLAG_CONTINUE 8
> > +
> > +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> > +        VFIORegion *region, uint32_t subtype, const char *name)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    struct vfio_region_info *info;
> > +    int ret;
> > +
> > +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> > +            subtype, &info);
> > +    if (ret) {
> > +        error_report("Failed to get info of region %s", name);
> > +        return ret;
> > +    }
> > +
> > +    if (vfio_region_setup(OBJECT(vdev), vbasedev,
> > +            region, info->index, name)) {
> > +        error_report("Failed to setup migration region %s", name);
> > +        return ret;
> > +    }
> > +
> > +    if (vfio_region_mmap(region)) {
> > +        error_report("Failed to mmap migration region %s", name);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> > +{
> > +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> > +}
> > +
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> > +{
> > +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> > +}
> > +
> > +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> > +{
> > +    bool mmaped = true;
> > +    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> > +            (region->size != region->mmaps[0].size) ||
> > +            (region->mmaps[0].mmap == NULL)) {
> > +        mmaped = false;
> > +    }
> > +
> > +    return mmaped;
> > +}
> > +
> > +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    uint64_t len;
> > +    int sz;
> > +
> > +    sz = sizeof(len);
> > +    if (pread(vbasedev->fd, &len, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to get length of device config");
> > +        return -1;
> > +    }
> > +    if (len > region_config->size) {
> > +        error_report("vfio: Error device config length");
> > +        return -1;
> > +    }
> > +    vdev->migration->devconfig_size = len;
> > +
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    int sz;
> > +
> > +    if (size > region_config->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(size);
> > +    if (pwrite(vbasedev->fd, &size, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to set length of device config");
> > +        return -1;
> > +    }
> > +    vdev->migration->devconfig_size = size;
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +    uint64_t len = vdev->migration->devconfig_size;
> > +
> > +    qemu_put_be64(f, len);
> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.action))
> > +            != sz) {
> > +        error_report("vfio: action failure for device config get buffer");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_config)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> 
> g_malloc never returns NULL; it aborts on failure to allocate.
> So you can either drop the check, or my preference is to use
> g_try_malloc for large/unknown areas, and it can return NULL.
> 
ok. got that. I'll use g_try_malloc next time :)
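As a sketch of the pattern agreed on here — a fallible allocation that reports failure instead of aborting — the fragment below uses plain `malloc` as a stand-in for glib's `g_try_malloc` (which likewise returns NULL on failure). `try_alloc_buf` and `save_config_chunk` are hypothetical names, and the two `memcpy` calls stand in for the `pread()`/`qemu_put_buffer()` pair in the patch:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for glib's g_try_malloc(): returns NULL on failure instead of
 * aborting the way g_malloc() does. */
static void *try_alloc_buf(size_t len)
{
    return malloc(len);
}

/* Hypothetical save path: use a bounce buffer when the region is not
 * mmaped, and bail out cleanly if the allocation fails. */
static int save_config_chunk(const uint8_t *src, size_t len, uint8_t *out)
{
    uint8_t *buf = try_alloc_buf(len);

    if (buf == NULL) {
        fprintf(stderr, "vfio: failed to allocate migration buffer\n");
        return -1;
    }
    memcpy(buf, src, len);   /* stands in for pread() from the region */
    memcpy(out, buf, len);   /* stands in for qemu_put_buffer() */
    free(buf);
    return 0;
}
```

The only behavioural difference from the posted patch is that an allocation failure here becomes a recoverable -1 instead of an abort.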

> > +        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
> > +            error_report("vfio: Failed read device config buffer");
> > +            return -1;
> > +        }
> > +        qemu_put_buffer(f, buf, len);
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_config->mmaps[0].mmap;
> > +        qemu_put_buffer(f, dest, len);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> > +                            QEMUFile *f, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    vfio_set_device_config_size(vdev, len);
> > +
> > +    if (!vfio_device_state_region_mmaped(region_config)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_config->fd_offset) != len) {
> > +            error_report("vfio: Failed to write device config buffer");
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_config->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.action))
> > +            != sz) {
> > +        error_report("vfio: action failure for device config set buffer");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> > +        uint64_t start_addr, uint64_t page_nr)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_bitmap =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> > +    unsigned long bitmap_size =
> > +                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
> > +    uint32_t sz;
> > +
> > +    struct {
> > +        __u64 start_addr;
> > +        __u64 page_nr;
> > +    } system_memory;
> > +    system_memory.start_addr = start_addr;
> > +    system_memory.page_nr = page_nr;
> > +    sz = sizeof(system_memory);
> > +    if (pwrite(vbasedev->fd, &system_memory, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, system_memory))
> > +            != sz) {
> > +        error_report("vfio: Failed to set system memory range for dirty pages");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_bitmap)) {
> > +        void *bitmap = g_malloc0(bitmap_size);
> > +
> > +        if (pread(vbasedev->fd, bitmap, bitmap_size,
> > +                    region_bitmap->fd_offset) != bitmap_size) {
> > +            error_report("vfio: Failed to read dirty bitmap data");
> > +            return -1;
> > +        }
> > +
> > +        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
> > +
> > +        g_free(bitmap);
> > +    } else {
> > +        cpu_physical_memory_set_dirty_lebitmap(
> > +                    region_bitmap->mmaps[0].mmap,
> > +                    start_addr, page_nr);
> > +    }
> > +    return 0;
> > +}
> > +
> > +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> > +        uint64_t start_addr, uint64_t page_nr)
> > +{
> > +    VFIORegion *region_bitmap =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> > +    unsigned long chunk_size = region_bitmap->size;
> > +    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
> > +                                BITS_PER_LONG;
> > +
> > +    uint64_t cnt_left;
> > +    int rc = 0;
> > +
> > +    cnt_left = page_nr;
> > +
> > +    while (cnt_left >= chunk_pg_nr) {
> > +        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
> > +        if (rc) {
> > +            goto exit;
> > +        }
> > +        cnt_left -= chunk_pg_nr;
> > +        start_addr += chunk_pg_nr * TARGET_PAGE_SIZE;
> > +    }
> > +    rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
> > +
> > +exit:
> > +    return rc;
> > +}
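The chunked walk above can be illustrated with a stand-alone sketch of just the arithmetic: how many pages one pass over the bitmap region covers, and how the start address advances per full chunk. `PAGE_SZ` is an assumed 4 KiB target page size and the helper names are hypothetical:

```c
#include <limits.h>
#include <stdint.h>

#define PAGE_SZ 4096ULL                       /* assumed target page size */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Pages covered by one pass over a bitmap region of chunk_size bytes. */
static uint64_t chunk_page_nr(uint64_t chunk_size)
{
    return (chunk_size / sizeof(unsigned long)) * DEMO_BITS_PER_LONG;
}

/* Replay the loop's bookkeeping: count full-chunk iterations and return
 * the start address the final, partial chunk would be queried with. */
static uint64_t last_chunk_addr(uint64_t start_addr, uint64_t page_nr,
                                uint64_t chunk_size, unsigned *iters)
{
    uint64_t left = page_nr;
    uint64_t pgs = chunk_page_nr(chunk_size);

    *iters = 0;
    while (left >= pgs) {
        left -= pgs;
        start_addr += pgs * PAGE_SZ;          /* advance by one chunk */
        (*iters)++;
    }
    return start_addr;
}
```

With a 4 KiB bitmap region and 64-bit longs one chunk describes 32768 pages, so the address must advance by 32768 pages per iteration — not by `start_addr` itself as in the posted hunk.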
> > +
> > +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> > +        uint32_t dev_state)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    uint32_t sz = sizeof(dev_state);
> > +
> > +    if (!vdev->migration) {
> > +        return -1;
> > +    }
> > +
> > +    if (pwrite(vbasedev->fd, &dev_state, sz,
> > +              region->fd_offset +
> > +              offsetof(struct vfio_device_state_ctl, device_state))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device state %d", dev_state);
> > +        return -1;
> > +    }
> > +    vdev->migration->device_state = dev_state;
> > +    return 0;
> > +}
> > +
> > +static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +
> > +    uint32_t caps;
> > +    uint32_t size = sizeof(caps);
> > +
> > +    if (pread(vbasedev->fd, &caps, size,
> > +                region->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, caps))
> > +            != size) {
> > +        error_report("%s Failed to read data caps of device states",
> > +                vbasedev->name);
> > +        return -1;
> > +    }
> > +    vdev->migration->data_caps = caps;
> > +    return 0;
> > +}
> > +
> > +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +
> > +    uint32_t version;
> > +    uint32_t size = sizeof(version);
> > +
> > +    if (pread(vbasedev->fd, &version, size,
> > +                region->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, version))
> > +            != size) {
> > +        error_report("%s Failed to read version of device state interfaces",
> > +                vbasedev->name);
> > +        return -1;
> > +    }
> > +
> > +    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > +        error_report("%s migration version mismatch, right version is %d",
> > +                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
> > +{
> > +    VFIOPCIDevice *vdev = pv;
> > +    uint32_t dev_state = vdev->migration->device_state;
> > +
> > +    if (!running) {
> > +        dev_state |= VFIO_DEVICE_STATE_STOP;
> > +    } else {
> > +        dev_state &= ~VFIO_DEVICE_STATE_STOP;
> > +    }
> > +
> > +    vfio_set_device_state(vdev, dev_state);
> > +}
> > +
> > +static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> > +                                   uint64_t max_size,
> > +                                   uint64_t *res_precopy_only,
> > +                                   uint64_t *res_compatible,
> > +                                   uint64_t *res_post_copy_only)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +
> > +    if (!vfio_device_data_cap_device_memory(vdev)) {
> > +        return;
> > +    }
> > +
> > +    return;
> > +}
> > +
> > +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +
> > +    if (!vfio_device_data_cap_device_memory(vdev)) {
> > +        return 0;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > +    bool msi_64bit;
> > +
> > +    /* restore pci bar configuration */
> > +    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +            ctl & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        bar_cfg = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> > +    }
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> > +
> > +    /* restore msi configuration */
> > +    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> > +
> > +    vfio_pci_write_config(&vdev->pdev,
> > +            pdev->msi_cap + PCI_MSI_FLAGS,
> > +            ctl & ~PCI_MSI_FLAGS_ENABLE, 2);
> > +
> > +    msi_lo = qemu_get_be32(f);
> > +    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> > +
> > +    if (msi_64bit) {
> > +        msi_hi = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                msi_hi, 4);
> > +    }
> > +    msi_data = qemu_get_be32(f);
> > +    vfio_pci_write_config(pdev,
> > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +            msi_data, 2);
> > +
> > +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +            ctl | PCI_MSI_FLAGS_ENABLE, 2);
> 
> It would probably be best to use a VMStateDescription and the macros
> for this if possible; I bet you'll want to add more fields in the future
> for example.
Yes, it is also a good idea to use VMStateDescription for the general PCI
config data, which is saved once in the stop-and-copy phase.
But it's a little hard to maintain two types of interfaces.
Maybe follow the approach VMStateDescription takes, i.e. a field list and
macros?
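The "field list and macros" idea can be sketched as below. This is not QEMU's actual VMStateDescription API, just a minimal illustration of the design being discussed: a table of (name, offset, size) entries fixes the save order, so adding a field later only means adding a table entry. All names here are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device-config state, standing in for the PCI fields
 * (BARs, MSI address/data) saved one by one in the patch. */
typedef struct {
    uint32_t bar0;
    uint32_t msi_lo;
    uint32_t msi_data;
} DevConfig;

/* A field list in the spirit of VMStateDescription: each entry names a
 * field, records its offset and size, and the table fixes the order. */
typedef struct {
    const char *name;
    size_t offset;
    size_t size;
} Field;

#define FIELD(type, member) \
    { #member, offsetof(type, member), sizeof(((type *)0)->member) }

static const Field dev_config_fields[] = {
    FIELD(DevConfig, bar0),
    FIELD(DevConfig, msi_lo),
    FIELD(DevConfig, msi_data),
};

/* Serialize every listed field, in table order, into 'out'.
 * Returns the number of bytes written. */
static size_t save_fields(const DevConfig *s, uint8_t *out)
{
    size_t pos = 0, i;

    for (i = 0; i < sizeof(dev_config_fields) / sizeof(Field); i++) {
        const Field *f = &dev_config_fields[i];
        memcpy(out + pos, (const uint8_t *)s + f->offset, f->size);
        pos += f->size;
    }
    return pos;
}
```

A matching load loop walking the same table would give the symmetric save/load interface the reviewer asks for, without hand-written get/put pairs per field.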

> Also what happens if the data read from the migration stream is bad or
> doesn't agree with this devices hardware? How does this fail?
>
Right, errors in the migration stream need to be checked. I'll add the
error check.
For the hardware incompatibility issue, it seems vfio_pci_write_config
does not return an error even when the device driver fails the write.
It probably needs to be addressed elsewhere, e.g. by checking
compatibility before launching the migration.
> > +}
> > +
> > +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    int flag;
> > +    uint64_t len;
> > +    int ret = 0;
> > +
> > +    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    do {
> > +        flag = qemu_get_byte(f);
> > +
> > +        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
> > +        case VFIO_SAVE_FLAG_SETUP:
> > +            break;
> > +        case VFIO_SAVE_FLAG_PCI:
> > +            vfio_pci_load_config(vdev, f);
> > +            break;
> > +        case VFIO_SAVE_FLAG_DEVCONFIG:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_config(vdev, f, len);
> > +            break;
> > +        default:
> > +            ret = -EINVAL;
> > +        }
> > +    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
> > +
> > +    return ret;
> > +}
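The flag-driven section loop in vfio_load_state() above can be modelled with a stand-alone sketch. Payload bytes (such as the be64 length following VFIO_SAVE_FLAG_DEVCONFIG) are elided for brevity; `count_sections` is a hypothetical helper that reuses the patch's flag values:

```c
#include <stddef.h>
#include <stdint.h>

#define VFIO_SAVE_FLAG_SETUP     0
#define VFIO_SAVE_FLAG_PCI       1
#define VFIO_SAVE_FLAG_DEVCONFIG 2
#define VFIO_SAVE_FLAG_CONTINUE  8

/* Walk a byte stream of section flags the way vfio_load_state() does:
 * each byte names a section, and the CONTINUE bit says another section
 * follows. Returns the number of sections seen, or -1 on a bad flag or
 * a truncated stream. */
static int count_sections(const uint8_t *stream, size_t len)
{
    size_t pos = 0;
    int n = 0;
    uint8_t flag;

    do {
        if (pos >= len) {
            return -1;                 /* stream truncated */
        }
        flag = stream[pos++];
        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
        case VFIO_SAVE_FLAG_SETUP:
        case VFIO_SAVE_FLAG_PCI:
        case VFIO_SAVE_FLAG_DEVCONFIG:
            n++;
            break;
        default:
            return -1;                 /* unknown section */
        }
    } while (flag & VFIO_SAVE_FLAG_CONTINUE);

    return n;
}
```

This also shows where the posted loop keeps going on an unknown flag: here a bad flag terminates the walk immediately instead of only recording -EINVAL.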
> > +
> > +static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > +    bool msi_64bit;
> > +
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > +        qemu_put_be32(f, bar_cfg);
> > +    }
> > +
> > +    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
> > +
> > +    msi_lo = pci_default_read_config(pdev,
> > +            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > +    qemu_put_be32(f, msi_lo);
> > +
> > +    if (msi_64bit) {
> > +        msi_hi = pci_default_read_config(pdev,
> > +                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                4);
> > +        qemu_put_be32(f, msi_hi);
> > +    }
> > +
> > +    msi_data = pci_default_read_config(pdev,
> > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +            2);
> > +    qemu_put_be32(f, msi_data);
> > +
> > +}
> > +
> > +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    int rc = 0;
> > +
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> > +    vfio_pci_save_config(vdev, f);
> > +
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
> > +    rc += vfio_get_device_config_size(vdev);
> > +    rc += vfio_save_data_device_config(vdev, f);
> > +
> > +    return rc;
> > +}
> > +
> > +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> > +            VFIO_DEVICE_STATE_LOGGING);
> > +    return 0;
> > +}
> > +
> > +static int vfio_load_setup(QEMUFile *f, void *opaque)
> > +{
> > +    return 0;
> > +}
> > +
> > +static void vfio_save_cleanup(void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    uint32_t dev_state = vdev->migration->device_state;
> > +
> > +    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
> > +
> > +    vfio_set_device_state(vdev, dev_state);
> > +}
> > +
> > +static SaveVMHandlers savevm_vfio_handlers = {
> > +    .save_setup = vfio_save_setup,
> > +    .save_live_pending = vfio_save_live_pending,
> > +    .save_live_iterate = vfio_save_iterate,
> > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> > +    .save_cleanup = vfio_save_cleanup,
> > +    .load_setup = vfio_load_setup,
> > +    .load_state = vfio_load_state,
> > +};
> > +
> > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> > +{
> > +    int ret;
> > +    Error *local_err = NULL;
> > +    vdev->migration = g_new0(VFIOMigration, 1);
> > +
> > +    if (vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
> > +              "device-state-ctl")) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_check_devstate_version(vdev)) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_get_device_data_caps(vdev)) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
> > +              "device-state-data-device-config")) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        error_report("No support of data cap device memory yet");
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_data_cap_system_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
> > +              "device-state-data-dirtybitmap")) {
> > +        goto error;
> > +    }
> > +
> > +    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> > +
> > +    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
> > +            VFIO_DEVICE_STATE_INTERFACE_VERSION,
> > +            &savevm_vfio_handlers,
> > +            vdev);
> > +
> > +    vdev->migration->vm_state =
> > +        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
> > +
> > +    return 0;
> > +error:
> > +    error_setg(&vdev->migration_blocker,
> > +            "VFIO device doesn't support migration");
> > +    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
> > +    if (local_err) {
> > +        error_propagate(errp, local_err);
> > +        error_free(vdev->migration_blocker);
> > +    }
> > +
> > +    g_free(vdev->migration);
> > +    vdev->migration = NULL;
> > +
> > +    return ret;
> > +}
> > +
> > +void vfio_migration_finalize(VFIOPCIDevice *vdev)
> > +{
> > +    if (vdev->migration) {
> > +        int i;
> > +        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
> > +        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
> > +        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
> > +            vfio_region_finalize(&vdev->migration->region[i]);
> > +        }
> > +        g_free(vdev->migration);
> > +        vdev->migration = NULL;
> > +    } else if (vdev->migration_blocker) {
> > +        migrate_del_blocker(vdev->migration_blocker);
> > +        error_free(vdev->migration_blocker);
> > +    }
> > +}
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c0cb1ec..b8e006b 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -37,7 +37,6 @@
> >  
> >  #define MSIX_CAP_LENGTH 12
> >  
> > -#define TYPE_VFIO_PCI "vfio-pci"
> >  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> >  
> >  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index b1ae4c0..4b7b1bb 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -19,6 +19,7 @@
> >  #include "qemu/event_notifier.h"
> >  #include "qemu/queue.h"
> >  #include "qemu/timer.h"
> > +#include "sysemu/sysemu.h"
> >  
> >  #define PCI_ANY_ID (~0)
> >  
> > @@ -56,6 +57,21 @@ typedef struct VFIOBAR {
> >      QLIST_HEAD(, VFIOQuirk) quirks;
> >  } VFIOBAR;
> >  
> > +enum {
> > +    VFIO_DEVSTATE_REGION_CTL = 0,
> > +    VFIO_DEVSTATE_REGION_DATA_CONFIG,
> > +    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
> > +    VFIO_DEVSTATE_REGION_DATA_BITMAP,
> > +    VFIO_DEVSTATE_REGION_NUM,
> > +};
> > +typedef struct VFIOMigration {
> > +    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
> > +    uint32_t data_caps;
> > +    uint32_t device_state;
> > +    uint64_t devconfig_size;
> > +    VMChangeStateEntry *vm_state;
> > +} VFIOMigration;
> > +
> >  typedef struct VFIOVGARegion {
> >      MemoryRegion mem;
> >      off_t offset;
> > @@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
> >      VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
> >      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
> >      void *igd_opregion;
> > +    VFIOMigration *migration;
> > +    Error *migration_blocker;
> >      PCIHostDeviceAddress host;
> >      EventNotifier err_notifier;
> >      EventNotifier req_notifier;
> > @@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> >  void vfio_display_reset(VFIOPCIDevice *vdev);
> >  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> >  void vfio_display_finalize(VFIOPCIDevice *vdev);
> > -
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
> > +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> > +         uint64_t start_addr, uint64_t page_nr);
> > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
> > +void vfio_migration_finalize(VFIOPCIDevice *vdev);
> >  #endif /* HW_VFIO_VFIO_PCI_H */
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index 1b434d0..ed43613 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -32,6 +32,7 @@
> >  #endif
> >  
> >  #define VFIO_MSG_PREFIX "vfio %s: "
> > +#define TYPE_VFIO_PCI "vfio-pci"
> >  
> >  enum {
> >      VFIO_DEVICE_TYPE_PCI = 0,
> > -- 
> > 2.7.4
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] vfio/migration: support device of device config capability
@ 2019-02-20  5:12       ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20  5:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck,
	zhi.a.wang, jonathan.davies

On Tue, Feb 19, 2019 at 11:01:45AM +0000, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > Device config is the default data that every device should have, so the
> > device config capability is on by default and need not be set.
> > 
> > - Currently two types of resources are saved/loaded for a device with
> >   device config capability:
> >   General PCI config data, and device config data.
> >   They are copied as a whole when precopy is stopped.
> > 
> > Migration setup flow:
> > - Setup device state regions, check its device state version and capabilities.
> >   Mmap Device Config Region and Dirty Bitmap Region, if available.
> > - If device state regions fail to be set up, a migration blocker is
> >   registered instead.
> > - Added SaveVMHandlers to register device state save/load handlers.
> > - Register VM state change handler to set device's running/stop states.
> > - On migration startup on source machine, set device's state to
> >   VFIO_DEVICE_STATE_LOGGING
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> > ---
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |   1 -
> >  hw/vfio/pci.h                 |  25 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  5 files changed, 659 insertions(+), 3 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > index 8b3f664..f32ff19 100644
> > --- a/hw/vfio/Makefile.objs
> > +++ b/hw/vfio/Makefile.objs
> > @@ -1,6 +1,6 @@
> >  ifeq ($(CONFIG_LINUX), y)
> >  obj-$(CONFIG_SOFTMMU) += common.o
> > -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> > +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
> >  obj-$(CONFIG_VFIO_CCW) += ccw.o
> >  obj-$(CONFIG_SOFTMMU) += platform.o
> >  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > new file mode 100644
> > index 0000000..16d6395
> > --- /dev/null
> > +++ b/hw/vfio/migration.c
> > @@ -0,0 +1,633 @@
> > +#include "qemu/osdep.h"
> > +
> > +#include "hw/vfio/vfio-common.h"
> > +#include "migration/blocker.h"
> > +#include "migration/register.h"
> > +#include "qapi/error.h"
> > +#include "pci.h"
> > +#include "sysemu/kvm.h"
> > +#include "exec/ram_addr.h"
> > +
> > +#define VFIO_SAVE_FLAG_SETUP 0
> > +#define VFIO_SAVE_FLAG_PCI 1
> > +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> > +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> > +#define VFIO_SAVE_FLAG_CONTINUE 8
> > +
> > +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> > +        VFIORegion *region, uint32_t subtype, const char *name)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    struct vfio_region_info *info;
> > +    int ret;
> > +
> > +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> > +            subtype, &info);
> > +    if (ret) {
> > +        error_report("Failed to get info of region %s", name);
> > +        return ret;
> > +    }
> > +
> > +    if (vfio_region_setup(OBJECT(vdev), vbasedev,
> > +            region, info->index, name)) {
> > +        error_report("Failed to setup migration region %s", name);
> > +        return -1;
> > +    }
> > +
> > +    if (vfio_region_mmap(region)) {
> > +        error_report("Failed to mmap migration region %s", name);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> > +{
> > +    return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> > +}
> > +
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> > +{
> > +    return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> > +}
> > +
> > +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> > +{
> > +    bool mmaped = true;
> > +    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> > +            (region->size != region->mmaps[0].size) ||
> > +            (region->mmaps[0].mmap == NULL)) {
> > +        mmaped = false;
> > +    }
> > +
> > +    return mmaped;
> > +}
> > +
> > +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    uint64_t len;
> > +    int sz;
> > +
> > +    sz = sizeof(len);
> > +    if (pread(vbasedev->fd, &len, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to get length of device config");
> > +        return -1;
> > +    }
> > +    if (len > region_config->size) {
> > +        error_report("vfio: Error device config length");
> > +        return -1;
> > +    }
> > +    vdev->migration->devconfig_size = len;
> > +
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    int sz;
> > +
> > +    if (size > region_config->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(size);
> > +    if (pwrite(vbasedev->fd, &size, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to set length of device config");
> > +        return -1;
> > +    }
> > +    vdev->migration->devconfig_size = size;
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +    uint64_t len = vdev->migration->devconfig_size;
> > +
> > +    qemu_put_be64(f, len);
> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.action))
> > +            != sz) {
> > +        error_report("vfio: action failure for device config get buffer");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_config)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> 
> g_malloc never returns NULL; it aborts on failure to allocate.
> So you can either drop the check, or my preference is to use
> g_try_malloc for large/unknown areas, and it can return NULL.
> 
ok. got that. I'll use g_try_malloc next time :)

> > +        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
> > +            error_report("vfio: Failed read device config buffer");
> > +            return -1;
> > +        }
> > +        qemu_put_buffer(f, buf, len);
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_config->mmaps[0].mmap;
> > +        qemu_put_buffer(f, dest, len);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> > +                            QEMUFile *f, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    vfio_set_device_config_size(vdev, len);
> > +
> > +    if (!vfio_device_state_region_mmaped(region_config)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_config->fd_offset) != len) {
> > +            error_report("vfio: Failed to write devie config buffer");
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_config->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.action))
> > +            != sz) {
> > +        error_report("vfio: action failure for device config set buffer");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> > +        uint64_t start_addr, uint64_t page_nr)
> > +{
> > +
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_bitmap =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> > +    unsigned long bitmap_size =
> > +                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
> > +    uint32_t sz;
> > +
> > +    struct {
> > +        __u64 start_addr;
> > +        __u64 page_nr;
> > +    } system_memory;
> > +    system_memory.start_addr = start_addr;
> > +    system_memory.page_nr = page_nr;
> > +    sz = sizeof(system_memory);
> > +    if (pwrite(vbasedev->fd, &system_memory, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, system_memory))
> > +            != sz) {
> > +        error_report("vfio: Failed to set system memory range for dirty pages");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_bitmap)) {
> > +        void *bitmap = g_malloc0(bitmap_size);
> > +
> > +        if (pread(vbasedev->fd, bitmap, bitmap_size,
> > +                    region_bitmap->fd_offset) != bitmap_size) {
> > +            error_report("vfio: Failed to read dirty bitmap data");
> > +            return -1;
> > +        }
> > +
> > +        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
> > +
> > +        g_free(bitmap);
> > +    } else {
> > +        cpu_physical_memory_set_dirty_lebitmap(
> > +                    region_bitmap->mmaps[0].mmap,
> > +                    start_addr, page_nr);
> > +    }
> > +    return 0;
> > +}
> > +
> > +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> > +        uint64_t start_addr, uint64_t page_nr)
> > +{
> > +    VFIORegion *region_bitmap =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> > +    unsigned long chunk_size = region_bitmap->size;
> > +    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
> > +                                BITS_PER_LONG;
> > +
> > +    uint64_t cnt_left;
> > +    int rc = 0;
> > +
> > +    cnt_left = page_nr;
> > +
> > +    while (cnt_left >= chunk_pg_nr) {
> > +        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
> > +        if (rc) {
> > +            goto exit;
> > +        }
> > +        cnt_left -= chunk_pg_nr;
> > +        start_addr += chunk_pg_nr * TARGET_PAGE_SIZE;
> > +    }
> > +    rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
> > +
> > +exit:
> > +    return rc;
> > +}
> > +
> > +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> > +        uint32_t dev_state)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region;
> > +    uint32_t sz = sizeof(dev_state);
> > +
> > +    if (!vdev->migration) {
> > +        return -1;
> > +    }
> > +    region = &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +
> > +    if (pwrite(vbasedev->fd, &dev_state, sz,
> > +              region->fd_offset +
> > +              offsetof(struct vfio_device_state_ctl, device_state))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device state %d", dev_state);
> > +        return -1;
> > +    }
> > +    vdev->migration->device_state = dev_state;
> > +    return 0;
> > +}
> > +
> > +static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +
> > +    uint32_t caps;
> > +    uint32_t size = sizeof(caps);
> > +
> > +    if (pread(vbasedev->fd, &caps, size,
> > +                region->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, caps))
> > +            != size) {
> > +        error_report("%s Failed to read data caps of device states",
> > +                vbasedev->name);
> > +        return -1;
> > +    }
> > +    vdev->migration->data_caps = caps;
> > +    return 0;
> > +}
> > +
> > +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +
> > +    uint32_t version;
> > +    uint32_t size = sizeof(version);
> > +
> > +    if (pread(vbasedev->fd, &version, size,
> > +                region->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, version))
> > +            != size) {
> > +        error_report("%s Failed to read version of device state interfaces",
> > +                vbasedev->name);
> > +        return -1;
> > +    }
> > +
> > +    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > +        error_report("%s migration version mismatch, expected version is %d",
> > +                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
> > +{
> > +    VFIOPCIDevice *vdev = pv;
> > +    uint32_t dev_state = vdev->migration->device_state;
> > +
> > +    if (!running) {
> > +        dev_state |= VFIO_DEVICE_STATE_STOP;
> > +    } else {
> > +        dev_state &= ~VFIO_DEVICE_STATE_STOP;
> > +    }
> > +
> > +    vfio_set_device_state(vdev, dev_state);
> > +}
> > +
> > +static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> > +                                   uint64_t max_size,
> > +                                   uint64_t *res_precopy_only,
> > +                                   uint64_t *res_compatible,
> > +                                   uint64_t *res_post_copy_only)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +
> > +    if (!vfio_device_data_cap_device_memory(vdev)) {
> > +        return;
> > +    }
> > +
> > +    return;
> > +}
> > +
> > +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +
> > +    if (!vfio_device_data_cap_device_memory(vdev)) {
> > +        return 0;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > +    bool msi_64bit;
> > +
> > +    /* restore PCI BAR configuration */
> > +    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +            ctl & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        bar_cfg = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> > +    }
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> > +
> > +    /* restore msi configuration */
> > +    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> > +
> > +    vfio_pci_write_config(&vdev->pdev,
> > +            pdev->msi_cap + PCI_MSI_FLAGS,
> > +            ctl & ~PCI_MSI_FLAGS_ENABLE, 2);
> > +
> > +    msi_lo = qemu_get_be32(f);
> > +    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> > +
> > +    if (msi_64bit) {
> > +        msi_hi = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                msi_hi, 4);
> > +    }
> > +    msi_data = qemu_get_be32(f);
> > +    vfio_pci_write_config(pdev,
> > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +            msi_data, 2);
> > +
> > +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +            ctl | PCI_MSI_FLAGS_ENABLE, 2);
> 
> It would probably be best to use a VMStateDescription and the macros
> for this if possible; I bet you'll want to add more fields in the future
> for example.
Yes, it is also a good idea to use VMStateDescription for the general PCI
config data, which is saved once in the stop-and-copy phase.
But it's a little hard to maintain two types of interfaces.
Maybe we could follow the approach VMStateDescription takes, i.e. use a
field list and macros?
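For reference, the VMStateDescription route would look roughly like the sketch below. This is a hypothetical fragment only: the field names (bar_cfg, msi_lo, ...) are invented, the struct would need matching members, and it depends on QEMU's migration/vmstate.h, so it is not compilable on its own:

```c
/* Hypothetical sketch -- invented field names, requires QEMU headers. */
static const VMStateDescription vmstate_vfio_pci_config = {
    .name = "vfio-pci/config",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32_ARRAY(bar_cfg, VFIOPCIDevice, PCI_ROM_SLOT),
        VMSTATE_UINT32(msi_lo, VFIOPCIDevice),
        VMSTATE_UINT32(msi_hi, VFIOPCIDevice),
        VMSTATE_UINT32(msi_data, VFIOPCIDevice),
        VMSTATE_END_OF_LIST()
    }
};
```

A field list like this would make it easy to add fields later without hand-rolling the qemu_get/qemu_put pairs.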

> Also what happens if the data read from the migration stream is bad or
> doesn't agree with this device's hardware? How does this fail?
>
Right, errors in the migration stream need to be checked. I'll add the
error check.
As for the hardware-incompatibility issue, it seems vfio_pci_write_config
does not return an error even if the device driver fails the write.
That probably needs to be addressed somewhere else, e.g. by checking
compatibility before launching the migration.
> > +}
> > +
> > +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    int flag;
> > +    uint64_t len;
> > +    int ret = 0;
> > +
> > +    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    do {
> > +        flag = qemu_get_byte(f);
> > +
> > +        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
> > +        case VFIO_SAVE_FLAG_SETUP:
> > +            break;
> > +        case VFIO_SAVE_FLAG_PCI:
> > +            vfio_pci_load_config(vdev, f);
> > +            break;
> > +        case VFIO_SAVE_FLAG_DEVCONFIG:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_config(vdev, f, len);
> > +            break;
> > +        default:
> > +            ret = -EINVAL;
> > +        }
> > +    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
> > +
> > +    return ret;
> > +}
> > +
> > +static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > +    bool msi_64bit;
> > +
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > +        qemu_put_be32(f, bar_cfg);
> > +    }
> > +
> > +    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
> > +
> > +    msi_lo = pci_default_read_config(pdev,
> > +            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > +    qemu_put_be32(f, msi_lo);
> > +
> > +    if (msi_64bit) {
> > +        msi_hi = pci_default_read_config(pdev,
> > +                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                4);
> > +        qemu_put_be32(f, msi_hi);
> > +    }
> > +
> > +    msi_data = pci_default_read_config(pdev,
> > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +            2);
> > +    qemu_put_be32(f, msi_data);
> > +
> > +}
> > +
> > +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    int rc = 0;
> > +
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> > +    vfio_pci_save_config(vdev, f);
> > +
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
> > +    rc += vfio_get_device_config_size(vdev);
> > +    rc += vfio_save_data_device_config(vdev, f);
> > +
> > +    return rc;
> > +}
> > +
> > +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> > +            VFIO_DEVICE_STATE_LOGGING);
> > +    return 0;
> > +}
> > +
> > +static int vfio_load_setup(QEMUFile *f, void *opaque)
> > +{
> > +    return 0;
> > +}
> > +
> > +static void vfio_save_cleanup(void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    uint32_t dev_state = vdev->migration->device_state;
> > +
> > +    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
> > +
> > +    vfio_set_device_state(vdev, dev_state);
> > +}
> > +
> > +static SaveVMHandlers savevm_vfio_handlers = {
> > +    .save_setup = vfio_save_setup,
> > +    .save_live_pending = vfio_save_live_pending,
> > +    .save_live_iterate = vfio_save_iterate,
> > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> > +    .save_cleanup = vfio_save_cleanup,
> > +    .load_setup = vfio_load_setup,
> > +    .load_state = vfio_load_state,
> > +};
> > +
> > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> > +{
> > +    int ret;
> > +    Error *local_err = NULL;
> > +    vdev->migration = g_new0(VFIOMigration, 1);
> > +
> > +    if (vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
> > +              "device-state-ctl")) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_check_devstate_version(vdev)) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_get_device_data_caps(vdev)) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
> > +              "device-state-data-device-config")) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        error_report("No support of data cap device memory yet");
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_data_cap_system_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
> > +              "device-state-data-dirtybitmap")) {
> > +        goto error;
> > +    }
> > +
> > +    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> > +
> > +    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
> > +            VFIO_DEVICE_STATE_INTERFACE_VERSION,
> > +            &savevm_vfio_handlers,
> > +            vdev);
> > +
> > +    vdev->migration->vm_state =
> > +        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
> > +
> > +    return 0;
> > +error:
> > +    error_setg(&vdev->migration_blocker,
> > +            "VFIO device doesn't support migration");
> > +    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
> > +    if (local_err) {
> > +        error_propagate(errp, local_err);
> > +        error_free(vdev->migration_blocker);
> > +    }
> > +
> > +    g_free(vdev->migration);
> > +    vdev->migration = NULL;
> > +
> > +    return ret;
> > +}
> > +
> > +void vfio_migration_finalize(VFIOPCIDevice *vdev)
> > +{
> > +    if (vdev->migration) {
> > +        int i;
> > +        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
> > +        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
> > +        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
> > +            vfio_region_finalize(&vdev->migration->region[i]);
> > +        }
> > +        g_free(vdev->migration);
> > +        vdev->migration = NULL;
> > +    } else if (vdev->migration_blocker) {
> > +        migrate_del_blocker(vdev->migration_blocker);
> > +        error_free(vdev->migration_blocker);
> > +    }
> > +}
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c0cb1ec..b8e006b 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -37,7 +37,6 @@
> >  
> >  #define MSIX_CAP_LENGTH 12
> >  
> > -#define TYPE_VFIO_PCI "vfio-pci"
> >  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> >  
> >  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index b1ae4c0..4b7b1bb 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -19,6 +19,7 @@
> >  #include "qemu/event_notifier.h"
> >  #include "qemu/queue.h"
> >  #include "qemu/timer.h"
> > +#include "sysemu/sysemu.h"
> >  
> >  #define PCI_ANY_ID (~0)
> >  
> > @@ -56,6 +57,21 @@ typedef struct VFIOBAR {
> >      QLIST_HEAD(, VFIOQuirk) quirks;
> >  } VFIOBAR;
> >  
> > +enum {
> > +    VFIO_DEVSTATE_REGION_CTL = 0,
> > +    VFIO_DEVSTATE_REGION_DATA_CONFIG,
> > +    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
> > +    VFIO_DEVSTATE_REGION_DATA_BITMAP,
> > +    VFIO_DEVSTATE_REGION_NUM,
> > +};
> > +typedef struct VFIOMigration {
> > +    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
> > +    uint32_t data_caps;
> > +    uint32_t device_state;
> > +    uint64_t devconfig_size;
> > +    VMChangeStateEntry *vm_state;
> > +} VFIOMigration;
> > +
> >  typedef struct VFIOVGARegion {
> >      MemoryRegion mem;
> >      off_t offset;
> > @@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
> >      VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
> >      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
> >      void *igd_opregion;
> > +    VFIOMigration *migration;
> > +    Error *migration_blocker;
> >      PCIHostDeviceAddress host;
> >      EventNotifier err_notifier;
> >      EventNotifier req_notifier;
> > @@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> >  void vfio_display_reset(VFIOPCIDevice *vdev);
> >  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> >  void vfio_display_finalize(VFIOPCIDevice *vdev);
> > -
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
> > +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> > +         uint64_t start_addr, uint64_t page_nr);
> > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
> > +void vfio_migration_finalize(VFIOPCIDevice *vdev);
> >  #endif /* HW_VFIO_VFIO_PCI_H */
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index 1b434d0..ed43613 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -32,6 +32,7 @@
> >  #endif
> >  
> >  #define VFIO_MSG_PREFIX "vfio %s: "
> > +#define TYPE_VFIO_PCI "vfio-pci"
> >  
> >  enum {
> >      VFIO_DEVICE_TYPE_PCI = 0,
> > -- 
> > 2.7.4
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 5/5] vfio/migration: support device memory capability
  2019-02-19 11:25     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2019-02-20  5:17       ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20  5:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, zhi.a.wang, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck, Ken.Xue,
	jonathan.davies

On Tue, Feb 19, 2019 at 11:25:43AM +0000, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > If a device has device memory capability, save/load data from device memory
> > in pre-copy and stop-and-copy phases.
> > 
> > LOGGING state is set for device memory for dirty page logging:
> > in LOGGING state, get device memory returns whole device memory snapshot;
> > outside LOGGING state, get device memory returns dirty data since last get
> > operation.
> > 
> > Usually, device memory is very big, qemu needs to chunk it into several
> > pieces each with size of device memory region.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > ---
> >  hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  hw/vfio/pci.h       |   1 +
> >  2 files changed, 231 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 16d6395..f1e9309 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> >      return 0;
> >  }
> >  
> > +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    uint64_t len;
> > +    int sz;
> > +
> > +    sz = sizeof(len);
> > +    if (pread(vbasedev->fd, &len, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to get length of device memory");
> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = len;
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    int sz;
> > +
> > +    sz = sizeof(size);
> > +    if (pwrite(vbasedev->fd, &size, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to set length of device memory");
> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = size;
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                    uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer pos");
> > +        return -1;
> > +    }
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer action");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> > +            error_report("vfio: Failed to read device memory buffer");
> 
> That's forgotten to g_free(buf)
>
Right, I'll correct that.

> > +            return -1;
> > +        }
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, buf, len);
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, dest, len);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    uint64_t total_len = vdev->migration->devmem_size;
> > +    uint64_t pos = 0;
> > +
> > +    qemu_put_be64(f, total_len);
> > +    while (pos < total_len) {
> > +        uint64_t len = region_devmem->size;
> > +
> > +        if (pos + len >= total_len) {
> > +            len = total_len - pos;
> > +        }
> > +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> > +            return -1;
> > +        }
> > +        pos += len;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device memory buffer pos");
> > +        return -1;
> > +    }
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_devmem->fd_offset) != len) {
> > +            error_report("vfio: Failed to load devie memory buffer");
> 
> Again, failed to free buf
> 
Right, I'll correct that.
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> 
> You might want to use qemu_file_get_error(f) before writing the data
> to the device, to check for the case of a read error on the migration
> stream that happened somewhere in the previous qemu_get's
>

ok.

> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set load device memory buffer action");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +
> > +}
> > +
> > +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> > +                        QEMUFile *f, uint64_t total_len)
> > +{
> > +    uint64_t pos = 0, len = 0;
> > +
> > +    vfio_set_device_memory_size(vdev, total_len);
> > +
> > +    while (pos + len < total_len) {
> > +        len = qemu_get_be64(f);
> > +        pos = qemu_get_be64(f);
> 
> Please check len/pos - always assume that the migration stream could
> be (maliciously or accidentally) corrupt.
>
ok.

> > +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
> >  static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> >          uint64_t start_addr, uint64_t page_nr)
> >  {
> > @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> >          return;
> >      }
> >  
> > +    /* get dirty data size of device memory */
> > +    vfio_get_device_memory_size(vdev);
> > +
> > +    *res_precopy_only += vdev->migration->devmem_size;
> >      return;
> >  }
> >  
> > @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >          return 0;
> >      }
> >  
> > -    return 0;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +    /* get dirty data of device memory */
> > +    return vfio_save_data_device_memory(vdev, f);
> >  }
> >  
> >  static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >              len = qemu_get_be64(f);
> >              vfio_load_data_device_config(vdev, f, len);
> >              break;
> > +        case VFIO_SAVE_FLAG_DEVMEMORY:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_memory(vdev, f, len);
> > +            break;
> >          default:
> >              ret = -EINVAL;
> >          }
> > @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >      VFIOPCIDevice *vdev = opaque;
> >      int rc = 0;
> >  
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> > +        /* get dirty data of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    }
> > +
> >      qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> >      vfio_pci_save_config(vdev, f);
> >  
> > @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >  
> >  static int vfio_save_setup(QEMUFile *f, void *opaque)
> >  {
> > +    int rc = 0;
> >      VFIOPCIDevice *vdev = opaque;
> > -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +        /* get whole snapshot of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    } else {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +    }
> >  
> >      vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> >              VFIO_DEVICE_STATE_LOGGING);
> > -    return 0;
> > +    return rc;
> >  }
> >  
> >  static int vfio_load_setup(QEMUFile *f, void *opaque)
> > @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> >          goto error;
> >      }
> >  
> > -    if (vfio_device_data_cap_device_memory(vdev)) {
> > -        error_report("No support of data cap device memory yet");
> > +    if (vfio_device_data_cap_device_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> > +              "device-state-data-device-memory")) {
> >          goto error;
> >      }
> >  
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index 4b7b1bb..a2cc64b 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
> >      uint32_t data_caps;
> >      uint32_t device_state;
> >      uint64_t devconfig_size;
> > +    uint64_t devmem_size;
> >      VMChangeStateEntry *vm_state;
> >  } VFIOMigration;
> >  
> > -- 
> > 2.7.4
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 5/5] vfio/migration: support device memory capability
@ 2019-02-20  5:17       ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20  5:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck,
	zhi.a.wang, jonathan.davies

On Tue, Feb 19, 2019 at 11:25:43AM +0000, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > If a device has device memory capability, save/load data from device memory
> > in pre-copy and stop-and-copy phases.
> > 
> > LOGGING state is set for device memory for dirty page logging:
> > in LOGGING state, get device memory returns whole device memory snapshot;
> > outside LOGGING state, get device memory returns dirty data since last get
> > operation.
> > 
> > Usually, device memory is very big, qemu needs to chunk it into several
> > pieces each with size of device memory region.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > ---
> >  hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  hw/vfio/pci.h       |   1 +
> >  2 files changed, 231 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 16d6395..f1e9309 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> >      return 0;
> >  }
> >  
> > +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    uint64_t len;
> > +    int sz;
> > +
> > +    sz = sizeof(len);
> > +    if (pread(vbasedev->fd, &len, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to get length of device memory");
> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = len;
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    int sz;
> > +
> > +    sz = sizeof(size);
> > +    if (pwrite(vbasedev->fd, &size, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to set length of device memory");
> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = size;
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                    uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer pos");
> > +        return -1;
> > +    }
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer action");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> > +            error_report("vfio: error load device memory buffer");
> 
> That's forgotten to g_free(buf)
>
Right, I'll correct that.

> > +            return -1;
> > +        }
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, buf, len);
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, dest, len);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    uint64_t total_len = vdev->migration->devmem_size;
> > +    uint64_t pos = 0;
> > +
> > +    qemu_put_be64(f, total_len);
> > +    while (pos < total_len) {
> > +        uint64_t len = region_devmem->size;
> > +
> > +        if (pos + len >= total_len) {
> > +            len = total_len - pos;
> > +        }
> > +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> > +            return -1;
> > +        }
> > +        pos += len;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device memory buffer pos");
> > +        return -1;
> > +    }
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_devmem->fd_offset) != len) {
> > +            error_report("vfio: Failed to load devie memory buffer");
> 
> Again, failed to free buf
> 
Right, I'll correct that.
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> 
> You might want to use qemu_file_get_error(f)  before writing the data
> to the device, to check for the case of a read error on the migration
> stream that happened somewhere in the pevious qemu_get's
>

ok.

> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set load device memory buffer action");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +
> > +}
> > +
> > +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> > +                        QEMUFile *f, uint64_t total_len)
> > +{
> > +    uint64_t pos = 0, len = 0;
> > +
> > +    vfio_set_device_memory_size(vdev, total_len);
> > +
> > +    while (pos + len < total_len) {
> > +        len = qemu_get_be64(f);
> > +        pos = qemu_get_be64(f);
> 
> Please check len/pos - always assume that the migration stream could
> be (maliciously or accidentally) corrupt.
>
ok.

> > +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
> >  static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> >          uint64_t start_addr, uint64_t page_nr)
> >  {
> > @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> >          return;
> >      }
> >  
> > +    /* get dirty data size of device memory */
> > +    vfio_get_device_memory_size(vdev);
> > +
> > +    *res_precopy_only += vdev->migration->devmem_size;
> >      return;
> >  }
> >  
> > @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >          return 0;
> >      }
> >  
> > -    return 0;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +    /* get dirty data of device memory */
> > +    return vfio_save_data_device_memory(vdev, f);
> >  }
> >  
> >  static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >              len = qemu_get_be64(f);
> >              vfio_load_data_device_config(vdev, f, len);
> >              break;
> > +        case VFIO_SAVE_FLAG_DEVMEMORY:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_memory(vdev, f, len);
> > +            break;
> >          default:
> >              ret = -EINVAL;
> >          }
> > @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >      VFIOPCIDevice *vdev = opaque;
> >      int rc = 0;
> >  
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> > +        /* get dirty data of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    }
> > +
> >      qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> >      vfio_pci_save_config(vdev, f);
> >  
> > @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >  
> >  static int vfio_save_setup(QEMUFile *f, void *opaque)
> >  {
> > +    int rc = 0;
> >      VFIOPCIDevice *vdev = opaque;
> > -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +        /* get whole snapshot of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    } else {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +    }
> >  
> >      vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> >              VFIO_DEVICE_STATE_LOGGING);
> > -    return 0;
> > +    return rc;
> >  }
> >  
> >  static int vfio_load_setup(QEMUFile *f, void *opaque)
> > @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> >          goto error;
> >      }
> >  
> > -    if (vfio_device_data_cap_device_memory(vdev)) {
> > -        error_report("No suppport of data cap device memory Yet");
> > +    if (vfio_device_data_cap_device_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> > +              "device-state-data-device-memory")) {
> >          goto error;
> >      }
> >  
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index 4b7b1bb..a2cc64b 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
> >      uint32_t data_caps;
> >      uint32_t device_state;
> >      uint64_t devconfig_size;
> > +    uint64_t devmem_size;
> >      VMChangeStateEntry *vm_state;
> >  } VFIOMigration;
> >  
> > -- 
> > 2.7.4
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev


* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-19 11:32   ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2019-02-20  5:28     ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20  5:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, zhi.a.wang, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck, Ken.Xue,
	jonathan.davies

On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> 
> Hi,
>   I've sent minor comments to later patches; but some minor general
> comments:
> 
>   a) Never trust the incoming migrations stream - it might be corrupt,
>     so check when you can.
hi Dave
Thanks for this suggestion. I'll add more checks for migration streams.


>   b) How do we detect if we're migrating from/to the wrong device or
> version of device?  Or say to a device with older firmware or perhaps
> a device that has less device memory ?
Actually it's still an open question for VFIO migration. We need to think
about whether it's better to check that in libvirt or in qemu (e.g. a
device magic along with a version?).
This patchset is intended to settle down the main device state interfaces
for VFIO migration, so that we can build on them and improve from there.


>   c) Consider using the trace_ mechanism - it's really useful to
> add to loops writing/reading data so that you can see when it fails.
> 
> Dave
>
Got it. many thanks~~


> (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> 'migrtion'

sorry :)
> 
> > Device Data
> > -----------
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> >         Every device is supposed to possess device config data.
> >         Usually device config's size is small (no bigger than 10M), and it
> >         needs to be loaded in certain strict order.
> >         Therefore, device config only needs to be saved/loaded in
> >         stop-and-copy phase.
> >         The data of device config is held in device config region.
> >         Size of device config data is smaller than or equal to that of
> >         device config region.
> > 
> > Device Memory: device's internal memory, standalone and outside system
> >         memory. It is usually very big.
> >         This kind of data needs to be saved / loaded in pre-copy and
> >         stop-and-copy phase.
> >         The data of device memory is held in device memory region.
> >         Size of devie memory is usually larger than that of device
> >         memory region. qemu needs to save/load it in chunks of size of
> >         device memory region.
> >         Not all devices have device memory. IGD, for example, only uses
> >         system memory.
> > 
> > System memory dirty pages: If a device produces dirty pages in system
> >         memory, it is able to get dirty bitmap for certain range of system
> >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> >         callback, dirty pages in system memory will be save/loaded by ram's
> >         live migration code.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> >         If system memory range is larger than that dirty bitmap region can
> >         hold, qemu will cut it into several chunks and get dirty bitmap in
> >         succession.
> > 
> > 
> > Device State Regions
> > --------------------
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support device state management.
> > 
> > So, there are up to four regions in total.
> > One control region: mandatory.
> >         Get access via read/write system call.
> >         Its layout is defined in struct vfio_device_state_ctl
> > Three data regions: mmaped into qemu.
> >         device config region: mandatory, holding data of device config
> >         device memory region: optional, holding data of device memory
> >         dirty bitmap region: optional, holding bitmap of system memory
> >                             dirty pages
> > 
> > (The reason why four separate regions are defined is that the unit of mmap
> > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > control and three mmaped regions for data seems better than one big region
> > padded and sparse mmaped).
> > 
> > 
> > kernel device state interface [1]
> > --------------------------------------
> > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > 
> > #define VFIO_DEVICE_STATE_RUNNING 0 
> > #define VFIO_DEVICE_STATE_STOP 1
> > #define VFIO_DEVICE_STATE_LOGGING 2
> > 
> > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > 
> > struct vfio_device_state_ctl {
> > 	__u32 version;		  /* ro */
> > 	__u32 device_state;       /* VFIO device state, wo */
> > 	__u32 caps;		 /* ro */
> >         struct {
> > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;    /*rw*/
> > 	} device_config;
> > 	struct {
> > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;     /* rw */  
> >                 __u64 pos; /*the offset in total buffer of device memory*/
> > 	} device_memory;
> > 	struct {
> > 		__u64 start_addr; /* wo */
> > 		__u64 page_nr;   /* wo */
> > 	} system_memory;
> > };
> > 
> > Device States
> > ------------- 
> > After migration is initialized, it will set device state via writing to
> > device_state field of control region.
> > 
> > Four states are defined for a VFIO device:
> >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > 
> > RUNNING: In this state, a VFIO device is in active state ready to receive
> >         commands from device driver.
> >         It is the default state that a VFIO device enters initially.
> > 
> > STOP:  In this state, a VFIO device is deactivated to interact with
> >        device driver.
> > 
> > LOGGING: a special state that CANNOT exist independently. It must be
> >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> >        STOP & LOGGING).
> >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> >        driver can start dirty data logging for device memory and system
> >        memory.
> >        LOGGING only impacts device/system memory. They return whole
> >        snapshot outside LOGGING and dirty data since last get operation
> >        inside LOGGING.
> >        Device config should be always accessible and return whole config
> >        snapshot regardless of LOGGING state.
> >        
> > Note:
> > The reason why RUNNING is the default state is that device's active state
> > must not depend on device state interface.
> > It is possible that region vfio_device_state_ctl fails to get registered.
> > In that condition, a device needs to be in active state by default.
> > 
> > Get Version & Get Caps
> > ----------------------
> > On migration init phase, qemu will probe the existence of device state
> > regions of vendor driver, then get version of the device state interface
> > from the r/w control region.
> > 
> > Then it will probe VFIO device's data capability by reading caps field of
> > control region.
> >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> >         device memory in pre-copy and stop-and-copy phase. The data of
> >         device memory is held in device memory region.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> >         produced by VFIO device during pre-copy and stop-and-copy phase.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> > 
> > If qemu fails to find the two mandatory regions or the optional data
> > regions corresponding to the data caps, or if the version mismatches,
> > it will set up a migration blocker and disable live migration for the
> > VFIO device.
> > 
> > 
> > Flows to call device state interface for VFIO live migration
> > ------------------------------------------------------------
> > 
> > Live migration save path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_SAVE_SETUP
> >  |
> >  .save_setup callback -->
> >  get device memory size (whole snapshot size)
> >  get device memory buffer (whole snapshot data)
> >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> >  .save_live_pending callback --> get device memory size (dirty data)
> >  .save_live_iteration callback --> get device memory buffer (dirty data)
> >  .log_sync callback --> get system memory dirty bitmap
> >  |
> > (vcpu stops) --> set device state -->
> >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> >  |
> > .save_live_complete_precopy callback -->
> >  get device memory size (dirty data)
> >  get device memory buffer (dirty data)
> >  get device config size (whole snapshot size)
> >  get device config buffer (whole snapshot data)
> >  |
> > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > MIGRATION_STATUS_COMPLETED
> > 
> > MIGRATION_STATUS_CANCELLED or
> > MIGRATION_STATUS_FAILED
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > 
> > 
> > Live migration load path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> > .load state callback -->
> >  set device memory size, set device memory buffer, set device config size,
> >  set device config buffer
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_COMPLETED
> > 
> > 
> > 
> > In source VM side,
> > In precopy phase,
> > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > qemu will first get whole snapshot of device memory in .save_setup
> > callback, and then it will get total size of dirty data in device memory in
> > .save_live_pending callback by reading device_memory.size field of control
> > region.
> > Then in .save_live_iteration callback, it will get buffer of device memory's
> > dirty data chunk by chunk from device memory region by writing pos &
> > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > control region. (size of each chunk is the size of device memory data
> > region).
> > .save_live_pending and .save_live_iteration may be called several times in
> > precopy phase to get dirty data in device memory.
> > 
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > vendor driver's device state interface to get data from devcie memory.
> > 
> > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > region by writing system memory's start address, page count and action 
> > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > "system_memory.action" fields of control region.
> > If page count passed in .log_sync callback is larger than the bitmap size
> > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > vendor driver's get system memory dirty bitmap interface.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> > returns without call to vendor driver.
> > 
> > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > in save_live_complete_precopy callback,
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > get device memory size and get device memory buffer will be called again.
> > After that,
> > device config data is read from device config region by reading
> > device_config.size of control region and writing action (GET_BUFFER) to
> > device_config.action of control region.
> > Then after migration completes, in cleanup handler, LOGGING state will be
> > cleared (i.e. device state is set to STOP).
> > Clearing LOGGING state in cleanup handler is in consideration of the case
> > of "migration failed" and "migration cancelled". They can also leverage
> > the cleanup handler to unset LOGGING state.
> > 
> > 
> > References
> > ----------
> > 1. kernel side implementation of Device state interfaces:
> > https://patchwork.freedesktop.org/series/56876/
> > 
> > 
> > Yan Zhao (5):
> >   vfio/migration: define kernel interfaces
> >   vfio/migration: support device of device config capability
> >   vfio/migration: tracking of dirty page in system memory
> >   vfio/migration: turn on migration
> >   vfio/migration: support device memory capability
> > 
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/common.c              |  26 ++
> >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |  10 +-
> >  hw/vfio/pci.h                 |  26 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> >  7 files changed, 1174 insertions(+), 9 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > -- 
> > 2.7.4
> > 

* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-02-20  5:28     ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20  5:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck,
	zhi.a.wang, jonathan.davies

On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> 
> Hi,
>   I've sent minor comments to later patches; but some minor general
> comments:
> 
>   a) Never trust the incoming migrations stream - it might be corrupt,
>     so check when you can.
hi Dave
Thanks for this suggestion. I'll add more checks for migration streams.


>   b) How do we detect if we're migrating from/to the wrong device or
> version of device?  Or say to a device with older firmware or perhaps
> a device that has less device memory ?
Actually it's still an open for VFIO migration. Need to think about
whether it's better to check that in libvirt or qemu (like a device magic
along with verion ?).
This patchset is intended to settle down the main device state interfaces
for VFIO migration. So that we can work on that and improve it.


>   c) Consider using the trace_ mechanism - it's really useful to
> add to loops writing/reading data so that you can see when it fails.
> 
> Dave
>
Got it. many thanks~~


> (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> 'migrtion'

sorry :)
> 
> > Device Data
> > -----------
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> >         Every device is supposed to possess device config data.
> >     	Usually device config's size is small (no big than 10M), and it
> >         needs to be loaded in certain strict order.
> >         Therefore, device config only needs to be saved/loaded in
> >         stop-and-copy phase.
> >         The data of device config is held in device config region.
> >         Size of device config data is smaller than or equal to that of
> >         device config region.
> > 
> > Device Memory: device's internal memory, standalone and outside system
> >         memory. It is usually very big.
> >         This kind of data needs to be saved / loaded in pre-copy and
> >         stop-and-copy phase.
> >         The data of device memory is held in device memory region.
> >         Size of devie memory is usually larger than that of device
> >         memory region. qemu needs to save/load it in chunks of size of
> >         device memory region.
> >         Not all device has device memory. Like IGD only uses system memory.
> > 
> > System memory dirty pages: If a device produces dirty pages in system
> >         memory, it is able to get dirty bitmap for certain range of system
> >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> >         callback, dirty pages in system memory will be save/loaded by ram's
> >         live migration code.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> >         If system memory range is larger than that dirty bitmap region can
> >         hold, qemu will cut it into several chunks and get dirty bitmap in
> >         succession.
> > 
> > 
> > Device State Regions
> > --------------------
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support device state management.
> > 
> > So, there are up to four regions in total.
> > One control region: mandatory.
> >         Get access via read/write system call.
> >         Its layout is defined in struct vfio_device_state_ctl
> > Three data regions: mmaped into qemu.
> >         device config region: mandatory, holding data of device config
> >         device memory region: optional, holding data of device memory
> >         dirty bitmap region: optional, holding bitmap of system memory
> >                             dirty pages
> > 
> > (The reason why four seperate regions are defined is that the unit of mmap
> > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > control and three mmaped regions for data seems better than one big region
> > padded and sparse mmaped).
> > 
> > 
> > kernel device state interface [1]
> > --------------------------------------
> > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > 
> > #define VFIO_DEVICE_STATE_RUNNING 0 
> > #define VFIO_DEVICE_STATE_STOP 1
> > #define VFIO_DEVICE_STATE_LOGGING 2
> > 
> > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > 
> > struct vfio_device_state_ctl {
> > 	__u32 version;		  /* ro */
> > 	__u32 device_state;       /* VFIO device state, wo */
> > 	__u32 caps;		 /* ro */
> >         struct {
> > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;    /*rw*/
> > 	} device_config;
> > 	struct {
> > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;     /* rw */  
> >                 __u64 pos; /*the offset in total buffer of device memory*/
> > 	} device_memory;
> > 	struct {
> > 		__u64 start_addr; /* wo */
> > 		__u64 page_nr;   /* wo */
> > 	} system_memory;
> > };
> > 
> > Device States
> > ------------- 
> > After migration is initialized, QEMU sets the device state by writing to
> > the device_state field of the control region.
> > 
> > Four states are defined for a VFIO device:
> >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > 
> > RUNNING: In this state, a VFIO device is active and ready to receive
> >         commands from the device driver.
> >         It is the default state that a VFIO device enters initially.
> > 
> > STOP:  In this state, a VFIO device is deactivated and stops interacting
> >        with the device driver.
> > 
> > LOGGING: a special state that CANNOT exist independently. It must be
> >        set alongside state RUNNING or STOP (i.e. RUNNING & LOGGING,
> >        STOP & LOGGING).
> >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> >        driver can start dirty data logging for device memory and system
> >        memory.
> >        LOGGING only affects device memory and system memory. Outside
> >        LOGGING they return a whole snapshot; inside LOGGING they return
> >        only the data dirtied since the last get operation.
> >        Device config should always be accessible and return a whole
> >        config snapshot regardless of the LOGGING state.
> >        
> > Note:
> > The reason why RUNNING is the default state is that a device's active state
> > must not depend on the device state interface.
> > It is possible that the vfio_device_state_ctl region fails to get
> > registered. In that case, a device needs to be in the active state by
> > default.
> > 
> > Get Version & Get Caps
> > ----------------------
> > In the migration init phase, QEMU probes for the existence of the vendor
> > driver's device state regions, then reads the version of the device state
> > interface from the r/w control region.
> > 
> > Then it probes the VFIO device's data capabilities by reading the caps
> > field of the control region.
> >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> >         device memory in the pre-copy and stop-and-copy phases. The data
> >         of device memory is held in the device memory region.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> >         produced by the VFIO device during the pre-copy and stop-and-copy
> >         phases. The dirty bitmap of system memory is held in the dirty
> >         bitmap region.
> > 
> > If QEMU fails to find the two mandatory regions or the optional data
> > regions corresponding to the reported data caps, or if the version
> > mismatches, it sets up a migration blocker and disables live migration
> > for the VFIO device.
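The probe decision described above can be sketched as a pure function (the version and cap values are the ones defined in the interface; treating the caps word as a bitmask, and the `migration_plan` type itself, are assumptions of this sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY  1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY  2

struct migration_plan {
	bool blocked;            /* set a migration blocker            */
	bool save_device_memory; /* save/load in pre-copy + stop-copy  */
	bool query_dirty_bitmap; /* .log_sync queries the device       */
};

/* Decide what the migration code will do, from the version and caps
 * fields read out of the control region. */
static struct migration_plan plan_migration(uint32_t version, uint32_t caps)
{
	struct migration_plan p = {0};

	if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
		p.blocked = true; /* version mismatch: block live migration */
		return p;
	}
	p.save_device_memory = caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY;
	p.query_dirty_bitmap = caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY;
	return p;
}
```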
> > 
> > 
> > Flows to call device state interface for VFIO live migration
> > ------------------------------------------------------------
> > 
> > Live migration save path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_SAVE_SETUP
> >  |
> >  .save_setup callback -->
> >  get device memory size (whole snapshot size)
> >  get device memory buffer (whole snapshot data)
> >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> >  .save_live_pending callback --> get device memory size (dirty data)
> >  .save_live_iteration callback --> get device memory buffer (dirty data)
> >  .log_sync callback --> get system memory dirty bitmap
> >  |
> > (vcpu stops) --> set device state -->
> >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> >  |
> > .save_live_complete_precopy callback -->
> >  get device memory size (dirty data)
> >  get device memory buffer (dirty data)
> >  get device config size (whole snapshot size)
> >  get device config buffer (whole snapshot data)
> >  |
> > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > MIGRATION_STATUS_COMPLETED
> > 
> > MIGRATION_STATUS_CANCELLED or
> > MIGRATION_STATUS_FAILED
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > 
> > 
> > Live migration load path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> > .load state callback -->
> >  set device memory size, set device memory buffer, set device config size,
> >  set device config buffer
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_COMPLETED
> > 
> > 
> > 
> > On the source VM side, in the pre-copy phase,
> > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > QEMU first gets a whole snapshot of device memory in the .save_setup
> > callback, and then gets the total size of dirty data in device memory in
> > the .save_live_pending callback by reading the device_memory.size field of
> > the control region.
> > Then, in the .save_live_iteration callback, it gets the device memory's
> > dirty data chunk by chunk from the device memory region by writing pos &
> > action (GET_BUFFER) to the device_memory.pos & device_memory.action fields
> > of the control region. (The size of each chunk is the size of the device
> > memory data region.)
> > .save_live_pending and .save_live_iteration may be called several times in
> > the pre-copy phase to get dirty data in device memory.
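The chunking rule above (pos must be 0 or a multiple of the device memory region length) amounts to a simple loop. As a sketch, with the total size and region length as plain parameters and the helper names made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Number of GET_BUFFER iterations needed to move `total` bytes of device
 * memory data through a data region of `region_len` bytes: each chunk is
 * read at pos = N * region_len, and the final chunk may be short. */
static uint64_t device_memory_chunks(uint64_t total, uint64_t region_len)
{
	return (total + region_len - 1) / region_len;
}

/* Length of the chunk at offset `pos` (pos must be a multiple of
 * region_len and smaller than total). */
static uint64_t chunk_len(uint64_t total, uint64_t region_len, uint64_t pos)
{
	uint64_t remaining = total - pos;

	return remaining < region_len ? remaining : region_len;
}
```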
> > 
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in the pre-copy
> > phase like .save_setup, .save_live_pending, and .save_live_iteration will
> > not call the vendor driver's device state interface to get data from
> > device memory.
> > 
> > In the pre-copy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY
> > on, the .log_sync callback gets the system memory dirty bitmap from the
> > dirty bitmap region by writing system memory's start address, page count,
> > and action (GET_BITMAP) to the "system_memory.start_addr",
> > "system_memory.page_nr", and "system_memory.action" fields of the control
> > region.
> > If the page count passed into the .log_sync callback is larger than the
> > bitmap size the dirty bitmap region supports, QEMU cuts it into chunks
> > and calls the vendor driver's get-system-memory-dirty-bitmap interface
> > once per chunk.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> > returns without calling the vendor driver.
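The page-count splitting in .log_sync can be sketched the same way (the one-bit-per-page encoding of the dirty bitmap region, and the helper names, are assumptions of this sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Max pages one GET_BITMAP query can cover, assuming the dirty bitmap
 * region encodes one bit per page. */
static uint64_t pages_per_query(uint64_t bitmap_region_bytes)
{
	return bitmap_region_bytes * 8;
}

/* Number of GET_BITMAP queries .log_sync issues for `page_nr` pages,
 * cutting the range into region-sized chunks. */
static uint64_t bitmap_queries(uint64_t page_nr, uint64_t bitmap_region_bytes)
{
	uint64_t per = pages_per_query(bitmap_region_bytes);

	return (page_nr + per - 1) / per;
}
```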
> > 
> > In the stop-and-copy phase, the device state is set to STOP & LOGGING
> > first.
> > In the .save_live_complete_precopy callback,
> > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > get device memory size and get device memory buffer are called again.
> > After that,
> > device config data is read from the device config region by reading
> > device_config.size of the control region and writing action (GET_BUFFER)
> > to device_config.action of the control region.
> > Then, after migration completes, in the cleanup handler, the LOGGING state
> > is cleared (i.e. the device state is set to STOP).
> > Clearing the LOGGING state in the cleanup handler also covers the
> > "migration failed" and "migration cancelled" cases: they can likewise
> > leverage the cleanup handler to unset the LOGGING state.
> > 
> > 
> > References
> > ----------
> > 1. kernel side implementation of Device state interfaces:
> > https://patchwork.freedesktop.org/series/56876/
> > 
> > 
> > Yan Zhao (5):
> >   vfio/migration: define kernel interfaces
> >   vfio/migration: support device of device config capability
> >   vfio/migration: tracking of dirty page in system memory
> >   vfio/migration: turn on migration
> >   vfio/migration: support device memory capability
> > 
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/common.c              |  26 ++
> >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |  10 +-
> >  hw/vfio/pci.h                 |  26 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> >  7 files changed, 1174 insertions(+), 9 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > -- 
> > 2.7.4
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 1/5] vfio/migration: define kernel interfaces
  2019-02-19 13:09     ` [Qemu-devel] " Cornelia Huck
@ 2019-02-20  7:36       ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20  7:36 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian, dgilbert,
	alex.williamson, intel-gvt-dev, changpeng.liu, zhi.a.wang,
	jonathan.davies

On Tue, Feb 19, 2019 at 02:09:18PM +0100, Cornelia Huck wrote:
> On Tue, 19 Feb 2019 16:52:14 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > - defined 4 device states regions: one control region and 3 data regions
> > - defined layout of control region in struct vfio_device_state_ctl
> > - defined 4 device states: running, stop, running&logging, stop&logging
> > - define 3 device data categories: device config, device memory, system
> >   memory
> > - defined 2 device data capabilities: device memory and system memory
> > - defined device state interfaces' version and 12 device state interfaces
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> > ---
> >  linux-headers/linux/vfio.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 260 insertions(+)
> 
> [commenting here for convenience; changes obviously need to be done in
> the Linux patch]
> 
yes, you can find the corresponding kernel part code at
https://patchwork.freedesktop.org/series/56876/


> > 
> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index ceb6453..a124fc1 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -303,6 +303,56 @@ struct vfio_region_info_cap_type {
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >  
> > +/* Device State region type and sub-type
> > + *
> > + * A VFIO device driver needs to register up to four device state regions in
> > + * total: two mandatory and another two optional, if it plans to support device
> > + * state management.
> 
> Suggest to rephrase:
> 
> "A VFIO device driver that plans to support device state management
> needs to register..."
>
ok :)

> > + *
> > + * 1. region CTL :
> > + *          Mandatory.
> > + *          This is a control region.
> > + *          Its layout is defined in struct vfio_device_state_ctl.
> > + *          Reading from this region can get version, capabilities and data
> > + *          size of device state interfaces.
> > + *          Writing to this region can set device state, data size and
> > + *          choose which interface to use.
> > + * 2. region DEVICE_CONFIG
> > + *          Mandatory.
> > + *          This is a data region that holds device config data.
> > + *          Device config is such kind of data like MMIOs, page tables...
> 
> "Device config is data such as..."

ok :)
> 
> > + *          Every device is supposed to possess device config data.
> > + *          Usually the size of device config data is small (no big
> 
> s/no big/no bigger/

right :)
> 
> > + *          than 10M), and it needs to be loaded in certain strict
> > + *          order.
> > + *          Therefore no dirty data logging is enabled for device
> > + *          config and it must be got/set as a whole.
> > + *          Size of device config data is smaller than or equal to that of
> > + *          device config region.
> 
> Not sure if I understand that sentence correctly... but what if a
> device has more config state than fits into this region? Is that
> supposed to be covered by the device memory region? Or is this assumed
> to be something so exotic that we don't need to plan for it?
> 
Device config data and the device config region are both provided by the
vendor driver, so the vendor driver is always able to create a device
config region large enough to hold the device config data.
So, if a device has data that is better saved after device stop and
saved/loaded in strict order, that data needs to be in the device config
region. This kind of data is supposed to be small.
If the device data can be saved/loaded several times, it can also be put
into the device memory region.


> > + *          It is able to be mmaped into user space.
> > + * 3. region DEVICE_MEMORY
> > + *          Optional.
> > + *          This is a data region that holds device memory data.
> > + *          Device memory is device's internal memory, standalone and outside
> 
> s/outside/distinct from/ ?
ok.
> 
> > + *          system memory.  It is usually very big.
> > + *          Not all device has device memory. Like IGD only uses system
> 
> s/all devices has/all devices have/
> 
> s/Like/E.g./
>
ok :)

> > + *          memory and has no device memory.
> > + *          Size of devie memory is usually larger than that of device
> 
> s/devie/device/
> 
thanks:)

> > + *          memory region. qemu needs to save/load it in chunks of size of
> > + *          device memory region.
> 
> I'd rather not explicitly mention QEMU in this header. Maybe
> "Userspace"?
>
ok.

> > + *          It is able to be mmaped into user space.
> > + * 4. region DIRTY_BITMAP
> > + *          Optional.
> > + *          This is a data region that holds bitmap of dirty pages in system
> > + *          memory that a VFIO devices produces.
> > + *          It is able to be mmaped into user space.
> > + */
> > +#define VFIO_REGION_TYPE_DEVICE_STATE           (1 << 1)
> 
> Can you make this an explicit number instead?
> 
> (FWIW, I plan to add a CCW region as type 2, whatever comes first.)
ok :)

> 
> > +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL       (1)
> > +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG      (2)
> > +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY      (3)
> > +#define VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP (4)
> > +
> >  /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >   * which allows direct access to non-MSIX registers which happened to be within
> > @@ -816,6 +866,216 @@ struct vfio_iommu_spapr_tce_remove {
> >  };
> >  #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
> >  
> > +/* version number of the device state interface */
> > +#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> 
> Hm. Is this supposed to be backwards-compatible, should we need to bump
> this?
>
Currently it is not backwards-compatible. We can discuss that.

> > +
> > +/*
> > + * For devices that have devcie memory, it is required to expose
> 
> s/devcie/device/
> 
> > + * DEVICE_MEMORY capability.
> > + *
> > + * For devices producing dirty pages in system memory, it is required to
> > + * expose cap SYSTEM_MEMORY in order to get dirty bitmap in certain range
> > + * of system memory.
> > + */
> > +#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > +#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > +
> > +/*
> > + * DEVICE STATES
> > + *
> > + * Four states are defined for a VFIO device:
> > + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> > + * They can be set by writing to device_state field of
> > + * vfio_device_state_ctl region.
> 
> Who controls this? Userspace?

Yes. Userspace notifies vendor driver to do the state switching.

> > + *
> > + * RUNNING: In this state, a VFIO device is in active state ready to
> > + * receive commands from device driver.
> > + * It is the default state that a VFIO device enters initially.
> > + *
> > + * STOP: In this state, a VFIO device is deactivated to interact with
> > + * device driver.
> 
> I think 'STOPPED' would read nicer.
> 
sounds better :)

> > + *
> > + * LOGGING state is a special state that it CANNOT exist
> > + * independently.
> 
> So it's not a state, but rather a modifier?
> 
yes. Or think of LOGGING/not-LOGGING as bit 1 of a device state,
whereas RUNNING/STOPPED is bit 0 of a device state.
They have to be set/got as a whole.
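With the values defined in the patch, this "modifier bit" view works out as plain bit arithmetic (a sketch; the macro names are the ones from the header, the predicate helpers are made up):

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING 0 /* bit 0 clear: running  */
#define VFIO_DEVICE_STATE_STOP    1 /* bit 0 set:   stopped  */
#define VFIO_DEVICE_STATE_LOGGING 2 /* bit 1: logging modifier */

/* RUNNING/STOPPED is bit 0 of the state word... */
static int is_stopped(uint32_t state)
{
	return state & VFIO_DEVICE_STATE_STOP;
}

/* ...and LOGGING is bit 1, which only exists combined with bit 0. */
static int is_logging(uint32_t state)
{
	return state & VFIO_DEVICE_STATE_LOGGING;
}
```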


> > + * It must be set alongside with state RUNNING or STOP, i.e,
> > + * RUNNING & LOGGING, STOP & LOGGING.
> > + * It is used for dirty data logging both for device memory
> > + * and system memory.
> > + *
> > + * LOGGING only impacts device/system memory. In LOGGING state, get buffer
> > + * of device memory returns dirty pages since last call; outside LOGGING
> > + * state, get buffer of device memory returns whole snapshot of device
> > + * memory. system memory's dirty page is only available in LOGGING state.
> > + *
> > + * Device config should be always accessible and return whole config snapshot
> > + * regardless of LOGGING state.
> > + * */
> > +#define VFIO_DEVICE_STATE_RUNNING 0
> > +#define VFIO_DEVICE_STATE_STOP 1
> > +#define VFIO_DEVICE_STATE_LOGGING 2
> > +
> > +/* action to get data from device memory or device config
> > + * the action is write to device state's control region, and data is read
> > + * from device memory region or device config region.
> > + * Each time before read device memory region or device config region,
> > + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> > + * field in control region. That is because device memory and devie config
> > + * region is mmaped into user space. vendor driver has to be notified of
> > + * the the GET_BUFFER action in advance.
> > + */
> > +#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > +
> > +/* action to set data to device memory or device config
> > + * the action is write to device state's control region, and data is
> > + * written to device memory region or device config region.
> > + * Each time after write to device memory region or device config region,
> > + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> > + * field in control region. That is because device memory and devie config
> > + * region is mmaped into user space. vendor driver has to be notified of
> > + * the the SET_BUFFER action after data written.
> > + */
> > +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> 
> Let me describe this in my own words to make sure that I understand
> this correctly.
> 
> - The actions are set by userspace to notify the kernel that it is
>   going to get data or that it has just written data.
> - This is needed as a notification that the mmapped data should not be
>   changed resp. just has changed.
We need this notification because when userspace reads the mmapped data,
it reads through the pointer returned from mmap(). When userspace accesses
that pointer, there are no page faults or read/write system calls, so the
vendor driver does not know whether a read/write operation happens or not.
Therefore, before userspace reads through the mmap pointer, it first writes
the action field in the control region (through a write system call), and
the vendor driver will not return from the write system call until the data
is prepared.

When userspace writes through that pointer, it writes the data to the data
region first, then writes the action field in the control region (through a
write system call) to notify the vendor driver. The vendor driver returns
from the system call after it has copied the buffer completely.
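The ordering described here can be simulated in miniature: a mock "vendor driver" that only touches the data buffer when the action field is written, standing in for the write-syscall path (everything below is a toy with static buffers, not the real mmap/region plumbing):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ACTION_GET_BUFFER 1
#define ACTION_SET_BUFFER 2

/* Toy data region standing in for the mmapped DEVICE_CONFIG region. */
static char data_region[64];
/* Device-side copy held by the mock vendor driver. */
static char vendor_state[64];

/* Mock of the vendor driver's handler for a write to the action field:
 * GET_BUFFER fills the data region before "returning" from the write
 * syscall; SET_BUFFER consumes what userspace already placed there.
 * Outside these two calls, the driver never touches data_region. */
static void vendor_handle_action(uint32_t action)
{
	if (action == ACTION_GET_BUFFER)
		memcpy(data_region, vendor_state, sizeof(data_region));
	else if (action == ACTION_SET_BUFFER)
		memcpy(vendor_state, data_region, sizeof(vendor_state));
}
```

So userspace reads the mmapped data only after writing GET_BUFFER, and publishes its writes with a trailing SET_BUFFER.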
> 
> So, how does the kernel know whether the read action has finished resp.
> whether the write action has started? Even if userspace reads/writes it
> as a whole.
> 
The kernel does not touch the data region except in response to the
"action" write system call.
> > +
> > +/* layout of device state interfaces' control region
> > + * By reading to control region and reading/writing data from device config
> > + * region, device memory region, system memory regions, below interface can
> > + * be implemented:
> > + *
> > + * 1. get version
> > + *   (1) user space calls read system call on "version" field of control
> > + *   region.
> > + *   (2) vendor driver writes version number of device state interfaces
> > + *   to the "version" field of control region.
> > + *
> > + * 2. get caps
> > + *   (1) user space calls read system call on "caps" field of control region.
> > + *   (2) if a VFIO device has huge device memory, vendor driver reports
> > + *      VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY in "caps" field of control region.
> > + *      if a VFIO device produces dirty pages in system memory, vendor driver
> > + *      reports VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY in "caps" field of
> > + *      control region.
> > + *
> > + * 3. set device state
> > + *    (1) user space calls write system call on "device_state" field of
> > + *    control region.
> > + *    (2) device state transitions as:
> > + *
> > + *    RUNNING -- start dirty data logging --> RUNNING & LOGGING
> > + *    RUNNING -- deactivate --> STOP
> > + *    RUNNING -- deactivate & start dirty data longging --> STOP & LOGGING
> > + *    RUNNING & LOGGING -- stop dirty data logging --> RUNNING
> > + *    RUNNING & LOGGING -- deactivate --> STOP & LOGGING
> > + *    RUNNING & LOGGING -- deactivate & stop dirty data logging --> STOP
> > + *    STOP -- activate --> RUNNING
> > + *    STOP -- start dirty data logging --> STOP & LOGGING
> > + *    STOP -- activate & start dirty data logging --> RUNNING & LOGGING
> > + *    STOP & LOGGING -- stop dirty data logging --> STOP
> > + *    STOP & LOGGING -- activate --> RUNNING & LOGGING
> > + *    STOP & LOGGING -- activate & stop dirty data logging --> RUNNING
> > + *
> > + * 4. get device config size
> > + *   (1) user space calls read system call on "device_config.size" field of
> > + *       control region for the total size of device config snapshot.
> > + *   (2) vendor driver writes device config data's total size in
> > + *       "device_config.size" field of control region.
> > + *
> > + * 5. set device config size
> > + *   (1) user space calls write system call.
> > + *       total size of device config snapshot --> "device_config.size" field
> > + *       of control region.
> > + *   (2) vendor driver reads device config data's total size from
> > + *       "device_config.size" field of control region.
> > + *
> > + * 6 get device config buffer
> > + *   (1) user space calls write system call.
> > + *       "GET_BUFFER" --> "device_config.action" field of control region.
> > + *   (2) vendor driver
> > + *       a. gets whole snapshot for device config
> > + *       b. writes whole device config snapshot to region
> > + *       DEVICE_CONFIG.
> > + *   (3) user space reads the whole of device config snapshot from region
> > + *       DEVICE_CONFIG.
> > + *
> > + * 7. set device config buffer
> > + *   (1) user space writes whole of device config data to region
> > + *       DEVICE_CONFIG.
> > + *   (2) user space calls write system call.
> > + *       "SET_BUFFER" --> "device_config.action" field of control region.
> > + *   (3) vendor driver loads whole of device config from region DEVICE_CONFIG.
> > + *
> > + * 8. get device memory size
> > + *   (1) user space calls read system call on "device_memory.size" field of
> > + *       control region for device memory size.
> > + *   (2) vendor driver
> > + *       a. gets device memory snapshot (in state RUNNING or STOP), or
> > + *          gets device memory dirty data (in state RUNNING & LOGGING or
> > + *          state STOP & LOGGING)
> > + *       b. writes size in "device_memory.size" field of control region
> > + *
> > + * 9. set device memory size
> > + *   (1) user space calls write system call on "device_memory.size" field of
> > + *       control region to set total size of device memory snapshot.
> > + *   (2) vendor driver reads device memory's size from "device_memory.size"
> > + *       field of control region.
> > + *
> > + *
> > + * 10. get device memory buffer
> > + *   (1) user space calls write system.
> > + *       pos --> "device_memory.pos" field of control region,
> > + *       "GET_BUFFER" --> "device_memory.action" field of control region.
> > + *       (pos must be 0 or multiples of length of region DEVICE_MEMORY).
> > + *   (2) vendor driver writes N'th chunk of device memory snapshot/dirty data
> > + *       to region DEVICE_MEMORY.
> > + *       (N equals to pos/(region length of DEVICE_MEMORY))
> > + *   (3) user space reads the N'th chunk of device memory snapshot/dirty data
> > + *       from region DEVICE_MEMORY.
> > + *
> > + * 11. set device memory buffer
> > + *   (1) user space writes N'th chunk of device memory snapshot/dirty data to
> > + *       region DEVICE_MEMORY.
> > + *       (N equals to pos/(region length of DEVICE_MEMORY))
> > + *   (2) user space writes pos to "device_memory.pos" field and writes
> > + *       "SET_BUFFER" to "device_memory.action" field of control region.
> > + *   (3) vendor driver loads N'th chunk of device memory snapshot/dirty data
> > + *       from region DEVICE_MEMORY.
> > + *
> > + * 12. get system memory dirty bitmap
> > + *   (1) user space calls write system call to specify a range of system
> > + *       memory that querying dirty pages.
> > + *       system memory's start address --> "system_memory.start_addr" field
> > + *       of control region,
> > + *       system memory's page count --> "system_memory.page_nr" field of
> > + *       control region.
> > + *   (2) if device state is not in RUNNING or STOP & LOGGING,
> > + *       vendor driver returns empty bitmap; otherwise,
> > + *       vendor driver checks the page_nr,
> > + *       if it's larger than the size that region DIRTY_BITMAP can support,
> > + *       error returns; if not,
> > + *       vendor driver returns as bitmap to specify dirty pages that
> > + *       device produces since last query in this range of system memory .
> > + *   (3) usespace reads back the dirty bitmap from region DIRTY_BITMAP.
> > + *
> > + */
> 
> It might make sense to extract the explanations above into a separate
> design document in the kernel Documentation/ directory. You could also
> add ASCII art there :)
>
yes, a diagram is better:)

> > +
> > +struct vfio_device_state_ctl {
> > +	__u32 version;		  /* ro versio of devcie state interfaces*/
> 
> s/versio/version/
> s/devcie/device/
> 
thanks~
> > +	__u32 device_state;       /* VFIO device state, wo */
> > +	__u32 caps;		 /* ro */
> > +        struct {
> 
> Indentation looks a bit off.
> 
> > +		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> > +		__u64 size;    /*rw, total size of device config*/
> > +	} device_config;
> > +	struct {
> > +		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> > +		__u64 size;     /* rw, total size of device memory*/
> > +        __u64 pos;/*chunk offset in total buffer of device memory*/
> 
> Here as well.
> 
thanks~
> > +	} device_memory;
> > +	struct {
> > +		__u64 start_addr; /* wo */
> > +		__u64 page_nr;   /* wo */
> > +	} system_memory;
> > +}__attribute__((packed));
> 
> For an interface definition, it's probably better to avoid packed and
> instead add padding if needed.
> 
ok. So just removing the __attribute__((packed)) is enough for this
interface.

> > +
> >  /* ***************************************************************** */
> >  
> >  #endif /* VFIO_H */
> 
> On the whole, I think this is moving into the right direction.

^ permalink raw reply	[flat|nested] 133+ messages in thread

> 
> Hm. Is this supposed to be backwards-compatible, should we need to bump
> this?
>
Currently there is no backwards compatibility; we can discuss that.

> > +
> > +/*
> > + * For devices that have devcie memory, it is required to expose
> 
> s/devcie/device/
> 
> > + * DEVICE_MEMORY capability.
> > + *
> > + * For devices producing dirty pages in system memory, it is required to
> > + * expose cap SYSTEM_MEMORY in order to get dirty bitmap in certain range
> > + * of system memory.
> > + */
> > +#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > +#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > +
> > +/*
> > + * DEVICE STATES
> > + *
> > + * Four states are defined for a VFIO device:
> > + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> > + * They can be set by writing to device_state field of
> > + * vfio_device_state_ctl region.
> 
> Who controls this? Userspace?

Yes. Userspace notifies the vendor driver to do the state switching.

> > + *
> > + * RUNNING: In this state, a VFIO device is in active state ready to
> > + * receive commands from device driver.
> > + * It is the default state that a VFIO device enters initially.
> > + *
> > + * STOP: In this state, a VFIO device is deactivated to interact with
> > + * device driver.
> 
> I think 'STOPPED' would read nicer.
> 
sounds better :)

> > + *
> > + * LOGGING state is a special state that it CANNOT exist
> > + * independently.
> 
> So it's not a state, but rather a modifier?
> 
Yes. Or think of LOGGING/not-LOGGING as bit 1 of the device state,
with RUNNING/STOPPED as bit 0 of the device state.
They have to be set/got as a whole.


> > + * It must be set alongside with state RUNNING or STOP, i.e,
> > + * RUNNING & LOGGING, STOP & LOGGING.
> > + * It is used for dirty data logging both for device memory
> > + * and system memory.
> > + *
> > + * LOGGING only impacts device/system memory. In LOGGING state, get buffer
> > + * of device memory returns dirty pages since last call; outside LOGGING
> > + * state, get buffer of device memory returns whole snapshot of device
> > + * memory. system memory's dirty page is only available in LOGGING state.
> > + *
> > + * Device config should be always accessible and return whole config snapshot
> > + * regardless of LOGGING state.
> > + * */
> > +#define VFIO_DEVICE_STATE_RUNNING 0
> > +#define VFIO_DEVICE_STATE_STOP 1
> > +#define VFIO_DEVICE_STATE_LOGGING 2
> > +
> > +/* action to get data from device memory or device config
> > + * the action is write to device state's control region, and data is read
> > + * from device memory region or device config region.
> > + * Each time before read device memory region or device config region,
> > + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> > + * field in control region. That is because device memory and devie config
> > + * region is mmaped into user space. vendor driver has to be notified of
> > + * the the GET_BUFFER action in advance.
> > + */
> > +#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > +
> > +/* action to set data to device memory or device config
> > + * the action is write to device state's control region, and data is
> > + * written to device memory region or device config region.
> > + * Each time after write to device memory region or device config region,
> > + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to write to action
> > + * field in control region. That is because device memory and devie config
> > + * region is mmaped into user space. vendor driver has to be notified of
> > + * the the SET_BUFFER action after data written.
> > + */
> > +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> 
> Let me describe this in my own words to make sure that I understand
> this correctly.
> 
> - The actions are set by userspace to notify the kernel that it is
>   going to get data or that it has just written data.
> - This is needed as a notification that the mmapped data should not be
>   changed resp. just has changed.
We need this notification because when userspace reads the mmapped data,
it reads through the pointer returned from mmap(). When userspace reads
via that pointer, there is no page fault and no read/write system call,
so the vendor driver does not know whether a read/write operation
happened.
Therefore, before userspace reads through the mmap pointer, it first
writes the action field in the control region (through a write system
call), and the vendor driver does not return from that write system call
until the data is prepared.

When userspace writes through the mmap pointer, it writes data to the
data region first, then writes the action field in the control region
(through a write system call) to notify the vendor driver. The vendor
driver returns from the system call after it has copied the buffer
completely.
> 
> So, how does the kernel know whether the read action has finished resp.
> whether the write action has started? Even if userspace reads/writes it
> as a whole.
> 
The kernel does not touch the data region except in response to the
"action" write system call.
> > +
> > +/* layout of device state interfaces' control region
> > + * By reading to control region and reading/writing data from device config
> > + * region, device memory region, system memory regions, below interface can
> > + * be implemented:
> > + *
> > + * 1. get version
> > + *   (1) user space calls read system call on "version" field of control
> > + *   region.
> > + *   (2) vendor driver writes version number of device state interfaces
> > + *   to the "version" field of control region.
> > + *
> > + * 2. get caps
> > + *   (1) user space calls read system call on "caps" field of control region.
> > + *   (2) if a VFIO device has huge device memory, vendor driver reports
> > + *      VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY in "caps" field of control region.
> > + *      if a VFIO device produces dirty pages in system memory, vendor driver
> > + *      reports VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY in "caps" field of
> > + *      control region.
> > + *
> > + * 3. set device state
> > + *    (1) user space calls write system call on "device_state" field of
> > + *    control region.
> > + *    (2) device state transitions as:
> > + *
> > + *    RUNNING -- start dirty data logging --> RUNNING & LOGGING
> > + *    RUNNING -- deactivate --> STOP
> > + *    RUNNING -- deactivate & start dirty data longging --> STOP & LOGGING
> > + *    RUNNING & LOGGING -- stop dirty data logging --> RUNNING
> > + *    RUNNING & LOGGING -- deactivate --> STOP & LOGGING
> > + *    RUNNING & LOGGING -- deactivate & stop dirty data logging --> STOP
> > + *    STOP -- activate --> RUNNING
> > + *    STOP -- start dirty data logging --> STOP & LOGGING
> > + *    STOP -- activate & start dirty data logging --> RUNNING & LOGGING
> > + *    STOP & LOGGING -- stop dirty data logging --> STOP
> > + *    STOP & LOGGING -- activate --> RUNNING & LOGGING
> > + *    STOP & LOGGING -- activate & stop dirty data logging --> RUNNING
> > + *
> > + * 4. get device config size
> > + *   (1) user space calls read system call on "device_config.size" field of
> > + *       control region for the total size of device config snapshot.
> > + *   (2) vendor driver writes device config data's total size in
> > + *       "device_config.size" field of control region.
> > + *
> > + * 5. set device config size
> > + *   (1) user space calls write system call.
> > + *       total size of device config snapshot --> "device_config.size" field
> > + *       of control region.
> > + *   (2) vendor driver reads device config data's total size from
> > + *       "device_config.size" field of control region.
> > + *
> > + * 6 get device config buffer
> > + *   (1) user space calls write system call.
> > + *       "GET_BUFFER" --> "device_config.action" field of control region.
> > + *   (2) vendor driver
> > + *       a. gets whole snapshot for device config
> > + *       b. writes whole device config snapshot to region
> > + *       DEVICE_CONFIG.
> > + *   (3) user space reads the whole of device config snapshot from region
> > + *       DEVICE_CONFIG.
> > + *
> > + * 7. set device config buffer
> > + *   (1) user space writes whole of device config data to region
> > + *       DEVICE_CONFIG.
> > + *   (2) user space calls write system call.
> > + *       "SET_BUFFER" --> "device_config.action" field of control region.
> > + *   (3) vendor driver loads whole of device config from region DEVICE_CONFIG.
> > + *
> > + * 8. get device memory size
> > + *   (1) user space calls read system call on "device_memory.size" field of
> > + *       control region for device memory size.
> > + *   (2) vendor driver
> > + *       a. gets device memory snapshot (in state RUNNING or STOP), or
> > + *          gets device memory dirty data (in state RUNNING & LOGGING or
> > + *          state STOP & LOGGING)
> > + *       b. writes size in "device_memory.size" field of control region
> > + *
> > + * 9. set device memory size
> > + *   (1) user space calls write system call on "device_memory.size" field of
> > + *       control region to set total size of device memory snapshot.
> > + *   (2) vendor driver reads device memory's size from "device_memory.size"
> > + *       field of control region.
> > + *
> > + *
> > + * 10. get device memory buffer
> > + *   (1) user space calls write system.
> > + *       pos --> "device_memory.pos" field of control region,
> > + *       "GET_BUFFER" --> "device_memory.action" field of control region.
> > + *       (pos must be 0 or multiples of length of region DEVICE_MEMORY).
> > + *   (2) vendor driver writes N'th chunk of device memory snapshot/dirty data
> > + *       to region DEVICE_MEMORY.
> > + *       (N equals to pos/(region length of DEVICE_MEMORY))
> > + *   (3) user space reads the N'th chunk of device memory snapshot/dirty data
> > + *       from region DEVICE_MEMORY.
> > + *
> > + * 11. set device memory buffer
> > + *   (1) user space writes N'th chunk of device memory snapshot/dirty data to
> > + *       region DEVICE_MEMORY.
> > + *       (N equals to pos/(region length of DEVICE_MEMORY))
> > + *   (2) user space writes pos to "device_memory.pos" field and writes
> > + *       "SET_BUFFER" to "device_memory.action" field of control region.
> > + *   (3) vendor driver loads N'th chunk of device memory snapshot/dirty data
> > + *       from region DEVICE_MEMORY.
> > + *
> > + * 12. get system memory dirty bitmap
> > + *   (1) user space calls write system call to specify a range of system
> > + *       memory that querying dirty pages.
> > + *       system memory's start address --> "system_memory.start_addr" field
> > + *       of control region,
> > + *       system memory's page count --> "system_memory.page_nr" field of
> > + *       control region.
> > + *   (2) if device state is not in RUNNING or STOP & LOGGING,
> > + *       vendor driver returns empty bitmap; otherwise,
> > + *       vendor driver checks the page_nr,
> > + *       if it's larger than the size that region DIRTY_BITMAP can support,
> > + *       error returns; if not,
> > + *       vendor driver returns as bitmap to specify dirty pages that
> > + *       device produces since last query in this range of system memory .
> > + *   (3) usespace reads back the dirty bitmap from region DIRTY_BITMAP.
> > + *
> > + */
> 
> It might make sense to extract the explanations above into a separate
> design document in the kernel Documentation/ directory. You could also
> add ASCII art there :)
>
Yes, a diagram is better :)

> > +
> > +struct vfio_device_state_ctl {
> > +	__u32 version;		  /* ro versio of devcie state interfaces*/
> 
> s/versio/version/
> s/devcie/device/
> 
thanks~
> > +	__u32 device_state;       /* VFIO device state, wo */
> > +	__u32 caps;		 /* ro */
> > +        struct {
> 
> Indentation looks a bit off.
> 
> > +		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> > +		__u64 size;    /*rw, total size of device config*/
> > +	} device_config;
> > +	struct {
> > +		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> > +		__u64 size;     /* rw, total size of device memory*/
> > +        __u64 pos;/*chunk offset in total buffer of device memory*/
> 
> Here as well.
> 
thanks~
> > +	} device_memory;
> > +	struct {
> > +		__u64 start_addr; /* wo */
> > +		__u64 page_nr;   /* wo */
> > +	} system_memory;
> > +}__attribute__((packed));
> 
> For an interface definition, it's probably better to avoid packed and
> instead add padding if needed.
> 
OK, so just removing the __attribute__((packed)) is enough for this
interface.

> > +
> >  /* ***************************************************************** */
> >  
> >  #endif /* VFIO_H */
> 
> On the whole, I think this is moving into the right direction.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 5/5] vfio/migration: support device memory capability
  2019-02-19 14:42     ` [Qemu-devel] " Christophe de Dinechin
@ 2019-02-20  7:58       ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20  7:58 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: cjia, KVM list, Alexey Kardashevskiy, Zhengxiao.zx,
	shuangtai.tst, qemu-devel, Kirti Wankhede, eauger, yi.l.liu,
	Erik Skultety, ziye.yang, mlevitsk, Halil Pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	Alex Williamson, intel-gvt-dev, changpeng.liu, Cornelia Huck,
	Zhi Wang, jonathan.davies

On Tue, Feb 19, 2019 at 03:42:36PM +0100, Christophe de Dinechin wrote:
> 
> 
> > On 19 Feb 2019, at 09:53, Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > If a device has device memory capability, save/load data from device memory
> > in pre-copy and stop-and-copy phases.
> > 
> > LOGGING state is set for device memory for dirty page logging:
> > in LOGGING state, get device memory returns whole device memory snapshot;
> > outside LOGGING state, get device memory returns dirty data since last get
> > operation.
> > 
> > Usually, device memory is very big, qemu needs to chunk it into several
> > pieces each with size of device memory region.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > ---
> > hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > hw/vfio/pci.h       |   1 +
> > 2 files changed, 231 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 16d6395..f1e9309 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> >     return 0;
> > }
> > 
> > +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    uint64_t len;
> > +    int sz;
> > +
> > +    sz = sizeof(len);
> > +    if (pread(vbasedev->fd, &len, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to get length of device memory");
> 
> s/length/size/ ? (to be consistent with function name)

ok. thanks
> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = len;
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    int sz;
> > +
> > +    sz = sizeof(size);
> > +    if (pwrite(vbasedev->fd, &size, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to set length of device comemory");
> 
> What is comemory? Typo?

Right, typo. should be "memory" :)
> 
> Same comment about length vs size
>
got it. thanks

> > +        return -1;
> > +    }
> > +    vdev->migration->devmem_size = size;
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                    uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> 
> Is it intentional that there is no error_report here?
>
an error_report here may be better.
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer pos");
> > +        return -1;
> > +    }
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set save buffer action");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> s/migrate/migration/ ?

yes, thanks
> > +            return -1;
> > +        }
> > +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> > +            error_report("vfio: error load device memory buffer");
> s/load/loading/ ?
error to load? :)

> > +            return -1;
> > +        }
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, buf, len);
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_put_be64(f, len);
> > +        qemu_put_be64(f, pos);
> > +        qemu_put_buffer(f, dest, len);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    uint64_t total_len = vdev->migration->devmem_size;
> > +    uint64_t pos = 0;
> > +
> > +    qemu_put_be64(f, total_len);
> > +    while (pos < total_len) {
> > +        uint64_t len = region_devmem->size;
> > +
> > +        if (pos + len >= total_len) {
> > +            len = total_len - pos;
> > +        }
> > +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> > +            return -1;
> > +        }
> 
> I don’t see where pos is incremented in this loop
> 
Yes, one line is missing: "pos += len;".
Currently the code has not been verified on hardware with the device
memory cap on.
Thanks :)
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> 
> error_report?

seems better to add error_report.
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device memory buffer pos");
> > +        return -1;
> > +    }
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_devmem->fd_offset) != len) {
> > +            error_report("vfio: Failed to load devie memory buffer");
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set load device memory buffer action");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +
> > +}
> > +
> > +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> > +                        QEMUFile *f, uint64_t total_len)
> > +{
> > +    uint64_t pos = 0, len = 0;
> > +
> > +    vfio_set_device_memory_size(vdev, total_len);
> > +
> > +    while (pos + len < total_len) {
> > +        len = qemu_get_be64(f);
> > +        pos = qemu_get_be64(f);
> 
> Nit: load reads len/pos in the loop, whereas save does it in the
> inner function (vfio_save_data_device_memory_chunk)
Right, load has to read len/pos in the loop.
> 
> > +
> > +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
> > static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> >         uint64_t start_addr, uint64_t page_nr)
> > {
> > @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> >         return;
> >     }
> > 
> > +    /* get dirty data size of device memory */
> > +    vfio_get_device_memory_size(vdev);
> > +
> > +    *res_precopy_only += vdev->migration->devmem_size;
> >     return;
> > }
> > 
> > @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >         return 0;
> >     }
> > 
> > -    return 0;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +    /* get dirty data of device memory */
> > +    return vfio_save_data_device_memory(vdev, f);
> > }
> > 
> > static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >             len = qemu_get_be64(f);
> >             vfio_load_data_device_config(vdev, f, len);
> >             break;
> > +        case VFIO_SAVE_FLAG_DEVMEMORY:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_memory(vdev, f, len);
> > +            break;
> >         default:
> >             ret = -EINVAL;
> >         }
> > @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >     VFIOPCIDevice *vdev = opaque;
> >     int rc = 0;
> > 
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> > +        /* get dirty data of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    }
> > +
> >     qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> >     vfio_pci_save_config(vdev, f);
> > 
> > @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > 
> > static int vfio_save_setup(QEMUFile *f, void *opaque)
> > {
> > +    int rc = 0;
> >     VFIOPCIDevice *vdev = opaque;
> > -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +        /* get whole snapshot of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    } else {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +    }
> > 
> >     vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> >             VFIO_DEVICE_STATE_LOGGING);
> > -    return 0;
> > +    return rc;
> > }
> > 
> > static int vfio_load_setup(QEMUFile *f, void *opaque)
> > @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> >         goto error;
> >     }
> > 
> > -    if (vfio_device_data_cap_device_memory(vdev)) {
> > -        error_report("No suppport of data cap device memory Yet");
> > +    if (vfio_device_data_cap_device_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> > +              "device-state-data-device-memory")) {
> >         goto error;
> >     }
> > 
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index 4b7b1bb..a2cc64b 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
> >     uint32_t data_caps;
> >     uint32_t device_state;
> >     uint64_t devconfig_size;
> > +    uint64_t devmem_size;
> >     VMChangeStateEntry *vm_state;
> > } VFIOMigration;
> > 
> > -- 
> > 2.7.4
> > 
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> 
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +    uint64_t total_len = vdev->migration->devmem_size;
> > +    uint64_t pos = 0;
> > +
> > +    qemu_put_be64(f, total_len);
> > +    while (pos < total_len) {
> > +        uint64_t len = region_devmem->size;
> > +
> > +        if (pos + len >= total_len) {
> > +            len = total_len - pos;
> > +        }
> > +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> > +            return -1;
> > +        }
> 
> I don’t see where pos is incremented in this loop
> 
yes, missing one line "pos += len;"
Currently, the code has not been verified on hardware with the device memory capability enabled.
Thanks:)
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static
> > +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> > +                                uint64_t pos, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_devmem =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> > +
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    if (len > region_devmem->size) {
> 
> error_report?

seems better to add error_report.
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(pos);
> > +    if (pwrite(vbasedev->fd, &pos, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device memory buffer pos");
> > +        return -1;
> > +    }
> > +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migrate");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_devmem->fd_offset) != len) {
> > +            error_report("vfio: Failed to load devie memory buffer");
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_devmem->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> > +            != sz) {
> > +        error_report("vfio: Failed to set load device memory buffer action");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +
> > +}
> > +
> > +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> > +                        QEMUFile *f, uint64_t total_len)
> > +{
> > +    uint64_t pos = 0, len = 0;
> > +
> > +    vfio_set_device_memory_size(vdev, total_len);
> > +
> > +    while (pos + len < total_len) {
> > +        len = qemu_get_be64(f);
> > +        pos = qemu_get_be64(f);
> 
> Nit: load reads len/pos in the loop, whereas save does it in the
> inner function (vfio_save_data_device_memory_chunk)
right, load has to read len/pos in the loop.
> 
> > +
> > +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
> > static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> >         uint64_t start_addr, uint64_t page_nr)
> > {
> > @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> >         return;
> >     }
> > 
> > +    /* get dirty data size of device memory */
> > +    vfio_get_device_memory_size(vdev);
> > +
> > +    *res_precopy_only += vdev->migration->devmem_size;
> >     return;
> > }
> > 
> > @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >         return 0;
> >     }
> > 
> > -    return 0;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +    /* get dirty data of device memory */
> > +    return vfio_save_data_device_memory(vdev, f);
> > }
> > 
> > static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >             len = qemu_get_be64(f);
> >             vfio_load_data_device_config(vdev, f, len);
> >             break;
> > +        case VFIO_SAVE_FLAG_DEVMEMORY:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_memory(vdev, f, len);
> > +            break;
> >         default:
> >             ret = -EINVAL;
> >         }
> > @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >     VFIOPCIDevice *vdev = opaque;
> >     int rc = 0;
> > 
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> > +        /* get dirty data of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    }
> > +
> >     qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> >     vfio_pci_save_config(vdev, f);
> > 
> > @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > 
> > static int vfio_save_setup(QEMUFile *f, void *opaque)
> > {
> > +    int rc = 0;
> >     VFIOPCIDevice *vdev = opaque;
> > -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> > +        /* get whole snapshot of device memory */
> > +        vfio_get_device_memory_size(vdev);
> > +        rc = vfio_save_data_device_memory(vdev, f);
> > +    } else {
> > +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +    }
> > 
> >     vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> >             VFIO_DEVICE_STATE_LOGGING);
> > -    return 0;
> > +    return rc;
> > }
> > 
> > static int vfio_load_setup(QEMUFile *f, void *opaque)
> > @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> >         goto error;
> >     }
> > 
> > -    if (vfio_device_data_cap_device_memory(vdev)) {
> > -        error_report("No suppport of data cap device memory Yet");
> > +    if (vfio_device_data_cap_device_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> > +              "device-state-data-device-memory")) {
> >         goto error;
> >     }
> > 
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index 4b7b1bb..a2cc64b 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
> >     uint32_t data_caps;
> >     uint32_t device_state;
> >     uint64_t devconfig_size;
> > +    uint64_t devmem_size;
> >     VMChangeStateEntry *vm_state;
> > } VFIOMigration;
> > 
> > -- 
> > 2.7.4
> > 
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 5/5] vfio/migration: support device memory capability
  2019-02-20  7:58       ` [Qemu-devel] " Zhao Yan
@ 2019-02-20 10:14         ` Christophe de Dinechin
  0 siblings, 0 replies; 133+ messages in thread
From: Christophe de Dinechin @ 2019-02-20 10:14 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, KVM list, Alexey Kardashevskiy, Zhengxiao.zx,
	shuangtai.tst, qemu-devel, Kirti Wankhede, eauger, yi.l.liu,
	Erik Skultety, ziye.yang, mlevitsk, Halil Pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	Alex Williamson, intel-gvt-dev, changpeng.liu, Cornelia Huck,
	Zhi Wang, jonathan.davies



> On 20 Feb 2019, at 08:58, Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> On Tue, Feb 19, 2019 at 03:42:36PM +0100, Christophe de Dinechin wrote:
>> 
>> 
>>> On 19 Feb 2019, at 09:53, Yan Zhao <yan.y.zhao@intel.com> wrote:
>>> 
>>> If a device has device memory capability, save/load data from device memory
>>> in pre-copy and stop-and-copy phases.
>>> 
>>> LOGGING state is set for device memory for dirty page logging:
>>> in LOGGING state, get device memory returns whole device memory snapshot;
>>> outside LOGGING state, get device memory returns dirty data since last get
>>> operation.
>>> 
>>> Usually, device memory is very big, qemu needs to chunk it into several
>>> pieces each with size of device memory region.
>>> 
>>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> ---
>>> hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>>> hw/vfio/pci.h       |   1 +
>>> 2 files changed, 231 insertions(+), 5 deletions(-)
>>> 
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 16d6395..f1e9309 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
>>>    return 0;
>>> }
>>> 
>>> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
>>> +{
>>> +    VFIODevice *vbasedev = &vdev->vbasedev;
>>> +    VFIORegion *region_ctl =
>>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
>>> +    uint64_t len;
>>> +    int sz;
>>> +
>>> +    sz = sizeof(len);
>>> +    if (pread(vbasedev->fd, &len, sz,
>>> +                region_ctl->fd_offset +
>>> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
>>> +            != sz) {
>>> +        error_report("vfio: Failed to get length of device memory”);
>> 
>> s/length/size/ ? (to be consistent with function name)
> 
> ok. thanks
>>> +        return -1;
>>> +    }
>>> +    vdev->migration->devmem_size = len;
>>> +    return 0;
>>> +}
>>> +
>>> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
>>> +{
>>> +    VFIODevice *vbasedev = &vdev->vbasedev;
>>> +    VFIORegion *region_ctl =
>>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
>>> +    int sz;
>>> +
>>> +    sz = sizeof(size);
>>> +    if (pwrite(vbasedev->fd, &size, sz,
>>> +                region_ctl->fd_offset +
>>> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
>>> +            != sz) {
>>> +        error_report("vfio: Failed to set length of device comemory”);
>> 
>> What is comemory? Typo?
> 
> Right, typo. should be "memory" :)
>> 
>> Same comment about length vs size
>> 
> got it. thanks
> 
>>> +        return -1;
>>> +    }
>>> +    vdev->migration->devmem_size = size;
>>> +    return 0;
>>> +}
>>> +
>>> +static
>>> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
>>> +                                    uint64_t pos, uint64_t len)
>>> +{
>>> +    VFIODevice *vbasedev = &vdev->vbasedev;
>>> +    VFIORegion *region_ctl =
>>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
>>> +    VFIORegion *region_devmem =
>>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
>>> +    void *dest;
>>> +    uint32_t sz;
>>> +    uint8_t *buf = NULL;
>>> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
>>> +
>>> +    if (len > region_devmem->size) {
>> 
>> Is it intentional that there is no error_report here?
>> 
> an error_report here may be better.
>>> +        return -1;
>>> +    }
>>> +
>>> +    sz = sizeof(pos);
>>> +    if (pwrite(vbasedev->fd, &pos, sz,
>>> +                region_ctl->fd_offset +
>>> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
>>> +            != sz) {
>>> +        error_report("vfio: Failed to set save buffer pos");
>>> +        return -1;
>>> +    }
>>> +    sz = sizeof(action);
>>> +    if (pwrite(vbasedev->fd, &action, sz,
>>> +                region_ctl->fd_offset +
>>> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
>>> +            != sz) {
>>> +        error_report("vfio: Failed to set save buffer action");
>>> +        return -1;
>>> +    }
>>> +
>>> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
>>> +        buf = g_malloc(len);
>>> +        if (buf == NULL) {
>>> +            error_report("vfio: Failed to allocate memory for migrate”);
>> s/migrate/migration/ ?
> 
> yes, thanks
>>> +            return -1;
>>> +        }
>>> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
>>> +            error_report("vfio: error load device memory buffer”);
>> s/load/loading/ ?
> error to load? :)

I’d check with a native speaker, but I believe it’s “error loading”.

To me (to be checked), the two sentences don’t have the same meaning:

“It is an error to load device memory buffer” -> “You are not allowed to do that”
“I had an error loading device memory buffer” -> “I tried, but it failed”

> 
>>> +            return -1;
>>> +        }
>>> +        qemu_put_be64(f, len);
>>> +        qemu_put_be64(f, pos);
>>> +        qemu_put_buffer(f, buf, len);
>>> +        g_free(buf);
>>> +    } else {
>>> +        dest = region_devmem->mmaps[0].mmap;
>>> +        qemu_put_be64(f, len);
>>> +        qemu_put_be64(f, pos);
>>> +        qemu_put_buffer(f, dest, len);
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>> +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
>>> +{
>>> +    VFIORegion *region_devmem =
>>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
>>> +    uint64_t total_len = vdev->migration->devmem_size;
>>> +    uint64_t pos = 0;
>>> +
>>> +    qemu_put_be64(f, total_len);
>>> +    while (pos < total_len) {
>>> +        uint64_t len = region_devmem->size;
>>> +
>>> +        if (pos + len >= total_len) {
>>> +            len = total_len - pos;
>>> +        }
>>> +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
>>> +            return -1;
>>> +        }
>> 
>> I don’t see where pos is incremented in this loop
>> 
> yes, missing one line "pos += len;"
> Currently, code is not verified in hardware with device memory cap on.
> Thanks:)
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static
>>> +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
>>> +                                uint64_t pos, uint64_t len)
>>> +{
>>> +    VFIODevice *vbasedev = &vdev->vbasedev;
>>> +    VFIORegion *region_ctl =
>>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
>>> +    VFIORegion *region_devmem =
>>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
>>> +
>>> +    void *dest;
>>> +    uint32_t sz;
>>> +    uint8_t *buf = NULL;
>>> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
>>> +
>>> +    if (len > region_devmem->size) {
>> 
>> error_report?
> 
> seems better to add error_report.
>>> +        return -1;
>>> +    }
>>> +
>>> +    sz = sizeof(pos);
>>> +    if (pwrite(vbasedev->fd, &pos, sz,
>>> +                region_ctl->fd_offset +
>>> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
>>> +            != sz) {
>>> +        error_report("vfio: Failed to set device memory buffer pos");
>>> +        return -1;
>>> +    }
>>> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
>>> +        buf = g_malloc(len);
>>> +        if (buf == NULL) {
>>> +            error_report("vfio: Failed to allocate memory for migrate");
>>> +            return -1;
>>> +        }
>>> +        qemu_get_buffer(f, buf, len);
>>> +        if (pwrite(vbasedev->fd, buf, len,
>>> +                    region_devmem->fd_offset) != len) {
>>> +            error_report("vfio: Failed to load devie memory buffer");
>>> +            return -1;
>>> +        }
>>> +        g_free(buf);
>>> +    } else {
>>> +        dest = region_devmem->mmaps[0].mmap;
>>> +        qemu_get_buffer(f, dest, len);
>>> +    }
>>> +
>>> +    sz = sizeof(action);
>>> +    if (pwrite(vbasedev->fd, &action, sz,
>>> +                region_ctl->fd_offset +
>>> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
>>> +            != sz) {
>>> +        error_report("vfio: Failed to set load device memory buffer action");
>>> +        return -1;
>>> +    }
>>> +
>>> +    return 0;
>>> +
>>> +}
>>> +
>>> +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
>>> +                        QEMUFile *f, uint64_t total_len)
>>> +{
>>> +    uint64_t pos = 0, len = 0;
>>> +
>>> +    vfio_set_device_memory_size(vdev, total_len);
>>> +
>>> +    while (pos + len < total_len) {
>>> +        len = qemu_get_be64(f);
>>> +        pos = qemu_get_be64(f);
>> 
>> Nit: load reads len/pos in the loop, whereas save does it in the
>> inner function (vfio_save_data_device_memory_chunk)
> right, load has to read len/pos in the loop.
>> 
>>> +
>>> +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +
>>> static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
>>>        uint64_t start_addr, uint64_t page_nr)
>>> {
>>> @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
>>>        return;
>>>    }
>>> 
>>> +    /* get dirty data size of device memory */
>>> +    vfio_get_device_memory_size(vdev);
>>> +
>>> +    *res_precopy_only += vdev->migration->devmem_size;
>>>    return;
>>> }
>>> 
>>> @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>>        return 0;
>>>    }
>>> 
>>> -    return 0;
>>> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
>>> +    /* get dirty data of device memory */
>>> +    return vfio_save_data_device_memory(vdev, f);
>>> }
>>> 
>>> static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
>>> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>            len = qemu_get_be64(f);
>>>            vfio_load_data_device_config(vdev, f, len);
>>>            break;
>>> +        case VFIO_SAVE_FLAG_DEVMEMORY:
>>> +            len = qemu_get_be64(f);
>>> +            vfio_load_data_device_memory(vdev, f, len);
>>> +            break;
>>>        default:
>>>            ret = -EINVAL;
>>>        }
>>> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>    VFIOPCIDevice *vdev = opaque;
>>>    int rc = 0;
>>> 
>>> +    if (vfio_device_data_cap_device_memory(vdev)) {
>>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
>>> +        /* get dirty data of device memory */
>>> +        vfio_get_device_memory_size(vdev);
>>> +        rc = vfio_save_data_device_memory(vdev, f);
>>> +    }
>>> +
>>>    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
>>>    vfio_pci_save_config(vdev, f);
>>> 
>>> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>> 
>>> static int vfio_save_setup(QEMUFile *f, void *opaque)
>>> {
>>> +    int rc = 0;
>>>    VFIOPCIDevice *vdev = opaque;
>>> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
>>> +
>>> +    if (vfio_device_data_cap_device_memory(vdev)) {
>>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
>>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
>>> +        /* get whole snapshot of device memory */
>>> +        vfio_get_device_memory_size(vdev);
>>> +        rc = vfio_save_data_device_memory(vdev, f);
>>> +    } else {
>>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
>>> +    }
>>> 
>>>    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
>>>            VFIO_DEVICE_STATE_LOGGING);
>>> -    return 0;
>>> +    return rc;
>>> }
>>> 
>>> static int vfio_load_setup(QEMUFile *f, void *opaque)
>>> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
>>>        goto error;
>>>    }
>>> 
>>> -    if (vfio_device_data_cap_device_memory(vdev)) {
>>> -        error_report("No suppport of data cap device memory Yet");
>>> +    if (vfio_device_data_cap_device_memory(vdev) &&
>>> +            vfio_device_state_region_setup(vdev,
>>> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
>>> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
>>> +              "device-state-data-device-memory")) {
>>>        goto error;
>>>    }
>>> 
>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>> index 4b7b1bb..a2cc64b 100644
>>> --- a/hw/vfio/pci.h
>>> +++ b/hw/vfio/pci.h
>>> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
>>>    uint32_t data_caps;
>>>    uint32_t device_state;
>>>    uint64_t devconfig_size;
>>> +    uint64_t devmem_size;
>>>    VMChangeStateEntry *vm_state;
>>> } VFIOMigration;
>>> 
>>> -- 
>>> 2.7.4
>>> 
>>> _______________________________________________
>>> intel-gvt-dev mailing list
>>> intel-gvt-dev@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
>> 
>> _______________________________________________
>> intel-gvt-dev mailing list
>> intel-gvt-dev@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

>>> +    /* get dirty data of device memory */
>>> +    return vfio_save_data_device_memory(vdev, f);
>>> }
>>> 
>>> static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
>>> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>            len = qemu_get_be64(f);
>>>            vfio_load_data_device_config(vdev, f, len);
>>>            break;
>>> +        case VFIO_SAVE_FLAG_DEVMEMORY:
>>> +            len = qemu_get_be64(f);
>>> +            vfio_load_data_device_memory(vdev, f, len);
>>> +            break;
>>>        default:
>>>            ret = -EINVAL;
>>>        }
>>> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>    VFIOPCIDevice *vdev = opaque;
>>>    int rc = 0;
>>> 
>>> +    if (vfio_device_data_cap_device_memory(vdev)) {
>>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
>>> +        /* get dirty data of device memory */
>>> +        vfio_get_device_memory_size(vdev);
>>> +        rc = vfio_save_data_device_memory(vdev, f);
>>> +    }
>>> +
>>>    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
>>>    vfio_pci_save_config(vdev, f);
>>> 
>>> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>> 
>>> static int vfio_save_setup(QEMUFile *f, void *opaque)
>>> {
>>> +    int rc = 0;
>>>    VFIOPCIDevice *vdev = opaque;
>>> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
>>> +
>>> +    if (vfio_device_data_cap_device_memory(vdev)) {
>>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
>>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
>>> +        /* get whole snapshot of device memory */
>>> +        vfio_get_device_memory_size(vdev);
>>> +        rc = vfio_save_data_device_memory(vdev, f);
>>> +    } else {
>>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
>>> +    }
>>> 
>>>    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
>>>            VFIO_DEVICE_STATE_LOGGING);
>>> -    return 0;
>>> +    return rc;
>>> }
>>> 
>>> static int vfio_load_setup(QEMUFile *f, void *opaque)
>>> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
>>>        goto error;
>>>    }
>>> 
>>> -    if (vfio_device_data_cap_device_memory(vdev)) {
>>> -        error_report("No suppport of data cap device memory Yet");
>>> +    if (vfio_device_data_cap_device_memory(vdev) &&
>>> +            vfio_device_state_region_setup(vdev,
>>> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
>>> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
>>> +              "device-state-data-device-memory")) {
>>>        goto error;
>>>    }
>>> 
>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>> index 4b7b1bb..a2cc64b 100644
>>> --- a/hw/vfio/pci.h
>>> +++ b/hw/vfio/pci.h
>>> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
>>>    uint32_t data_caps;
>>>    uint32_t device_state;
>>>    uint64_t devconfig_size;
>>> +    uint64_t devmem_size;
>>>    VMChangeStateEntry *vm_state;
>>> } VFIOMigration;
>>> 
>>> -- 
>>> 2.7.4
>>> 
>>> _______________________________________________
>>> intel-gvt-dev mailing list
>>> intel-gvt-dev@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
>> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 2/5] vfio/migration: support device of device config capability
  2019-02-20  5:12       ` [Qemu-devel] " Zhao Yan
@ 2019-02-20 10:57         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-20 10:57 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, zhi.a.wang, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck, Ken.Xue,
	jonathan.davies

* Zhao Yan (yan.y.zhao@intel.com) wrote:
> On Tue, Feb 19, 2019 at 11:01:45AM +0000, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.zhao@intel.com) wrote:

<snip>

> > > +    msi_data = qemu_get_be32(f);
> > > +    vfio_pci_write_config(pdev,
> > > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > +            msi_data, 2);
> > > +
> > > +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > +            ctl | PCI_MSI_FLAGS_ENABLE, 2);
> > 
> > It would probably be best to use a VMStateDescription and the macros
> > for this if possible; I bet you'll want to add more fields in the future
> > for example.
> yes, it is also a good idea to use VMStateDescription for the general PCI
> config data, which is saved once in the stop-and-copy phase.
> But it's a little hard to maintain two types of interface.
> Maybe use the same approach as VMStateDescription, i.e. a field list and
> macros?

Maybe, but if you can actually use the VMStateDescription it would give
you the versioning and conditional fields and things like that.

> > Also what happens if the data read from the migration stream is bad or
> > doesn't agree with this devices hardware? How does this fail?
> >
> right, errors in the migration stream need to be checked. I'll add the
> error checks.
> For the hardware incompatibility issue, it seems vfio_pci_write_config
> does not return an error even if the device driver fails this write.

It would be great if we could make that fail properly.

> It seems this needs to be addressed somewhere else, e.g. by checking
> compatibility before launching migration.

That would be good; but we should still guard against it being wrong
if possible - these type of checks for compatibility are probably quite
difficult.

Dave

> > > +}
> > > +
> > > +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > > +{
> > > +    VFIOPCIDevice *vdev = opaque;
> > > +    int flag;
> > > +    uint64_t len;
> > > +    int ret = 0;
> > > +
> > > +    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    do {
> > > +        flag = qemu_get_byte(f);
> > > +
> > > +        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
> > > +        case VFIO_SAVE_FLAG_SETUP:
> > > +            break;
> > > +        case VFIO_SAVE_FLAG_PCI:
> > > +            vfio_pci_load_config(vdev, f);
> > > +            break;
> > > +        case VFIO_SAVE_FLAG_DEVCONFIG:
> > > +            len = qemu_get_be64(f);
> > > +            vfio_load_data_device_config(vdev, f, len);
> > > +            break;
> > > +        default:
> > > +            ret = -EINVAL;
> > > +        }
> > > +    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
> > > +
> > > +    return ret;
> > > +}
> > > +
> > > +static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > > +{
> > > +    PCIDevice *pdev = &vdev->pdev;
> > > +    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > > +    bool msi_64bit;
> > > +
> > > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > +        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > > +        qemu_put_be32(f, bar_cfg);
> > > +    }
> > > +
> > > +    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > > +    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
> > > +
> > > +    msi_lo = pci_default_read_config(pdev,
> > > +            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > > +    qemu_put_be32(f, msi_lo);
> > > +
> > > +    if (msi_64bit) {
> > > +        msi_hi = pci_default_read_config(pdev,
> > > +                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > +                4);
> > > +        qemu_put_be32(f, msi_hi);
> > > +    }
> > > +
> > > +    msi_data = pci_default_read_config(pdev,
> > > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > +            2);
> > > +    qemu_put_be32(f, msi_data);
> > > +
> > > +}
> > > +
> > > +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > > +{
> > > +    VFIOPCIDevice *vdev = opaque;
> > > +    int rc = 0;
> > > +
> > > +    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> > > +    vfio_pci_save_config(vdev, f);
> > > +
> > > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
> > > +    rc += vfio_get_device_config_size(vdev);
> > > +    rc += vfio_save_data_device_config(vdev, f);
> > > +
> > > +    return rc;
> > > +}
> > > +
> > > +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > > +{
> > > +    VFIOPCIDevice *vdev = opaque;
> > > +    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > > +
> > > +    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> > > +            VFIO_DEVICE_STATE_LOGGING);
> > > +    return 0;
> > > +}
> > > +
> > > +static int vfio_load_setup(QEMUFile *f, void *opaque)
> > > +{
> > > +    return 0;
> > > +}
> > > +
> > > +static void vfio_save_cleanup(void *opaque)
> > > +{
> > > +    VFIOPCIDevice *vdev = opaque;
> > > +    uint32_t dev_state = vdev->migration->device_state;
> > > +
> > > +    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
> > > +
> > > +    vfio_set_device_state(vdev, dev_state);
> > > +}
> > > +
> > > +static SaveVMHandlers savevm_vfio_handlers = {
> > > +    .save_setup = vfio_save_setup,
> > > +    .save_live_pending = vfio_save_live_pending,
> > > +    .save_live_iterate = vfio_save_iterate,
> > > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> > > +    .save_cleanup = vfio_save_cleanup,
> > > +    .load_setup = vfio_load_setup,
> > > +    .load_state = vfio_load_state,
> > > +};
> > > +
> > > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> > > +{
> > > +    int ret;
> > > +    Error *local_err = NULL;
> > > +    vdev->migration = g_new0(VFIOMigration, 1);
> > > +
> > > +    if (vfio_device_state_region_setup(vdev,
> > > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
> > > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
> > > +              "device-state-ctl")) {
> > > +        goto error;
> > > +    }
> > > +
> > > +    if (vfio_check_devstate_version(vdev)) {
> > > +        goto error;
> > > +    }
> > > +
> > > +    if (vfio_get_device_data_caps(vdev)) {
> > > +        goto error;
> > > +    }
> > > +
> > > +    if (vfio_device_state_region_setup(vdev,
> > > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
> > > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
> > > +              "device-state-data-device-config")) {
> > > +        goto error;
> > > +    }
> > > +
> > > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > > +        error_report("No suppport of data cap device memory Yet");
> > > +        goto error;
> > > +    }
> > > +
> > > +    if (vfio_device_data_cap_system_memory(vdev) &&
> > > +            vfio_device_state_region_setup(vdev,
> > > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
> > > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
> > > +              "device-state-data-dirtybitmap")) {
> > > +        goto error;
> > > +    }
> > > +
> > > +    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> > > +
> > > +    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
> > > +            VFIO_DEVICE_STATE_INTERFACE_VERSION,
> > > +            &savevm_vfio_handlers,
> > > +            vdev);
> > > +
> > > +    vdev->migration->vm_state =
> > > +        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
> > > +
> > > +    return 0;
> > > +error:
> > > +    error_setg(&vdev->migration_blocker,
> > > +            "VFIO device doesn't support migration");
> > > +    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
> > > +    if (local_err) {
> > > +        error_propagate(errp, local_err);
> > > +        error_free(vdev->migration_blocker);
> > > +    }
> > > +
> > > +    g_free(vdev->migration);
> > > +    vdev->migration = NULL;
> > > +
> > > +    return ret;
> > > +}
> > > +
> > > +void vfio_migration_finalize(VFIOPCIDevice *vdev)
> > > +{
> > > +    if (vdev->migration) {
> > > +        int i;
> > > +        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
> > > +        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
> > > +        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
> > > +            vfio_region_finalize(&vdev->migration->region[i]);
> > > +        }
> > > +        g_free(vdev->migration);
> > > +        vdev->migration = NULL;
> > > +    } else if (vdev->migration_blocker) {
> > > +        migrate_del_blocker(vdev->migration_blocker);
> > > +        error_free(vdev->migration_blocker);
> > > +    }
> > > +}
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index c0cb1ec..b8e006b 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -37,7 +37,6 @@
> > >  
> > >  #define MSIX_CAP_LENGTH 12
> > >  
> > > -#define TYPE_VFIO_PCI "vfio-pci"
> > >  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> > >  
> > >  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> > > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > > index b1ae4c0..4b7b1bb 100644
> > > --- a/hw/vfio/pci.h
> > > +++ b/hw/vfio/pci.h
> > > @@ -19,6 +19,7 @@
> > >  #include "qemu/event_notifier.h"
> > >  #include "qemu/queue.h"
> > >  #include "qemu/timer.h"
> > > +#include "sysemu/sysemu.h"
> > >  
> > >  #define PCI_ANY_ID (~0)
> > >  
> > > @@ -56,6 +57,21 @@ typedef struct VFIOBAR {
> > >      QLIST_HEAD(, VFIOQuirk) quirks;
> > >  } VFIOBAR;
> > >  
> > > +enum {
> > > +    VFIO_DEVSTATE_REGION_CTL = 0,
> > > +    VFIO_DEVSTATE_REGION_DATA_CONFIG,
> > > +    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
> > > +    VFIO_DEVSTATE_REGION_DATA_BITMAP,
> > > +    VFIO_DEVSTATE_REGION_NUM,
> > > +};
> > > +typedef struct VFIOMigration {
> > > +    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
> > > +    uint32_t data_caps;
> > > +    uint32_t device_state;
> > > +    uint64_t devconfig_size;
> > > +    VMChangeStateEntry *vm_state;
> > > +} VFIOMigration;
> > > +
> > >  typedef struct VFIOVGARegion {
> > >      MemoryRegion mem;
> > >      off_t offset;
> > > @@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
> > >      VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
> > >      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
> > >      void *igd_opregion;
> > > +    VFIOMigration *migration;
> > > +    Error *migration_blocker;
> > >      PCIHostDeviceAddress host;
> > >      EventNotifier err_notifier;
> > >      EventNotifier req_notifier;
> > > @@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> > >  void vfio_display_reset(VFIOPCIDevice *vdev);
> > >  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> > >  void vfio_display_finalize(VFIOPCIDevice *vdev);
> > > -
> > > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
> > > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
> > > +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> > > +         uint64_t start_addr, uint64_t page_nr);
> > > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
> > > +void vfio_migration_finalize(VFIOPCIDevice *vdev);
> > >  #endif /* HW_VFIO_VFIO_PCI_H */
> > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > > index 1b434d0..ed43613 100644
> > > --- a/include/hw/vfio/vfio-common.h
> > > +++ b/include/hw/vfio/vfio-common.h
> > > @@ -32,6 +32,7 @@
> > >  #endif
> > >  
> > >  #define VFIO_MSG_PREFIX "vfio %s: "
> > > +#define TYPE_VFIO_PCI "vfio-pci"
> > >  
> > >  enum {
> > >      VFIO_DEVICE_TYPE_PCI = 0,
> > > -- 
> > > 2.7.4
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-20  5:28     ` [Qemu-devel] " Zhao Yan
@ 2019-02-20 11:01       ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-20 11:01 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, zhi.a.wang, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck, Ken.Xue,
	jonathan.davies

* Zhao Yan (yan.y.zhao@intel.com) wrote:
> On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > 
> > Hi,
> >   I've sent minor comments to later patches; but some minor general
> > comments:
> > 
> >   a) Never trust the incoming migrations stream - it might be corrupt,
> >     so check when you can.
> hi Dave
> Thanks for this suggestion. I'll add more checks for migration streams.
> 
> 
> >   b) How do we detect if we're migrating from/to the wrong device or
> > version of device?  Or say to a device with older firmware or perhaps
> > a device that has less device memory ?
> Actually it's still an open question for VFIO migration. We need to think
> about whether it's better to check that in libvirt or qemu (like a device
> magic along with version?).
> This patchset is intended to settle down the main device state interfaces
> for VFIO migration. So that we can work on that and improve it.
> 
> 
> >   c) Consider using the trace_ mechanism - it's really useful to
> > add to loops writing/reading data so that you can see when it fails.
> > 
> > Dave
> >
> Got it. many thanks~~
> 
> 
> > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > 'migrtion'.)
> 
> sorry :)

No problem.

Given the mails, I'm guessing you've mostly tested this on graphics
devices?  Have you also checked with VFIO network cards?

Also see the mail I sent in reply to Kirti's series; we need to boil
these down to one solution.

Dave

> > 
> > > Device Data
> > > -----------
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > >         Every device is supposed to possess device config data.
> > >         Usually device config's size is small (no bigger than 10 MB), and it
> > >         needs to be loaded in certain strict order.
> > >         Therefore, device config only needs to be saved/loaded in
> > >         stop-and-copy phase.
> > >         The data of device config is held in device config region.
> > >         Size of device config data is smaller than or equal to that of
> > >         device config region.
> > > 
> > > Device Memory: device's internal memory, standalone and outside system
> > >         memory. It is usually very big.
> > >         This kind of data needs to be saved / loaded in pre-copy and
> > >         stop-and-copy phase.
> > >         The data of device memory is held in device memory region.
> > >         Size of device memory is usually larger than that of device
> > >         memory region. qemu needs to save/load it in chunks of the size
> > >         of the device memory region.
> > >         Not all devices have device memory. For example, IGD only uses
> > >         system memory.
> > > 
> > > System memory dirty pages: If a device produces dirty pages in system
> > >         memory, it is able to get dirty bitmap for certain range of system
> > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > >         callback, dirty pages in system memory will be saved/loaded by ram's
> > >         live migration code.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > >         If the system memory range is larger than the dirty bitmap region
> > >         can hold, qemu will cut it into several chunks and get the dirty
> > >         bitmap in succession.
> > > 
> > > 
> > > Device State Regions
> > > --------------------
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > > 
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > >         Get access via read/write system call.
> > >         Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.
> > >         device config region: mandatory, holding data of device config
> > >         device memory region: optional, holding data of device memory
> > >         dirty bitmap region: optional, holding bitmap of system memory
> > >                             dirty pages
> > > 
> > > (The reason why four separate regions are defined is that the unit of the mmap
> > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmaped regions for data seems better than one big region
> > > padded and sparse mmaped).
> > > 
> > > 
> > > kernel device state interface [1]
> > > --------------------------------------
> > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > 
> > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > #define VFIO_DEVICE_STATE_STOP 1
> > > #define VFIO_DEVICE_STATE_LOGGING 2
> > > 
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > 
> > > struct vfio_device_state_ctl {
> > > 	__u32 version;		  /* ro */
> > > 	__u32 device_state;       /* VFIO device state, wo */
> > > 	__u32 caps;		 /* ro */
> > >         struct {
> > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;    /*rw*/
> > > 	} device_config;
> > > 	struct {
> > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;     /* rw */  
> > >                 __u64 pos; /*the offset in total buffer of device memory*/
> > > 	} device_memory;
> > > 	struct {
> > > 		__u64 start_addr; /* wo */
> > > 		__u64 page_nr;   /* wo */
> > > 	} system_memory;
> > > };
> > > 
> > > Device States
> > > -------------
> > > After migration is initialized, qemu will set device state via writing to
> > > device_state field of control region.
> > > 
> > > Four states are defined for a VFIO device:
> > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > 
> > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > >         commands from device driver.
> > >         It is the default state that a VFIO device enters initially.
> > > 
> > > STOP:  In this state, a VFIO device is deactivated and no longer
> > >        interacts with the device driver.
> > > 
> > > LOGGING: a special state that CANNOT exist independently. It must be
> > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > >        STOP & LOGGING).
> > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > >        driver can start dirty data logging for device memory and system
> > >        memory.
> > >        LOGGING only impacts device/system memory. Both return a whole
> > >        snapshot outside LOGGING, and only dirty data since the last get
> > >        operation inside LOGGING.
> > >        Device config should be always accessible and return whole config
> > >        snapshot regardless of LOGGING state.
> > >        
> > > Note:
> > > The reason why RUNNING is the default state is that device's active state
> > > must not depend on device state interface.
> > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > In that condition, a device needs to be in an active state by default. 
> > > 
> > > Get Version & Get Caps
> > > ----------------------
> > > On migration init phase, qemu will probe the existence of device state
> > > regions of vendor driver, then get version of the device state interface
> > > from the r/w control region.
> > > 
> > > Then it will probe VFIO device's data capability by reading caps field of
> > > control region.
> > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > >         device memory in pre-copy and stop-and-copy phase. The data of
> > >         device memory is held in device memory region.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > 
> > > If qemu fails to find the two mandatory regions or the optional data
> > > regions corresponding to the data caps, or if the version mismatches,
> > > it will set up a migration blocker and disable live migration for the
> > > VFIO device.
> > > 
> > > 
> > > Flows to call device state interface for VFIO live migration
> > > ------------------------------------------------------------
> > > 
> > > Live migration save path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_SAVE_SETUP
> > >  |
> > >  .save_setup callback -->
> > >  get device memory size (whole snapshot size)
> > >  get device memory buffer (whole snapshot data)
> > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > >  .save_live_pending callback --> get device memory size (dirty data)
> > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > >  .log_sync callback --> get system memory dirty bitmap
> > >  |
> > > (vcpu stops) --> set device state -->
> > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > .save_live_complete_precopy callback -->
> > >  get device memory size (dirty data)
> > >  get device memory buffer (dirty data)
> > >  get device config size (whole snapshot size)
> > >  get device config buffer (whole snapshot data)
> > >  |
> > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > MIGRATION_STATUS_CANCELLED or
> > > MIGRATION_STATUS_FAILED
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > 
> > > 
> > > Live migration load path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > > .load state callback -->
> > >  set device memory size, set device memory buffer, set device config size,
> > >  set device config buffer
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > 
> > > 
> > > In source VM side,
> > > In precopy phase,
> > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > qemu will first get whole snapshot of device memory in .save_setup
> > > callback, and then it will get total size of dirty data in device memory in
> > > .save_live_pending callback by reading device_memory.size field of control
> > > region.
> > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > dirty data chunk by chunk from device memory region by writing pos &
> > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > control region. (size of each chunk is the size of device memory data
> > > region).
> > > .save_live_pending and .save_live_iteration may be called several times in
> > > precopy phase to get dirty data in device memory.
> > > 
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > vendor driver's device state interface to get data from device memory.
> > > 
> > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > region by writing system memory's start address, page count and action 
> > > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > > "system_memory.action" fields of control region.
> > > If the page count passed in the .log_sync callback is larger than the
> > > bitmap size the dirty bitmap region supports, qemu will cut it into
> > > chunks and call the vendor driver's get-dirty-bitmap interface for each.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> > > returns without calling the vendor driver.
> > > 
> > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > In the save_live_complete_precopy callback,
> > > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > get device memory size and get device memory buffer will be called again.
> > > After that,
> > > device config data is read from the device config region by reading
> > > device_config.size of the control region and writing action (GET_BUFFER)
> > > to device_config.action of the control region.
> > > Then after migration completes, in the cleanup handler, LOGGING state
> > > will be cleared (i.e. device state is set to STOP).
> > > Clearing LOGGING state in the cleanup handler also covers the cases of
> > > "migration failed" and "migration cancelled", which can leverage
> > > the same cleanup handler to unset LOGGING state.
> > > 
> > > 
> > > References
> > > ----------
> > > 1. kernel side implementation of Device state interfaces:
> > > https://patchwork.freedesktop.org/series/56876/
> > > 
> > > 
> > > Yan Zhao (5):
> > >   vfio/migration: define kernel interfaces
> > >   vfio/migration: support device of device config capability
> > >   vfio/migration: tracking of dirty page in system memory
> > >   vfio/migration: turn on migration
> > >   vfio/migration: support device memory capability
> > > 
> > >  hw/vfio/Makefile.objs         |   2 +-
> > >  hw/vfio/common.c              |  26 ++
> > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > >  hw/vfio/pci.c                 |  10 +-
> > >  hw/vfio/pci.h                 |  26 +-
> > >  include/hw/vfio/vfio-common.h |   1 +
> > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > >  create mode 100644 hw/vfio/migration.c
> > > 
> > > -- 
> > > 2.7.4
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-20 11:01       ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2019-02-20 11:28         ` Gonglei (Arei)
  -1 siblings, 0 replies; 133+ messages in thread
From: Gonglei (Arei) @ 2019-02-20 11:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-02-20 11:28         ` Gonglei (Arei)
  0 siblings, 0 replies; 133+ messages in thread
From: Gonglei (Arei) @ 2019-02-20 11:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, alex.williamson,
	intel-gvt-dev, changpeng.liu, cohuck, zhi.a.wang,
	jonathan.davies


> -----Original Message-----
> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> Sent: Wednesday, February 20, 2019 7:02 PM
> To: Zhao Yan <yan.y.zhao@intel.com>
> Cc: cjia@nvidia.com; kvm@vger.kernel.org; aik@ozlabs.ru;
> Zhengxiao.zx@alibaba-inc.com; shuangtai.tst@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankhede@nvidia.com; eauger@redhat.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> mlevitsk@redhat.com; pasic@linux.ibm.com; Gonglei (Arei)
> <arei.gonglei@huawei.com>; felipe@nutanix.com; Ken.Xue@amd.com;
> kevin.tian@intel.com; alex.williamson@redhat.com;
> intel-gvt-dev@lists.freedesktop.org; changpeng.liu@intel.com;
> cohuck@redhat.com; zhi.a.wang@intel.com; jonathan.davies@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > >
> > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > >
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > >
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > >     so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> >
> >
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open question for VFIO migration. Need to think about
> > whether it's better to check that in libvirt or in qemu (like a device magic
> > along with a version?).

We must keep the hardware generation the same within one POD for public cloud
providers. But we are still thinking about live migration from a lower
generation of hardware to a higher generation.

> > This patchset is intended to settle down the main device state interfaces
> > for VFIO migration. So that we can work on that and improve it.
> >
> >
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > >
> > > Dave
> > >
> > Got it. many thanks~~
> >
> >
> > > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > > 'migrtion'.)
> >
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
> 
> Dave
> 
> > >
> > > > Device Data
> > > > -----------
> > > > Device data is divided into three types: device memory, device config,
> > > > and system memory dirty pages produced by device.
> > > >
> > > > Device config: data like MMIOs, page tables...
> > > >         Every device is supposed to possess device config data.
> > > >         Usually device config's size is small (no bigger than 10M), and it
> > > >         needs to be loaded in certain strict order.
> > > >         Therefore, device config only needs to be saved/loaded in
> > > >         stop-and-copy phase.
> > > >         The data of device config is held in device config region.
> > > >         Size of device config data is smaller than or equal to that of
> > > >         device config region.
> > > >
> > > > Device Memory: device's internal memory, standalone and outside system
> > > >         memory. It is usually very big.
> > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > >         stop-and-copy phase.
> > > >         The data of device memory is held in device memory region.
> > > >         Size of device memory is usually larger than that of the device
> > > >         memory region. qemu needs to save/load it in chunks of the size
> > > >         of the device memory region.
> > > >         Not all devices have device memory. For example, IGD only uses
> > > >         system memory.
> > > >
> > > > System memory dirty pages: If a device produces dirty pages in system
> > > >         memory, it is able to get a dirty bitmap for a certain range of
> > > >         system memory. This dirty bitmap is queried in pre-copy and
> > > >         stop-and-copy phase in the .log_sync callback. By setting the
> > > >         dirty bitmap in the .log_sync callback, dirty pages in system
> > > >         memory will be saved/loaded by ram's live migration code.
> > > >         The dirty bitmap of system memory is held in the dirty bitmap
> > > >         region.
> > > >         If the system memory range is larger than the dirty bitmap region
> > > >         can hold, qemu will cut it into several chunks and get the dirty
> > > >         bitmap in succession.
> > > >
> > > >
> > > > Device State Regions
> > > > --------------------
> > > > Vendor driver is required to expose two mandatory regions and another two
> > > > optional regions if it plans to support device state management.
> > > >
> > > > So, there are up to four regions in total.
> > > > One control region: mandatory.
> > > >         Get access via read/write system call.
> > > >         Its layout is defined in struct vfio_device_state_ctl
> > > > Three data regions: mmaped into qemu.
> > > >         device config region: mandatory, holding data of device config
> > > >         device memory region: optional, holding data of device memory
> > > >         dirty bitmap region: optional, holding bitmap of system memory
> > > >                             dirty pages
> > > >
> > > > (The reason why four separate regions are defined is that the unit of
> > > > the mmap
> > > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > > control and three mmaped regions for data seems better than one big
> > > > region padded and sparse mmaped).
> > > >
> > > >
> > > > kernel device state interface [1]
> > > > --------------------------------------
> > > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > >
> > > > #define VFIO_DEVICE_STATE_RUNNING 0
> > > > #define VFIO_DEVICE_STATE_STOP 1
> > > > #define VFIO_DEVICE_STATE_LOGGING 2
> > > >
> > > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > >
> > > > struct vfio_device_state_ctl {
> > > > 	__u32 version;		  /* ro */
> > > > 	__u32 device_state;       /* VFIO device state, wo */
> > > > 	__u32 caps;		 /* ro */
> > > >         struct {
> > > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> > > > 		__u64 size;    /*rw*/
> > > > 	} device_config;
> > > > 	struct {
> > > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> > > > 		__u64 size;     /* rw */
> > > >                 __u64 pos; /*the offset in total buffer of device
> memory*/
> > > > 	} device_memory;
> > > > 	struct {
> > > > 		__u64 start_addr; /* wo */
> > > > 		__u64 page_nr;   /* wo */
> > > > 	} system_memory;
> > > > };
> > > >
> > > > Device States
> > > > -------------
> > > > After migration is initialized, qemu will set the device state by writing
> > > > to the device_state field of the control region.
> > > >
> > > > Four states are defined for a VFIO device:
> > > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP
> > > >
> > > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > > >         commands from device driver.
> > > >         It is the default state that a VFIO device enters initially.
> > > >
> > > > STOP:  In this state, a VFIO device is deactivated and stops interacting
> > > >        with the device driver.
> > > >
> > > > LOGGING: a special state that CANNOT exist independently. It must be
> > > >        set alongside state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > > >        STOP & LOGGING).
> > > >        Qemu will set LOGGING state on in the .save_setup callback, then
> > > >        the vendor driver can start dirty data logging for device memory
> > > >        and system memory.
> > > >        LOGGING only impacts device/system memory. They return a whole
> > > >        snapshot outside LOGGING and dirty data since the last get
> > > >        operation inside LOGGING.
> > > >        Device config should always be accessible and return a whole
> > > >        config snapshot regardless of LOGGING state.
> > > >
> > > > Note:
> > > > The reason why RUNNING is the default state is that a device's active state
> > > > must not depend on the device state interface.
> > > > It is possible that the vfio_device_state_ctl region fails to get registered.
> > > > In that condition, a device needs to be in active state by default.
> > > >
> > > > Get Version & Get Caps
> > > > ----------------------
> > > > On migration init phase, qemu will probe the existence of device state
> > > > regions of vendor driver, then get version of the device state interface
> > > > from the r/w control region.
> > > >
> > > > Then it will probe VFIO device's data capability by reading caps field of
> > > > control region.
> > > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > > >         device memory in pre-copy and stop-and-copy phase. The data of
> > > >         device memory is held in the device memory region.
> > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > > >         produced by the VFIO device during pre-copy and stop-and-copy
> > > >         phase.
> > > >         The dirty bitmap of system memory is held in the dirty bitmap
> > > >         region.
> > > >
> > > > If qemu fails to find the two mandatory regions or the optional data
> > > > regions corresponding to the data caps, or if the version mismatches, it
> > > > will set up a migration blocker and disable live migration for the VFIO
> > > > device.
> > > >
> > > >
> > > > Flows to call device state interface for VFIO live migration
> > > > ------------------------------------------------------------
> > > >
> > > > Live migration save path:
> > > >
> > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > >
> > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > MIGRATION_STATUS_SAVE_SETUP
> > > >  |
> > > >  .save_setup callback -->
> > > >  get device memory size (whole snapshot size)
> > > >  get device memory buffer (whole snapshot data)
> > > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > > >  |
> > > > MIGRATION_STATUS_ACTIVE
> > > >  |
> > > >  .save_live_pending callback --> get device memory size (dirty data)
> > > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > > >  .log_sync callback --> get system memory dirty bitmap
> > > >  |
> > > > (vcpu stops) --> set device state -->
> > > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > > >  |
> > > > .save_live_complete_precopy callback -->
> > > >  get device memory size (dirty data)
> > > >  get device memory buffer (dirty data)
> > > >  get device config size (whole snapshot size)
> > > >  get device config buffer (whole snapshot data)
> > > >  |
> > > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > > MIGRATION_STATUS_COMPLETED
> > > >
> > > > MIGRATION_STATUS_CANCELLED or
> > > > MIGRATION_STATUS_FAILED
> > > >  |
> > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > >
> > > >
> > > > Live migration load path:
> > > >
> > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > >
> > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > > >  |
> > > > MIGRATION_STATUS_ACTIVE
> > > >  |
> > > > .load_state callback -->
> > > >  set device memory size, set device memory buffer, set device config size,
> > > >  set device config buffer
> > > >  |
> > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > MIGRATION_STATUS_COMPLETED
> > > >
> > > >
> > > >
> > > > In source VM side,
> > > > In precopy phase,
> > > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > > qemu will first get whole snapshot of device memory in .save_setup
> > > > callback, and then it will get total size of dirty data in device memory in
> > > > .save_live_pending callback by reading the device_memory.size field of the
> > > > control region.
> > > > Then in .save_live_iteration callback, it will get the buffer of device
> > > > memory's dirty data chunk by chunk from the device memory region by writing
> > > > pos & action (GET_BUFFER) to the device_memory.pos & device_memory.action
> > > > fields of the control region. (The size of each chunk is the size of the
> > > > device memory data region.)
> > > > .save_live_pending and .save_live_iteration may be called several times in
> > > > precopy phase to get dirty data in device memory.
> > > >
> > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > > the vendor driver's device state interface to get data from device memory.
> > > >
> > > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > > the .log_sync callback will get the system memory dirty bitmap from the
> > > > dirty bitmap region by writing system memory's start address, page count
> > > > and action (GET_BITMAP) to the "system_memory.start_addr",
> > > > "system_memory.page_nr", and "system_memory.action" fields of the control
> > > > region.
> > > > If page count passed in .log_sync callback is larger than the bitmap size
> > > > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > > > vendor driver's get system memory dirty bitmap interface.
> > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> > > > returns without calling the vendor driver.
> > > >
> > > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > > In the save_live_complete_precopy callback,
> > > > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > > get device memory size and get device memory buffer will be called again.
> > > > After that,
> > > > device config data is read from the device config region by reading
> > > > device_config.size of the control region and writing action (GET_BUFFER) to
> > > > device_config.action of the control region.
> > > > Then after migration completes, in the cleanup handler, LOGGING state will
> > > > be cleared (i.e. device state is set to STOP).
> > > > Clearing LOGGING state in cleanup handler is in consideration of the case
> > > > of "migration failed" and "migration cancelled". They can also leverage
> > > > the cleanup handler to unset LOGGING state.
> > > >
> > > >
> > > > References
> > > > ----------
> > > > 1. kernel side implementation of Device state interfaces:
> > > > https://patchwork.freedesktop.org/series/56876/
> > > >
> > > >
> > > > Yan Zhao (5):
> > > >   vfio/migration: define kernel interfaces
> > > >   vfio/migration: support device of device config capability
> > > >   vfio/migration: tracking of dirty page in system memory
> > > >   vfio/migration: turn on migration
> > > >   vfio/migration: support device memory capability
> > > >
> > > >  hw/vfio/Makefile.objs         |   2 +-
> > > >  hw/vfio/common.c              |  26 ++
> > > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > > >  hw/vfio/pci.c                 |  10 +-
> > > >  hw/vfio/pci.h                 |  26 +-
> > > >  include/hw/vfio/vfio-common.h |   1 +
> > > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > > >  create mode 100644 hw/vfio/migration.c
> > > >
> > > > --
> > > > 2.7.4
> > > >
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > _______________________________________________
> > > intel-gvt-dev mailing list
> > > intel-gvt-dev@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-20 11:28         ` [Qemu-devel] " Gonglei (Arei)
@ 2019-02-20 11:42           ` Cornelia Huck
  -1 siblings, 0 replies; 133+ messages in thread
From: Cornelia Huck @ 2019-02-20 11:42 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, Zhao Yan,
	Dr. David Alan Gilbert, alex.williamson

On Wed, 20 Feb 2019 11:28:46 +0000
"Gonglei (Arei)" <arei.gonglei@huawei.com> wrote:

> > -----Original Message-----
> > From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> > Sent: Wednesday, February 20, 2019 7:02 PM
> > To: Zhao Yan <yan.y.zhao@intel.com>
> > Cc: cjia@nvidia.com; kvm@vger.kernel.org; aik@ozlabs.ru;
> > Zhengxiao.zx@alibaba-inc.com; shuangtai.tst@alibaba-inc.com;
> > qemu-devel@nongnu.org; kwankhede@nvidia.com; eauger@redhat.com;
> > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > mlevitsk@redhat.com; pasic@linux.ibm.com; Gonglei (Arei)
> > <arei.gonglei@huawei.com>; felipe@nutanix.com; Ken.Xue@amd.com;
> > kevin.tian@intel.com; alex.williamson@redhat.com;
> > intel-gvt-dev@lists.freedesktop.org; changpeng.liu@intel.com;
> > cohuck@redhat.com; zhi.a.wang@intel.com; jonathan.davies@nutanix.com
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:  
> > > > * Yan Zhao (yan.y.zhao@intel.com) wrote:  
> > > > > This patchset enables VFIO devices to have live migration capability.
> > > > > Currently it does not support post-copy phase.
> > > > >
> > > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > > query.  

> > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > version of device?  Or say to a device with older firmware or perhaps
> > > > a device that has less device memory ?  
> > > Actually it's still an open question for VFIO migration. We need to think
> > > about whether it's better to check that in libvirt or QEMU (e.g. a device
> > > magic along with a version?).  
> 
> We must keep the hardware generation the same within one POD of a public
> cloud provider. But we are still considering live migration from a lower
> hardware generation to a higher one.

Agreed, lower->higher is the one direction that might make sense to
support.

But regardless of that, I think we need to make sure that incompatible
devices/versions fail directly instead of failing in a subtle, hard to
debug way. Might be useful to do some initial sanity checks in libvirt
as well.

How easy is it to obtain that information in a form that can be
consumed by higher layers? Can we find out the device type at least?
What about some kind of revision?



* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-19  8:50 ` [Qemu-devel] " Yan Zhao
@ 2019-02-20 11:56   ` Gonglei (Arei)
  -1 siblings, 0 replies; 133+ messages in thread
From: Gonglei (Arei) @ 2019-02-20 11:56 UTC (permalink / raw)
  To: Yan Zhao, alex.williamson, qemu-devel
  Cc: kevin.tian, Zhengxiao.zx, Ken.Xue, kvm, eskultet, ziye.yang,
	cohuck, shuangtai.tst, dgilbert, mlevitsk, pasic, aik, kwankhede,
	yi.l.liu, eauger, felipe, jonathan.davies, cjia,
	intel-gvt-dev@lists.freedesktop.org

Hi Yan,

Thanks for your work.

I have some suggestions or questions:

1) Would you add MSI-X mode support? If not, please add a check in vfio_pci_save_config(), like Nvidia's solution.
2) We should start VFIO devices before the vCPUs resume, so we can't rely on the VM state change handler completely.
3) We'd better support live migration rollback since there are many failure scenarios;
 registering a migration notifier is a good choice.
4) Four memory regions for live migration is too complicated IMHO.
5) About log sync, why not register log_global_start/stop in vfio_memory_listener?


Regards,
-Gonglei


> -----Original Message-----
> From: Yan Zhao [mailto:yan.y.zhao@intel.com]
> Sent: Tuesday, February 19, 2019 4:51 PM
> To: alex.williamson@redhat.com; qemu-devel@nongnu.org
> Cc: intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com; Gonglei (Arei)
> <arei.gonglei@huawei.com>; kvm@vger.kernel.org; Yan Zhao
> <yan.y.zhao@intel.com>
> Subject: [PATCH 0/5] QEMU VFIO live migration
> 
> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.
> 
> Device Data
> -----------
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
>         Every device is supposed to possess device config data.
>         Usually the device config is small (no bigger than 10 MB), and it
>         needs to be loaded in a certain strict order.
>         Therefore, device config only needs to be saved/loaded in the
>         stop-and-copy phase.
>         The data of device config is held in the device config region.
>         The size of device config data is smaller than or equal to that of
>         the device config region.
> 
> Device Memory: device's internal memory, standalone and outside system
>         memory. It is usually very big.
>         This kind of data needs to be saved/loaded in the pre-copy and
>         stop-and-copy phases.
>         The data of device memory is held in the device memory region.
>         The size of device memory is usually larger than that of the device
>         memory region, so QEMU needs to save/load it in chunks of the size
>         of the device memory region.
>         Not all devices have device memory; IGD, for example, only uses
>         system memory.
> 
> System memory dirty pages: If a device produces dirty pages in system
>         memory, it is able to get dirty bitmap for certain range of system
>         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
>         phase in .log_sync callback. By setting dirty bitmap in .log_sync
>         callback, dirty pages in system memory will be saved/loaded by RAM's
>         live migration code.
>         The dirty bitmap of system memory is held in dirty bitmap region.
>         If system memory range is larger than that dirty bitmap region can
>         hold, qemu will cut it into several chunks and get dirty bitmap in
>         succession.
> 
> 
> Device State Regions
> --------------------
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
>         Get access via read/write system call.
>         Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.
>         device config region: mandatory, holding data of device config
>         device memory region: optional, holding data of device memory
>         dirty bitmap region: optional, holding bitmap of system memory
>                             dirty pages
> 
> (The reason why four separate regions are defined is that the unit of the mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped).
> 
> 
> kernel device state interface [1]
> --------------------------------------
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> #define VFIO_DEVICE_STATE_RUNNING 0
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2
> 
> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
> 	__u32 version;		/* ro */
> 	__u32 device_state;	/* VFIO device state, wo */
> 	__u32 caps;		/* ro */
> 	struct {
> 		__u32 action;	/* wo, GET_BUFFER or SET_BUFFER */
> 		__u64 size;	/* rw */
> 	} device_config;
> 	struct {
> 		__u32 action;	/* wo, GET_BUFFER or SET_BUFFER */
> 		__u64 size;	/* rw */
> 		__u64 pos;	/* rw, offset into the total device memory buffer */
> 	} device_memory;
> 	struct {
> 		__u64 start_addr;	/* wo */
> 		__u64 page_nr;		/* wo */
> 	} system_memory;
> };
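As an illustrative sketch of how this control region is meant to be driven (the struct mirrors the layout above; the helper name and in-memory access are assumptions made here for illustration, since real code would access the control region through read/write system calls):

```c
#include <stdint.h>

/* Illustration-only mirror of the proposed control region layout above;
 * the real definition lives in linux-headers/linux/vfio.h in the patchset. */
#define VFIO_DEVICE_STATE_RUNNING 0
#define VFIO_DEVICE_STATE_STOP    1
#define VFIO_DEVICE_STATE_LOGGING 2

struct vfio_device_state_ctl {
    uint32_t version;
    uint32_t device_state;
    uint32_t caps;
    struct { uint32_t action; uint64_t size; } device_config;
    struct { uint32_t action; uint64_t size; uint64_t pos; } device_memory;
    struct { uint64_t start_addr; uint64_t page_nr; } system_memory;
};

/* In real code 'ctl' would stand for the control region (accessed via
 * read/write syscalls); a plain in-memory struct stands in for it here. */
static void set_device_state(struct vfio_device_state_ctl *ctl, uint32_t state)
{
    ctl->device_state = state;  /* wo field: the vendor driver acts on the write */
}
```

QEMU would call such a helper with RUNNING | LOGGING in .save_setup, STOP | LOGGING when the vCPUs stop, and STOP in .save_cleanup, matching the flows described later in this letter.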
> 
> Device States
> -------------
> After migration is initialized, QEMU will set the device state by writing to
> the device_state field of the control region.
> 
> Four states are defined for a VFIO device:
>         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP
> 
> RUNNING: In this state, a VFIO device is in active state ready to receive
>         commands from device driver.
>         It is the default state that a VFIO device enters initially.
> 
> STOP:  In this state, a VFIO device is deactivated and no longer interacts
>        with the device driver.
> 
> LOGGING: a special state that CANNOT exist independently. It must be set
>        alongside state RUNNING or STOP (i.e. RUNNING & LOGGING, or
>        STOP & LOGGING).
>        QEMU turns the LOGGING state on in the .save_setup callback, so the
>        vendor driver can start dirty data logging for device memory and
>        system memory.
>        LOGGING only impacts device/system memory: it returns a whole
>        snapshot outside LOGGING, and only dirty data since the last get
>        operation inside LOGGING.
>        Device config should always be accessible and return the whole config
>        snapshot regardless of LOGGING state.
> 
> Note:
> The reason why RUNNING is the default state is that a device's active state
> must not depend on the device state interface.
> It is possible that the vfio_device_state_ctl region fails to get registered.
> In that case, a device needs to be in the active state by default.
> 
> Get Version & Get Caps
> ----------------------
> In the migration init phase, QEMU will probe the existence of the vendor
> driver's device state regions, then get the version of the device state
> interface from the r/w control region.
> 
> Then it will probe the VFIO device's data capabilities by reading the caps
> field of the control region.
>         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
>         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
>         device memory in pre-copy and stop-and-copy phase. The data of
>         device memory is held in device memory region.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
>         produced by the VFIO device during pre-copy and stop-and-copy phase.
>         The dirty bitmap of system memory is held in dirty bitmap region.
> 
> If QEMU fails to find the two mandatory regions or the optional data regions
> corresponding to the data caps, or if the version mismatches, it will set up
> a migration blocker and disable live migration for the VFIO device.
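The probe-and-block decision described above could be sketched like this (a hedged illustration: the function name, the boolean probe results, and the single-version check are assumptions, not the patchset's actual code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the proposed interface above. */
#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
#define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY  1
#define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY  2

/* Returns true if live migration can be enabled, false if a migration
 * blocker should be set. 'have_memory_region'/'have_bitmap_region' stand
 * for the results of probing the optional data regions. */
static bool vfio_migration_compatible(uint32_t version, uint32_t caps,
                                      bool have_memory_region,
                                      bool have_bitmap_region)
{
    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION)
        return false;  /* version mismatch */
    if ((caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY) && !have_memory_region)
        return false;  /* cap advertised but its data region is missing */
    if ((caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY) && !have_bitmap_region)
        return false;
    return true;
}
```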
> 
> 
> Flows to call device state interface for VFIO live migration
> ------------------------------------------------------------
> 
> Live migration save path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_SAVE_SETUP
>  |
>  .save_setup callback -->
>  get device memory size (whole snapshot size)
>  get device memory buffer (whole snapshot data)
>  set device state --> VFIO_DEVICE_STATE_RUNNING &
> VFIO_DEVICE_STATE_LOGGING
>  |
> MIGRATION_STATUS_ACTIVE
>  |
>  .save_live_pending callback --> get device memory size (dirty data)
>  .save_live_iteration callback --> get device memory buffer (dirty data)
>  .log_sync callback --> get system memory dirty bitmap
>  |
> (vcpu stops) --> set device state -->
>  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
>  |
> .save_live_complete_precopy callback -->
>  get device memory size (dirty data)
>  get device memory buffer (dirty data)
>  get device config size (whole snapshot size)
>  get device config buffer (whole snapshot data)
>  |
> .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> MIGRATION_STATUS_COMPLETED
> 
> MIGRATION_STATUS_CANCELLED or
> MIGRATION_STATUS_FAILED
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> 
> 
> Live migration load path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
>  |
> MIGRATION_STATUS_ACTIVE
>  |
> .load state callback -->
>  set device memory size, set device memory buffer, set device config size,
>  set device config buffer
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_COMPLETED
> 
> 
> 
> On the source VM side, in the precopy phase,
> if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> QEMU will first get the whole snapshot of device memory in the .save_setup
> callback, and then it will get the total size of dirty data in device memory
> in the .save_live_pending callback by reading the device_memory.size field
> of the control region.
> Then in .save_live_iteration callback, it will get buffer of device memory's
> dirty data chunk by chunk from device memory region by writing pos &
> action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> control region. (size of each chunk is the size of device memory data
> region).
> .save_live_pending and .save_live_iteration may be called several times in
> precopy phase to get dirty data in device memory.
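The chunk-by-chunk transfer described above can be sketched as follows (the function name and the stand-in comment are illustrative assumptions; real code would also copy the chunk out of the mmapped device memory region):

```c
#include <stdint.h>

/* Saves 'pending' bytes of dirty device memory through a data region of
 * 'region_size' bytes, advancing 'pos' each iteration as described in the
 * cover letter. Returns the total number of bytes saved. */
static uint64_t save_device_memory_chunks(uint64_t pending, uint64_t region_size)
{
    uint64_t pos = 0;

    while (pos < pending) {
        uint64_t chunk = (pending - pos < region_size) ? pending - pos
                                                       : region_size;
        /* real code: write 'pos' and action GET_BUFFER to the
         * device_memory fields of the control region, then copy 'chunk'
         * bytes out of the mmapped device memory region */
        pos += chunk;
    }
    return pos;
}
```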
> 
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in the precopy
> phase like .save_setup, .save_live_pending and .save_live_iteration will not
> call the vendor driver's device state interface to get data from device
> memory.
> 
> In the precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY
> on, the .log_sync callback will get the system memory dirty bitmap from the
> dirty bitmap region by writing system memory's start address, page count and
> action (GET_BITMAP) to the "system_memory.start_addr",
> "system_memory.page_nr" and "system_memory.action" fields of the control
> region.
> If the page count passed to the .log_sync callback is larger than the bitmap
> size the dirty bitmap region supports, QEMU will cut it into chunks and call
> the vendor driver's get-dirty-bitmap interface once per chunk.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> returns without calling the vendor driver.
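The chunked .log_sync described above can be sketched like this (a hedged illustration: the function name and the 4K page-size assumption are mine, and the real code would also read the produced bits out of the dirty bitmap region):

```c
#include <stdint.h>

/* Splits a .log_sync over 'page_nr' pages into GET_BITMAP calls of at most
 * 'bitmap_pages' pages, matching what the dirty bitmap region can hold.
 * Returns how many calls into the vendor driver were made. */
static int log_sync_chunks(uint64_t start_addr, uint64_t page_nr,
                           uint64_t bitmap_pages)
{
    int calls = 0;

    while (page_nr > 0) {
        uint64_t nr = (page_nr < bitmap_pages) ? page_nr : bitmap_pages;
        /* real code: write start_addr/nr and action GET_BITMAP to the
         * system_memory fields of the control region, then read 'nr' bits
         * from the dirty bitmap region */
        start_addr += nr * 4096;  /* assuming 4K pages */
        page_nr -= nr;
        calls++;
    }
    return calls;
}
```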
> 
> In the stop-and-copy phase, the device state will be set to STOP & LOGGING
> first.
> In the .save_live_complete_precopy callback,
> if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> get device memory size and get device memory buffer will be called again.
> After that,
> device config data is read from the device config region by reading
> device_config.size of the control region and writing action (GET_BUFFER) to
> device_config.action of the control region.
> Then after migration completes, in the cleanup handler, the LOGGING state
> will be cleared (i.e. device state is set to STOP).
> Clearing the LOGGING state in the cleanup handler also covers the
> "migration failed" and "migration cancelled" cases, which likewise leverage
> the cleanup handler to unset the LOGGING state.
> 
> 
> References
> ----------
> 1. kernel side implementation of Device state interfaces:
> https://patchwork.freedesktop.org/series/56876/
> 
> 
> Yan Zhao (5):
>   vfio/migration: define kernel interfaces
>   vfio/migration: support device of device config capability
>   vfio/migration: tracking of dirty page in system memory
>   vfio/migration: turn on migration
>   vfio/migration: support device memory capability
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  26 ++
>  hw/vfio/migration.c           | 858
> ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/pci.h                 |  26 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  linux-headers/linux/vfio.h    | 260 +++++++++++++
>  7 files changed, 1174 insertions(+), 9 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 
> --
> 2.7.4



* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-20 11:42           ` [Qemu-devel] " Cornelia Huck
@ 2019-02-20 12:07             ` Gonglei (Arei)
  -1 siblings, 0 replies; 133+ messages in thread
From: Gonglei (Arei) @ 2019-02-20 12:07 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, Zhao Yan,
	Dr. David Alan Gilbert, alex.williamson



> -----Original Message-----
> From: Cornelia Huck [mailto:cohuck@redhat.com]
> Sent: Wednesday, February 20, 2019 7:43 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>; Zhao Yan
> <yan.y.zhao@intel.com>; cjia@nvidia.com; kvm@vger.kernel.org;
> aik@ozlabs.ru; Zhengxiao.zx@alibaba-inc.com; shuangtai.tst@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankhede@nvidia.com; eauger@redhat.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> mlevitsk@redhat.com; pasic@linux.ibm.com; felipe@nutanix.com;
> Ken.Xue@amd.com; kevin.tian@intel.com; alex.williamson@redhat.com;
> intel-gvt-dev@lists.freedesktop.org; changpeng.liu@intel.com;
> zhi.a.wang@intel.com; jonathan.davies@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Wed, 20 Feb 2019 11:28:46 +0000
> "Gonglei (Arei)" <arei.gonglei@huawei.com> wrote:
> 
> > > -----Original Message-----
> > > From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> > > Sent: Wednesday, February 20, 2019 7:02 PM
> > > To: Zhao Yan <yan.y.zhao@intel.com>
> > > Cc: cjia@nvidia.com; kvm@vger.kernel.org; aik@ozlabs.ru;
> > > Zhengxiao.zx@alibaba-inc.com; shuangtai.tst@alibaba-inc.com;
> > > qemu-devel@nongnu.org; kwankhede@nvidia.com; eauger@redhat.com;
> > > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > > mlevitsk@redhat.com; pasic@linux.ibm.com; Gonglei (Arei)
> > > <arei.gonglei@huawei.com>; felipe@nutanix.com; Ken.Xue@amd.com;
> > > kevin.tian@intel.com; alex.williamson@redhat.com;
> > > intel-gvt-dev@lists.freedesktop.org; changpeng.liu@intel.com;
> > > cohuck@redhat.com; zhi.a.wang@intel.com;
> jonathan.davies@nutanix.com
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > > > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > > > This patchset enables VFIO devices to have live migration capability.
> > > > > > Currently it does not support post-copy phase.
> > > > > >
> > > > > > It follows Alex's comments on last version of VFIO live migration
> patches,
> > > > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > > > query.
> 
> > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > a device that has less device memory ?
> > > > Actually it's still an open for VFIO migration. Need to think about
> > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > along with version ?).
> >
> > We must keep the hardware generation the same within one POD for public
> > cloud providers. But we are still thinking about live migration from
> > the lower generation of hardware to the higher generation.
> 
> Agreed, lower->higher is the one direction that might make sense to
> support.
> 
> But regardless of that, I think we need to make sure that incompatible
> devices/versions fail directly instead of failing in a subtle, hard to
> debug way. Might be useful to do some initial sanity checks in libvirt
> as well.
> 
> How easy is it to obtain that information in a form that can be
> consumed by higher layers? Can we find out the device type at least?
> What about some kind of revision?

We can provide an interface to query whether the VM supports live migration
in libvirt's prepare phase.

Can we get the revision_id from the vendor driver before invoking

register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
            revision_id,
            &savevm_vfio_handlers,
            vdev);

and then forbid live migration from higher gens to lower gens?

Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 133+ messages in thread


* Re: [PATCH 1/5] vfio/migration: define kernel interfaces
  2019-02-20  7:36       ` [Qemu-devel] " Zhao Yan
@ 2019-02-20 17:08         ` Cornelia Huck
  -1 siblings, 0 replies; 133+ messages in thread
From: Cornelia Huck @ 2019-02-20 17:08 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian, dgilbert,
	alex.williamson, intel-gvt-dev, changpeng.liu, zhi.a.wang,
	jonathan.davies

On Wed, 20 Feb 2019 02:36:36 -0500
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Tue, Feb 19, 2019 at 02:09:18PM +0100, Cornelia Huck wrote:
> > On Tue, 19 Feb 2019 16:52:14 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
(...)
> > > + *          Size of device config data is smaller than or equal to that of
> > > + *          device config region.  
> > 
> > Not sure if I understand that sentence correctly... but what if a
> > device has more config state than fits into this region? Is that
> > supposed to be covered by the device memory region? Or is this assumed
> > to be something so exotic that we don't need to plan for it?
> >   
> Device config data and the device config region are both provided by the
> vendor driver, so the vendor driver is always able to create a device
> config region large enough to hold the device config data.
> So, if a device has data that is better saved after device stop and
> saved/loaded in strict order, that data needs to be in the device config
> region. This kind of data is supposed to be small.
> If the device data can be saved/loaded several times, it can also be put
> into the device memory region.

So, it is the vendor driver's decision which device information should
go via which region? With the device config data supposed to be
saved/loaded in one go?

(...)
> > > +/* version number of the device state interface */
> > > +#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1  
> > 
> > Hm. Is this supposed to be backwards-compatible, should we need to bump
> > this?
> >  
> currently it is not backwards-compatible. we can discuss that.

It might be useful if we discover that we need some extensions. But I'm
not sure how much work it would be.

(...)
> > > +/*
> > > + * DEVICE STATES
> > > + *
> > > + * Four states are defined for a VFIO device:
> > > + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> > > + * They can be set by writing to device_state field of
> > > + * vfio_device_state_ctl region.  
> > 
> > Who controls this? Userspace?  
> 
> Yes. Userspace notifies vendor driver to do the state switching.

Might be good to mention this (just to make it obvious).

> > > + * LOGGING state is a special state that it CANNOT exist
> > > + * independently.  
> > 
> > So it's not a state, but rather a modifier?
> >   
> yes. or think of LOGGING/not LOGGING as bit 1 of a device state,
> whereas RUNNING/STOPPED is bit 0 of a device state.
> They have to be read as a whole.

So it is (on a bit level):
RUNNING -> 00
STOPPED -> 01
LOGGING/RUNNING -> 10
LOGGING/STOPPED -> 11
 
> > > + * It must be set alongside with state RUNNING or STOP, i.e,
> > > + * RUNNING & LOGGING, STOP & LOGGING.
> > > + * It is used for dirty data logging both for device memory
> > > + * and system memory.
> > > + *
> > > + * LOGGING only impacts device/system memory. In LOGGING state, get buffer
> > > + * of device memory returns dirty pages since last call; outside LOGGING
> > > + * state, get buffer of device memory returns whole snapshot of device
> > > + * memory. system memory's dirty page is only available in LOGGING state.
> > > + *
> > > + * Device config should be always accessible and return whole config snapshot
> > > + * regardless of LOGGING state.
> > > + * */
> > > +#define VFIO_DEVICE_STATE_RUNNING 0
> > > +#define VFIO_DEVICE_STATE_STOP 1
> > > +#define VFIO_DEVICE_STATE_LOGGING 2

This makes it look a bit like LOGGING were an individual state, while 2
is in reality LOGGING/RUNNING... not sure how to make that more
obvious. Maybe (as we are dealing with a u32):

#define VFIO_DEVICE_STATE_RUNNING 0x00000000
#define VFIO_DEVICE_STATE_STOPPED 0x00000001
#define VFIO_DEVICE_STATE_LOGGING_RUNNING 0x00000002
#define VFIO_DEVICE_STATE_LOGGING_STOPPED 0x00000003
#define VFIO_DEVICE_STATE_LOGGING_MASK 0x00000002
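With that encoding, LOGGING and RUNNING checks reduce to simple bit tests, e.g. (a sketch using the proposed names; the helper functions are made up):

```c
#include <stdint.h>

#define VFIO_DEVICE_STATE_RUNNING         0x00000000
#define VFIO_DEVICE_STATE_STOPPED         0x00000001
#define VFIO_DEVICE_STATE_LOGGING_RUNNING 0x00000002
#define VFIO_DEVICE_STATE_LOGGING_STOPPED 0x00000003
#define VFIO_DEVICE_STATE_LOGGING_MASK    0x00000002

/* LOGGING is a modifier carried in bit 1, so it is a mask test. */
static int state_is_logging(uint32_t state)
{
    return (state & VFIO_DEVICE_STATE_LOGGING_MASK) != 0;
}

/* RUNNING vs STOPPED lives in bit 0: bit clear means running. */
static int state_is_running(uint32_t state)
{
    return (state & 0x00000001) == 0;
}
```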

> > > +
> > > +/* action to get data from device memory or device config
> > > + * the action is written to device state's control region, and data is read
> > > + * from device memory region or device config region.
> > > + * Each time before reading device memory region or device config region,
> > > + * action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to be written to
> > > + * the action field in control region. That is because device memory and
> > > + * device config region are mmapped into user space. vendor driver has to
> > > + * be notified of the GET_BUFFER action in advance.
> > > + */
> > > +#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > +
> > > +/* action to set data to device memory or device config
> > > + * the action is written to device state's control region, and data is
> > > + * written to device memory region or device config region.
> > > + * Each time after writing to device memory region or device config region,
> > > + * action VFIO_DEVICE_DATA_ACTION_SET_BUFFER is required to be written to
> > > + * the action field in control region. That is because device memory and
> > > + * device config region are mmapped into user space. vendor driver has to
> > > + * be notified of the SET_BUFFER action after data is written.
> > > + */
> > > +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2  
> > 
> > Let me describe this in my own words to make sure that I understand
> > this correctly.
> > 
> > - The actions are set by userspace to notify the kernel that it is
> >   going to get data or that it has just written data.
> > - This is needed as a notification that the mmapped data should not be
> >   changed resp. just has changed.  
> We need this notification because when userspace reads the mmapped data,
> it reads through the ptr returned from mmap(). When userspace reads that
> ptr, there is no page fault or read/write system call, so the vendor
> driver does not know whether a read/write operation happens or not.
> Therefore, before userspace reads the ptr from mmap, it first writes the
> action field in the control region (through a write system call), and
> the vendor driver will not return from that write system call until the
> data is prepared.
> 
> When userspace writes to that ptr from mmap, it writes data to the data
> region first, then writes the action field in the control region
> (through a write system call) to notify the vendor driver. The vendor
> driver returns from the system call after it has copied the buffer
> completely.
> > 
> > So, how does the kernel know whether the read action has finished resp.
> > whether the write action has started? Even if userspace reads/writes it
> > as a whole.
> >   
> The kernel does not touch the data region except in response to the
> "action" write system call.

Thanks for the explanation, that makes sense.
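So the handshake can be modeled like this (a toy model, not the real vendor driver; memcpy stands in for the driver's buffer preparation/consumption):

```c
#include <string.h>

enum { ACTION_GET_BUFFER = 1, ACTION_SET_BUFFER = 2 };

struct model {
    char data[16];       /* stands in for the mmapped data region */
    char device[16];     /* stands in for internal device state */
};

/* Models the write() to the control region's action field: it does not
 * "return" until the vendor driver has prepared (GET) or consumed (SET)
 * the data region.  Outside this call the driver never touches 'data'. */
static void write_action(struct model *m, int action)
{
    if (action == ACTION_GET_BUFFER) {
        memcpy(m->data, m->device, sizeof(m->data));  /* prepare for read */
    } else if (action == ACTION_SET_BUFFER) {
        memcpy(m->device, m->data, sizeof(m->device)); /* consume written data */
    }
}
```

Userspace orders its accesses around that write: action first, then read the mmapped region; or write the mmapped region, then the action.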
(...)

^ permalink raw reply	[flat|nested] 133+ messages in thread


* Re: [PATCH 2/5] vfio/migration: support device of device config capability
  2019-02-19 14:37     ` [Qemu-devel] " Cornelia Huck
@ 2019-02-20 22:54       ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20 22:54 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian, dgilbert,
	alex.williamson, intel-gvt-dev, changpeng.liu, zhi.a.wang,
	jonathan.davies

On Tue, Feb 19, 2019 at 03:37:24PM +0100, Cornelia Huck wrote:
> On Tue, 19 Feb 2019 16:52:27 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > Device config is the default data that every device should have, so
> > device config capability is on by default and need not be set.
> > 
> > - Currently two type of resources are saved/loaded for device of device
> >   config capability:
> >   General PCI config data, and Device config data.
> >   They are copied as a whole when precopy is stopped.
> > 
> > Migration setup flow:
> > - Setup device state regions, check its device state version and capabilities.
> >   Mmap Device Config Region and Dirty Bitmap Region, if available.
> > - If device state regions are failed to get setup, a migration blocker is
> >   registered instead.
> > - Added SaveVMHandlers to register device state save/load handlers.
> > - Register VM state change handler to set device's running/stop states.
> > - On migration startup on source machine, set device's state to
> >   VFIO_DEVICE_STATE_LOGGING
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> > ---
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |   1 -
> >  hw/vfio/pci.h                 |  25 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  5 files changed, 659 insertions(+), 3 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > index 8b3f664..f32ff19 100644
> > --- a/hw/vfio/Makefile.objs
> > +++ b/hw/vfio/Makefile.objs
> > @@ -1,6 +1,6 @@
> >  ifeq ($(CONFIG_LINUX), y)
> >  obj-$(CONFIG_SOFTMMU) += common.o
> > -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> > +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
> 
> I think you want to split the migration code: The type-independent
> code, and the pci-specific code.
>
ok. actually, only the saving/loading of PCI generic config data is
pci-specific; the data getting/setting through the device state
interfaces is type-independent.

> >  obj-$(CONFIG_VFIO_CCW) += ccw.o
> >  obj-$(CONFIG_SOFTMMU) += platform.o
> >  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > new file mode 100644
> > index 0000000..16d6395
> > --- /dev/null
> > +++ b/hw/vfio/migration.c
> > @@ -0,0 +1,633 @@
> > +#include "qemu/osdep.h"
> > +
> > +#include "hw/vfio/vfio-common.h"
> > +#include "migration/blocker.h"
> > +#include "migration/register.h"
> > +#include "qapi/error.h"
> > +#include "pci.h"
> > +#include "sysemu/kvm.h"
> > +#include "exec/ram_addr.h"
> > +
> > +#define VFIO_SAVE_FLAG_SETUP 0
> > +#define VFIO_SAVE_FLAG_PCI 1
> > +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> > +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> > +#define VFIO_SAVE_FLAG_CONTINUE 8
> > +
> > +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> > +        VFIORegion *region, uint32_t subtype, const char *name)
> 
> This function looks like it should be more generic and e.g. take a
> VFIODevice instead of a VFIOPCIDevice as argument.
> 
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    struct vfio_region_info *info;
> > +    int ret;
> > +
> > +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> > +            subtype, &info);
> > +    if (ret) {
> > +        error_report("Failed to get info of region %s", name);
> > +        return ret;
> > +    }
> > +
> > +    if (vfio_region_setup(OBJECT(vdev), vbasedev,
> > +            region, info->index, name)) {
> > +        error_report("Failed to setup migration region %s", name);
> > +        return ret;
> > +    }
> > +
> > +    if (vfio_region_mmap(region)) {
> > +        error_report("Failed to mmap migration region %s", name);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> > +{
> > +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> > +}
> > +
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> > +{
> > +   return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> > +}
> 
> These two as well. The migration structure should probably hang off the
> VFIODevice instead.
>
ok.

> > +
> > +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> > +{
> > +    bool mmaped = true;
> > +    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> > +            (region->size != region->mmaps[0].size) ||
> > +            (region->mmaps[0].mmap == NULL)) {
> > +        mmaped = false;
> > +    }
> > +
> > +    return mmaped;
> > +}
> 
> s/mmaped/mmapped/ ?

yes :)
> 
> > +
> > +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> 
> This looks like it should not depend on pci, either.
> 
right.

> > +    uint64_t len;
> > +    int sz;
> > +
> > +    sz = sizeof(len);
> > +    if (pread(vbasedev->fd, &len, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to get length of device config");
> > +        return -1;
> > +    }
> > +    if (len > region_config->size) {
> > +        error_report("vfio: Error device config length");
> > +        return -1;
> > +    }
> > +    vdev->migration->devconfig_size = len;
> > +
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> 
> Ditto. Also for the functions below.
> 
> > +    int sz;
> > +
> > +    if (size > region_config->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(size);
> > +    if (pwrite(vbasedev->fd, &size, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to set length of device config");
> > +        return -1;
> > +    }
> > +    vdev->migration->devconfig_size = size;
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +    uint64_t len = vdev->migration->devconfig_size;
> > +
> > +    qemu_put_be64(f, len);
> 
> Why big endian? (Generally, do we need any endianness considerations?)
> 
I think big endian is the byte order QEMU uses for its saved stream.
As long as qemu_put and qemu_get use the same endianness, there should
be no problem. Do you agree?

> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.action))
> > +            != sz) {
> > +        error_report("vfio: action failure for device config get buffer");
> > +        return -1;
> > +    }
> 
> Might make sense to wrap this into a set_action() helper that takes a
> SET_BUFFER/GET_BUFFER argument.
> 
right. it makes sense :)
> > +
> > +    if (!vfio_device_state_region_mmaped(region_config)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migration");
> > +            return -1;
> > +        }
> > +        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
> > +            error_report("vfio: Failed to read device config buffer");
> > +            return -1;
> > +        }
> > +        qemu_put_buffer(f, buf, len);
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_config->mmaps[0].mmap;
> > +        qemu_put_buffer(f, dest, len);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> > +                            QEMUFile *f, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    vfio_set_device_config_size(vdev, len);
> > +
> > +    if (!vfio_device_state_region_mmaped(region_config)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migration");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_config->fd_offset) != len) {
> > +            error_report("vfio: Failed to write device config buffer");
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_config->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.action))
> > +            != sz) {
> > +        error_report("vfio: action failure for device config set buffer");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> > +        uint64_t start_addr, uint64_t page_nr)
> > +{
> > +
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_bitmap =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> > +    unsigned long bitmap_size =
> > +                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
> > +    uint32_t sz;
> > +
> > +    struct {
> > +        __u64 start_addr;
> > +        __u64 page_nr;
> > +    } system_memory;
> > +    system_memory.start_addr = start_addr;
> > +    system_memory.page_nr = page_nr;
> > +    sz = sizeof(system_memory);
> > +    if (pwrite(vbasedev->fd, &system_memory, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, system_memory))
> > +            != sz) {
> > +        error_report("vfio: Failed to set system memory range for dirty pages");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_bitmap)) {
> > +        void *bitmap = g_malloc0(bitmap_size);
> > +
> > +        if (pread(vbasedev->fd, bitmap, bitmap_size,
> > +                    region_bitmap->fd_offset) != bitmap_size) {
> > +            error_report("vfio: Failed to read dirty bitmap data");
> > +            return -1;
> > +        }
> > +
> > +        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
> > +
> > +        g_free(bitmap);
> > +    } else {
> > +        cpu_physical_memory_set_dirty_lebitmap(
> > +                    region_bitmap->mmaps[0].mmap,
> > +                    start_addr, page_nr);
> > +    }
> > +    return 0;
> > +}
> > +
> > +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> > +        uint64_t start_addr, uint64_t page_nr)
> > +{
> > +    VFIORegion *region_bitmap =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> > +    unsigned long chunk_size = region_bitmap->size;
> > +    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
> > +                                BITS_PER_LONG;
> > +
> > +    uint64_t cnt_left;
> > +    int rc = 0;
> > +
> > +    cnt_left = page_nr;
> > +
> > +    while (cnt_left >= chunk_pg_nr) {
> > +        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
> > +        if (rc) {
> > +            goto exit;
> > +        }
> > +        cnt_left -= chunk_pg_nr;
> > +        start_addr += chunk_pg_nr * TARGET_PAGE_SIZE;
> > +    }
> > +    rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
> > +
> > +exit:
> > +    return rc;
> > +}
> > +
> > +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> > +        uint32_t dev_state)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    uint32_t sz = sizeof(dev_state);
> > +
> > +    if (!vdev->migration) {
> > +        return -1;
> > +    }
> > +
> > +    if (pwrite(vbasedev->fd, &dev_state, sz,
> > +              region->fd_offset +
> > +              offsetof(struct vfio_device_state_ctl, device_state))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device state %d", dev_state);
> 
> Can the kernel reject this if a state transition is not allowed (or are
> all transitions allowed?)
> 
Yes, the kernel can reject a state transition if it's not allowed, but
currently all transitions are allowed. Maybe a check for self-to-self
transitions is needed in the kernel.

> > +        return -1;
> > +    }
> > +    vdev->migration->device_state = dev_state;
> > +    return 0;
> > +}
> > +
> > +static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +
> > +    uint32_t caps;
> > +    uint32_t size = sizeof(caps);
> > +
> > +    if (pread(vbasedev->fd, &caps, size,
> > +                region->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, caps))
> > +            != size) {
> > +        error_report("%s Failed to read data caps of device states",
> > +                vbasedev->name);
> > +        return -1;
> > +    }
> > +    vdev->migration->data_caps = caps;
> > +    return 0;
> > +}
> > +
> > +
> > +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +
> > +    uint32_t version;
> > +    uint32_t size = sizeof(version);
> > +
> > +    if (pread(vbasedev->fd, &version, size,
> > +                region->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, version))
> > +            != size) {
> > +        error_report("%s Failed to read version of device state interfaces",
> > +                vbasedev->name);
> > +        return -1;
> > +    }
> > +
> > +    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > +        error_report("%s migration version mismatch, expected version is %d",
> > +                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);
> 
> So, we require an exact match... or should we allow to extend the
> interface in an backwards-compatible way, in which case we'd require
> (QEMU interface version) <= (kernel interface version)?
>
Currently yes, we require an exact match. We can discuss that.
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
> > +{
> > +    VFIOPCIDevice *vdev = pv;
> > +    uint32_t dev_state = vdev->migration->device_state;
> > +
> > +    if (!running) {
> > +        dev_state |= VFIO_DEVICE_STATE_STOP;
> > +    } else {
> > +        dev_state &= ~VFIO_DEVICE_STATE_STOP;
> > +    }
> > +
> > +    vfio_set_device_state(vdev, dev_state);
> > +}
> > +
> > +static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> > +                                   uint64_t max_size,
> > +                                   uint64_t *res_precopy_only,
> > +                                   uint64_t *res_compatible,
> > +                                   uint64_t *res_post_copy_only)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +
> > +    if (!vfio_device_data_cap_device_memory(vdev)) {
> > +        return;
> > +    }
> > +
> > +    return;
> > +}
> > +
> > +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +
> > +    if (!vfio_device_data_cap_device_memory(vdev)) {
> > +        return 0;
> > +    }
> > +
> > +    return 0;
> > +}
> 
> These look a bit weird...
>
Patch 5 (adding device memory cap support) will make them look better :)

> > +
> > +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > +    bool msi_64bit;
> > +
> > +    /* restore PCI BAR configuration */
> > +    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +            ctl & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        bar_cfg = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> > +    }
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> > +
> > +    /* restore msi configuration */
> > +    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> > +
> > +    vfio_pci_write_config(&vdev->pdev,
> > +            pdev->msi_cap + PCI_MSI_FLAGS,
> > +            ctl & ~PCI_MSI_FLAGS_ENABLE, 2);
> > +
> > +    msi_lo = qemu_get_be32(f);
> > +    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> > +
> > +    if (msi_64bit) {
> > +        msi_hi = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                msi_hi, 4);
> > +    }
> > +    msi_data = qemu_get_be32(f);
> > +    vfio_pci_write_config(pdev,
> > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +            msi_data, 2);
> > +
> > +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +            ctl | PCI_MSI_FLAGS_ENABLE, 2);
> > +
> 
> Ok, this function is indeed pci-specific and probably should be moved
> to the vfio-pci code (other types could hook themselves up in the same
> place, then).
> 
Yes, this is the only PCI-specific code.
Maybe use VFIO_DEVICE_TYPE_PCI as a flag to decide whether to save/load
PCI config data?
Or, as Dave said, put the saving/loading of PCI config data into the
VMStateDescription interface?

> > +}
> > +
> > +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    int flag;
> > +    uint64_t len;
> > +    int ret = 0;
> > +
> > +    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    do {
> > +        flag = qemu_get_byte(f);
> > +
> > +        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
> > +        case VFIO_SAVE_FLAG_SETUP:
> > +            break;
> > +        case VFIO_SAVE_FLAG_PCI:
> > +            vfio_pci_load_config(vdev, f);
> > +            break;
> > +        case VFIO_SAVE_FLAG_DEVCONFIG:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_config(vdev, f, len);
> > +            break;
> > +        default:
> > +            ret = -EINVAL;
> > +        }
> > +    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
> > +
> > +    return ret;
> > +}
> > +
> > +static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > +    bool msi_64bit;
> > +
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > +        qemu_put_be32(f, bar_cfg);
> > +    }
> > +
> > +    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
> > +
> > +    msi_lo = pci_default_read_config(pdev,
> > +            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > +    qemu_put_be32(f, msi_lo);
> > +
> > +    if (msi_64bit) {
> > +        msi_hi = pci_default_read_config(pdev,
> > +                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                4);
> > +        qemu_put_be32(f, msi_hi);
> > +    }
> > +
> > +    msi_data = pci_default_read_config(pdev,
> > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +            2);
> > +    qemu_put_be32(f, msi_data);
> > +
> > +}
> > +
> > +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    int rc = 0;
> > +
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> > +    vfio_pci_save_config(vdev, f);
> > +
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
> > +    rc += vfio_get_device_config_size(vdev);
> > +    rc += vfio_save_data_device_config(vdev, f);
> > +
> > +    return rc;
> > +}
> > +
> > +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> > +            VFIO_DEVICE_STATE_LOGGING);
> > +    return 0;
> > +}
> > +
> > +static int vfio_load_setup(QEMUFile *f, void *opaque)
> > +{
> > +    return 0;
> > +}
> > +
> > +static void vfio_save_cleanup(void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    uint32_t dev_state = vdev->migration->device_state;
> > +
> > +    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
> > +
> > +    vfio_set_device_state(vdev, dev_state);
> > +}
> 
> These look like they should be type-independent, again.
>
right:)
> > +
> > +static SaveVMHandlers savevm_vfio_handlers = {
> > +    .save_setup = vfio_save_setup,
> > +    .save_live_pending = vfio_save_live_pending,
> > +    .save_live_iterate = vfio_save_iterate,
> > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> > +    .save_cleanup = vfio_save_cleanup,
> > +    .load_setup = vfio_load_setup,
> > +    .load_state = vfio_load_state,
> > +};
> > +
> > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> > +{
> > +    int ret;
> > +    Error *local_err = NULL;
> > +    vdev->migration = g_new0(VFIOMigration, 1);
> > +
> > +    if (vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
> > +              "device-state-ctl")) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_check_devstate_version(vdev)) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_get_device_data_caps(vdev)) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
> > +              "device-state-data-device-config")) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        error_report("No support of device memory data cap yet");
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_data_cap_system_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
> > +              "device-state-data-dirtybitmap")) {
> > +        goto error;
> > +    }
> > +
> > +    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> > +
> > +    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
> > +            VFIO_DEVICE_STATE_INTERFACE_VERSION,
> > +            &savevm_vfio_handlers,
> > +            vdev);
> > +
> > +    vdev->migration->vm_state =
> > +        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
> > +
> > +    return 0;
> > +error:
> > +    error_setg(&vdev->migration_blocker,
> > +            "VFIO device doesn't support migration");
> > +    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
> > +    if (local_err) {
> > +        error_propagate(errp, local_err);
> > +        error_free(vdev->migration_blocker);
> > +    }
> > +
> > +    g_free(vdev->migration);
> > +    vdev->migration = NULL;
> > +
> > +    return ret;
> > +}
> > +
> > +void vfio_migration_finalize(VFIOPCIDevice *vdev)
> > +{
> > +    if (vdev->migration) {
> > +        int i;
> > +        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
> > +        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
> > +        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
> > +            vfio_region_finalize(&vdev->migration->region[i]);
> > +        }
> > +        g_free(vdev->migration);
> > +        vdev->migration = NULL;
> > +    } else if (vdev->migration_blocker) {
> > +        migrate_del_blocker(vdev->migration_blocker);
> > +        error_free(vdev->migration_blocker);
> > +    }
> > +}
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c0cb1ec..b8e006b 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -37,7 +37,6 @@
> >  
> >  #define MSIX_CAP_LENGTH 12
> >  
> > -#define TYPE_VFIO_PCI "vfio-pci"
> 
> Why do you need to move this? That looks like a sign that the layering
> needs work.
>
I moved it so I can pass it as an idstr to register_savevm_live.
It seems unnecessary if we make the code PCI-independent.
Thanks :)

> >  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> >  
> >  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index b1ae4c0..4b7b1bb 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -19,6 +19,7 @@
> >  #include "qemu/event_notifier.h"
> >  #include "qemu/queue.h"
> >  #include "qemu/timer.h"
> > +#include "sysemu/sysemu.h"
> >  
> >  #define PCI_ANY_ID (~0)
> >  
> > @@ -56,6 +57,21 @@ typedef struct VFIOBAR {
> >      QLIST_HEAD(, VFIOQuirk) quirks;
> >  } VFIOBAR;
> >  
> > +enum {
> > +    VFIO_DEVSTATE_REGION_CTL = 0,
> > +    VFIO_DEVSTATE_REGION_DATA_CONFIG,
> > +    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
> > +    VFIO_DEVSTATE_REGION_DATA_BITMAP,
> > +    VFIO_DEVSTATE_REGION_NUM,
> > +};
> > +typedef struct VFIOMigration {
> > +    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
> > +    uint32_t data_caps;
> > +    uint32_t device_state;
> > +    uint64_t devconfig_size;
> > +    VMChangeStateEntry *vm_state;
> > +} VFIOMigration;
> > +
> >  typedef struct VFIOVGARegion {
> >      MemoryRegion mem;
> >      off_t offset;
> > @@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
> >      VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
> >      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
> >      void *igd_opregion;
> > +    VFIOMigration *migration;
> 
> As said, it would probably be better to hang this off VFIODevice.
> 
ok.

> > +    Error *migration_blocker;
> >      PCIHostDeviceAddress host;
> >      EventNotifier err_notifier;
> >      EventNotifier req_notifier;
> > @@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> >  void vfio_display_reset(VFIOPCIDevice *vdev);
> >  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> >  void vfio_display_finalize(VFIOPCIDevice *vdev);
> > -
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
> > +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> > +         uint64_t start_addr, uint64_t page_nr);
> > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
> > +void vfio_migration_finalize(VFIOPCIDevice *vdev);
> 
> And the interfaces should be in vfio-common.
>
ok.

> >  #endif /* HW_VFIO_VFIO_PCI_H */
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index 1b434d0..ed43613 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -32,6 +32,7 @@
> >  #endif
> >  
> >  #define VFIO_MSG_PREFIX "vfio %s: "
> > +#define TYPE_VFIO_PCI "vfio-pci"
> >  
> >  enum {
> >      VFIO_DEVICE_TYPE_PCI = 0,
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] vfio/migration: support device of device config capability
@ 2019-02-20 22:54       ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-20 22:54 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: alex.williamson, qemu-devel, intel-gvt-dev, Zhengxiao.zx,
	yi.l.liu, eskultet, ziye.yang, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue, kwankhede, kevin.tian,
	cjia, arei.gonglei, kvm

On Tue, Feb 19, 2019 at 03:37:24PM +0100, Cornelia Huck wrote:
> On Tue, 19 Feb 2019 16:52:27 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > Device config is the default data that every device should have, so the
> > device config capability is on by default and does not need to be set.
> > 
> > - Currently two types of resources are saved/loaded for a device with the
> >   device config capability:
> >   general PCI config data and device config data.
> >   They are copied as a whole when precopy stops.
> > 
> > Migration setup flow:
> > - Setup device state regions, check its device state version and capabilities.
> >   Mmap Device Config Region and Dirty Bitmap Region, if available.
> > - If device state regions fail to get set up, a migration blocker is
> >   registered instead.
> > - Added SaveVMHandlers to register device state save/load handlers.
> > - Register VM state change handler to set device's running/stop states.
> > - On migration startup on source machine, set device's state to
> >   VFIO_DEVICE_STATE_LOGGING
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> > ---
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |   1 -
> >  hw/vfio/pci.h                 |  25 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  5 files changed, 659 insertions(+), 3 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > index 8b3f664..f32ff19 100644
> > --- a/hw/vfio/Makefile.objs
> > +++ b/hw/vfio/Makefile.objs
> > @@ -1,6 +1,6 @@
> >  ifeq ($(CONFIG_LINUX), y)
> >  obj-$(CONFIG_SOFTMMU) += common.o
> > -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> > +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o
> 
> I think you want to split the migration code: The type-independent
> code, and the pci-specific code.
>
OK. Actually, only the saving/loading of generic PCI config data is now
PCI-specific; the data getting/setting through the device state
interfaces is type-independent.

> >  obj-$(CONFIG_VFIO_CCW) += ccw.o
> >  obj-$(CONFIG_SOFTMMU) += platform.o
> >  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > new file mode 100644
> > index 0000000..16d6395
> > --- /dev/null
> > +++ b/hw/vfio/migration.c
> > @@ -0,0 +1,633 @@
> > +#include "qemu/osdep.h"
> > +
> > +#include "hw/vfio/vfio-common.h"
> > +#include "migration/blocker.h"
> > +#include "migration/register.h"
> > +#include "qapi/error.h"
> > +#include "pci.h"
> > +#include "sysemu/kvm.h"
> > +#include "exec/ram_addr.h"
> > +
> > +#define VFIO_SAVE_FLAG_SETUP 0
> > +#define VFIO_SAVE_FLAG_PCI 1
> > +#define VFIO_SAVE_FLAG_DEVCONFIG 2
> > +#define VFIO_SAVE_FLAG_DEVMEMORY 4
> > +#define VFIO_SAVE_FLAG_CONTINUE 8
> > +
> > +static int vfio_device_state_region_setup(VFIOPCIDevice *vdev,
> > +        VFIORegion *region, uint32_t subtype, const char *name)
> 
> This function looks like it should be more generic and e.g. take a
> VFIODevice instead of a VFIOPCIDevice as argument.
> 
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    struct vfio_region_info *info;
> > +    int ret;
> > +
> > +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_DEVICE_STATE,
> > +            subtype, &info);
> > +    if (ret) {
> > +        error_report("Failed to get info of region %s", name);
> > +        return ret;
> > +    }
> > +
> > +    if (vfio_region_setup(OBJECT(vdev), vbasedev,
> > +            region, info->index, name)) {
> > +        error_report("Failed to setup migration region %s", name);
> > +        return ret;
> > +    }
> > +
> > +    if (vfio_region_mmap(region)) {
> > +        error_report("Failed to mmap migration region %s", name);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev)
> > +{
> > +    return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY);
> > +}
> > +
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev)
> > +{
> > +    return !!(vdev->migration->data_caps & VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY);
> > +}
> 
> These two as well. The migration structure should probably hang off the
> VFIODevice instead.
>
ok.

> > +
> > +static bool vfio_device_state_region_mmaped(VFIORegion *region)
> > +{
> > +    bool mmaped = true;
> > +    if (region->nr_mmaps != 1 || region->mmaps[0].offset ||
> > +            (region->size != region->mmaps[0].size) ||
> > +            (region->mmaps[0].mmap == NULL)) {
> > +        mmaped = false;
> > +    }
> > +
> > +    return mmaped;
> > +}
> 
> s/mmaped/mmapped/ ?

yes :)
> 
> > +
> > +static int vfio_get_device_config_size(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> 
> This looks like it should not depend on pci, either.
> 
right.

> > +    uint64_t len;
> > +    int sz;
> > +
> > +    sz = sizeof(len);
> > +    if (pread(vbasedev->fd, &len, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to get length of device config");
> > +        return -1;
> > +    }
> > +    if (len > region_config->size) {
> > +        error_report("vfio: Invalid device config length");
> > +        return -1;
> > +    }
> > +    vdev->migration->devconfig_size = len;
> > +
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_device_config_size(VFIOPCIDevice *vdev, uint64_t size)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> 
> Ditto. Also for the functions below.
> 
> > +    int sz;
> > +
> > +    if (size > region_config->size) {
> > +        return -1;
> > +    }
> > +
> > +    sz = sizeof(size);
> > +    if (pwrite(vbasedev->fd, &size, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.size))
> > +            != sz) {
> > +        error_report("vfio: Failed to set length of device config");
> > +        return -1;
> > +    }
> > +    vdev->migration->devconfig_size = size;
> > +    return 0;
> > +}
> > +
> > +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > +    uint64_t len = vdev->migration->devconfig_size;
> > +
> > +    qemu_put_be64(f, len);
> 
> Why big endian? (Generally, do we need any endianness considerations?)
> 
I think big endian is the byte order QEMU uses for its saved stream.
As long as qemu_put and qemu_get use the same endianness, there should
be no problem. Do you agree?

> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.action))
> > +            != sz) {
> > +        error_report("vfio: action failure for device config get buffer");
> > +        return -1;
> > +    }
> 
> Might make sense to wrap this into a set_action() helper that takes a
> SET_BUFFER/GET_BUFFER argument.
> 
right. it makes sense :)
> > +
> > +    if (!vfio_device_state_region_mmaped(region_config)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migration");
> > +            return -1;
> > +        }
> > +        if (pread(vbasedev->fd, buf, len, region_config->fd_offset) != len) {
> > +            error_report("vfio: Failed to read device config buffer");
> > +            return -1;
> > +        }
> > +        qemu_put_buffer(f, buf, len);
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_config->mmaps[0].mmap;
> > +        qemu_put_buffer(f, dest, len);
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> > +                            QEMUFile *f, uint64_t len)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_config =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > +    void *dest;
> > +    uint32_t sz;
> > +    uint8_t *buf = NULL;
> > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> > +
> > +    vfio_set_device_config_size(vdev, len);
> > +
> > +    if (!vfio_device_state_region_mmaped(region_config)) {
> > +        buf = g_malloc(len);
> > +        if (buf == NULL) {
> > +            error_report("vfio: Failed to allocate memory for migration");
> > +            return -1;
> > +        }
> > +        qemu_get_buffer(f, buf, len);
> > +        if (pwrite(vbasedev->fd, buf, len,
> > +                    region_config->fd_offset) != len) {
> > +            error_report("vfio: Failed to write device config buffer");
> > +            return -1;
> > +        }
> > +        g_free(buf);
> > +    } else {
> > +        dest = region_config->mmaps[0].mmap;
> > +        qemu_get_buffer(f, dest, len);
> > +    }
> > +
> > +    sz = sizeof(action);
> > +    if (pwrite(vbasedev->fd, &action, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, device_config.action))
> > +            != sz) {
> > +        error_report("vfio: action failure for device config set buffer");
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> > +        uint64_t start_addr, uint64_t page_nr)
> > +{
> > +
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region_ctl =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    VFIORegion *region_bitmap =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> > +    unsigned long bitmap_size =
> > +                    BITS_TO_LONGS(page_nr) * sizeof(unsigned long);
> > +    uint32_t sz;
> > +
> > +    struct {
> > +        __u64 start_addr;
> > +        __u64 page_nr;
> > +    } system_memory;
> > +    system_memory.start_addr = start_addr;
> > +    system_memory.page_nr = page_nr;
> > +    sz = sizeof(system_memory);
> > +    if (pwrite(vbasedev->fd, &system_memory, sz,
> > +                region_ctl->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, system_memory))
> > +            != sz) {
> > +        error_report("vfio: Failed to set system memory range for dirty pages");
> > +        return -1;
> > +    }
> > +
> > +    if (!vfio_device_state_region_mmaped(region_bitmap)) {
> > +        void *bitmap = g_malloc0(bitmap_size);
> > +
> > +        if (pread(vbasedev->fd, bitmap, bitmap_size,
> > +                    region_bitmap->fd_offset) != bitmap_size) {
> > +            error_report("vfio: Failed to read dirty bitmap data");
> > +            return -1;
> > +        }
> > +
> > +        cpu_physical_memory_set_dirty_lebitmap(bitmap, start_addr, page_nr);
> > +
> > +        g_free(bitmap);
> > +    } else {
> > +        cpu_physical_memory_set_dirty_lebitmap(
> > +                    region_bitmap->mmaps[0].mmap,
> > +                    start_addr, page_nr);
> > +    }
> > +    return 0;
> > +}
> > +
> > +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> > +        uint64_t start_addr, uint64_t page_nr)
> > +{
> > +    VFIORegion *region_bitmap =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP];
> > +    unsigned long chunk_size = region_bitmap->size;
> > +    uint64_t chunk_pg_nr = (chunk_size / sizeof(unsigned long)) *
> > +                                BITS_PER_LONG;
> > +
> > +    uint64_t cnt_left;
> > +    int rc = 0;
> > +
> > +    cnt_left = page_nr;
> > +
> > +    while (cnt_left >= chunk_pg_nr) {
> > +        rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, chunk_pg_nr);
> > +        if (rc) {
> > +            goto exit;
> > +        }
> > +        cnt_left -= chunk_pg_nr;
> > +        start_addr += chunk_pg_nr * TARGET_PAGE_SIZE;
> > +    }
> > +    rc = vfio_set_dirty_page_bitmap_chunk(vdev, start_addr, cnt_left);
> > +
> > +exit:
> > +    return rc;
> > +}
> > +
> > +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> > +        uint32_t dev_state)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +    uint32_t sz = sizeof(dev_state);
> > +
> > +    if (!vdev->migration) {
> > +        return -1;
> > +    }
> > +
> > +    if (pwrite(vbasedev->fd, &dev_state, sz,
> > +              region->fd_offset +
> > +              offsetof(struct vfio_device_state_ctl, device_state))
> > +            != sz) {
> > +        error_report("vfio: Failed to set device state %d", dev_state);
> 
> Can the kernel reject this if a state transition is not allowed (or are
> all transitions allowed?)
> 
Yes, the kernel can reject a state transition if it's not allowed,
but currently all transitions are allowed.
Maybe a check for self-to-self transitions is needed in the kernel.

> > +        return -1;
> > +    }
> > +    vdev->migration->device_state = dev_state;
> > +    return 0;
> > +}
> > +
> > +static int vfio_get_device_data_caps(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +
> > +    uint32_t caps;
> > +    uint32_t size = sizeof(caps);
> > +
> > +    if (pread(vbasedev->fd, &caps, size,
> > +                region->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, caps))
> > +            != size) {
> > +        error_report("%s Failed to read data caps of device states",
> > +                vbasedev->name);
> > +        return -1;
> > +    }
> > +    vdev->migration->data_caps = caps;
> > +    return 0;
> > +}
> > +
> > +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> > +{
> > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > +    VFIORegion *region =
> > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > +
> > +    uint32_t version;
> > +    uint32_t size = sizeof(version);
> > +
> > +    if (pread(vbasedev->fd, &version, size,
> > +                region->fd_offset +
> > +                offsetof(struct vfio_device_state_ctl, version))
> > +            != size) {
> > +        error_report("%s Failed to read version of device state interfaces",
> > +                vbasedev->name);
> > +        return -1;
> > +    }
> > +
> > +    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > +        error_report("%s migration version mismatch, right version is %d",
> > +                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);
> 
> So, we require an exact match... or should we allow to extend the
> interface in an backwards-compatible way, in which case we'd require
> (QEMU interface version) <= (kernel interface version)?
>
Currently, yes. We can discuss that.
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_vm_change_state_handler(void *pv, int running, RunState state)
> > +{
> > +    VFIOPCIDevice *vdev = pv;
> > +    uint32_t dev_state = vdev->migration->device_state;
> > +
> > +    if (!running) {
> > +        dev_state |= VFIO_DEVICE_STATE_STOP;
> > +    } else {
> > +        dev_state &= ~VFIO_DEVICE_STATE_STOP;
> > +    }
> > +
> > +    vfio_set_device_state(vdev, dev_state);
> > +}
> > +
> > +static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> > +                                   uint64_t max_size,
> > +                                   uint64_t *res_precopy_only,
> > +                                   uint64_t *res_compatible,
> > +                                   uint64_t *res_post_copy_only)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +
> > +    if (!vfio_device_data_cap_device_memory(vdev)) {
> > +        return;
> > +    }
> > +
> > +    return;
> > +}
> > +
> > +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +
> > +    if (!vfio_device_data_cap_device_memory(vdev)) {
> > +        return 0;
> > +    }
> > +
> > +    return 0;
> > +}
> 
> These look a bit weird...
>
Patch 5 (adding device memory cap support) will make it look better :)

> > +
> > +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > +    bool msi_64bit;
> > +
> > +    /* restore pci bar configuration */
> > +    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +            ctl & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        bar_cfg = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> > +    }
> > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > +            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> > +
> > +    /* restore msi configuration */
> > +    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> > +
> > +    vfio_pci_write_config(&vdev->pdev,
> > +            pdev->msi_cap + PCI_MSI_FLAGS,
> > +            ctl & ~PCI_MSI_FLAGS_ENABLE, 2);
> > +
> > +    msi_lo = qemu_get_be32(f);
> > +    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> > +
> > +    if (msi_64bit) {
> > +        msi_hi = qemu_get_be32(f);
> > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                msi_hi, 4);
> > +    }
> > +    msi_data = qemu_get_be32(f);
> > +    vfio_pci_write_config(pdev,
> > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +            msi_data, 2);
> > +
> > +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > +            ctl | PCI_MSI_FLAGS_ENABLE, 2);
> > +
> 
> Ok, this function is indeed pci-specific and probably should be moved
> to the vfio-pci code (other types could hook themselves up in the same
> place, then).
> 
Yes, this is the only PCI-specific code.
Maybe we could use VFIO_DEVICE_TYPE_PCI as a flag to decide whether to
save/load PCI config data?
Or, as Dave said, put the saving/loading of PCI config data into the
VMStateDescription interface?

> > +}
> > +
> > +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    int flag;
> > +    uint64_t len;
> > +    int ret = 0;
> > +
> > +    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    do {
> > +        flag = qemu_get_byte(f);
> > +
> > +        switch (flag & ~VFIO_SAVE_FLAG_CONTINUE) {
> > +        case VFIO_SAVE_FLAG_SETUP:
> > +            break;
> > +        case VFIO_SAVE_FLAG_PCI:
> > +            vfio_pci_load_config(vdev, f);
> > +            break;
> > +        case VFIO_SAVE_FLAG_DEVCONFIG:
> > +            len = qemu_get_be64(f);
> > +            vfio_load_data_device_config(vdev, f, len);
> > +            break;
> > +        default:
> > +            ret = -EINVAL;
> > +        }
> > +    } while (flag & VFIO_SAVE_FLAG_CONTINUE);
> > +
> > +    return ret;
> > +}
> > +
> > +static void vfio_pci_save_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > +{
> > +    PCIDevice *pdev = &vdev->pdev;
> > +    uint32_t msi_cfg, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > +    bool msi_64bit;
> > +
> > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +        bar_cfg = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 4);
> > +        qemu_put_be32(f, bar_cfg);
> > +    }
> > +
> > +    msi_cfg = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > +    msi_64bit = !!(msi_cfg & PCI_MSI_FLAGS_64BIT);
> > +
> > +    msi_lo = pci_default_read_config(pdev,
> > +            pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > +    qemu_put_be32(f, msi_lo);
> > +
> > +    if (msi_64bit) {
> > +        msi_hi = pci_default_read_config(pdev,
> > +                pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > +                4);
> > +        qemu_put_be32(f, msi_hi);
> > +    }
> > +
> > +    msi_data = pci_default_read_config(pdev,
> > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > +            2);
> > +    qemu_put_be32(f, msi_data);
> > +
> > +}
> > +
> > +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    int rc = 0;
> > +
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> > +    vfio_pci_save_config(vdev, f);
> > +
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVCONFIG);
> > +    rc += vfio_get_device_config_size(vdev);
> > +    rc += vfio_save_data_device_config(vdev, f);
> > +
> > +    return rc;
> > +}
> > +
> > +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> > +
> > +    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> > +            VFIO_DEVICE_STATE_LOGGING);
> > +    return 0;
> > +}
> > +
> > +static int vfio_load_setup(QEMUFile *f, void *opaque)
> > +{
> > +    return 0;
> > +}
> > +
> > +static void vfio_save_cleanup(void *opaque)
> > +{
> > +    VFIOPCIDevice *vdev = opaque;
> > +    uint32_t dev_state = vdev->migration->device_state;
> > +
> > +    dev_state &= ~VFIO_DEVICE_STATE_LOGGING;
> > +
> > +    vfio_set_device_state(vdev, dev_state);
> > +}
> 
> These look like they should be type-independent, again.
>
Right :)
> > +
> > +static SaveVMHandlers savevm_vfio_handlers = {
> > +    .save_setup = vfio_save_setup,
> > +    .save_live_pending = vfio_save_live_pending,
> > +    .save_live_iterate = vfio_save_iterate,
> > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> > +    .save_cleanup = vfio_save_cleanup,
> > +    .load_setup = vfio_load_setup,
> > +    .load_state = vfio_load_state,
> > +};
> > +
> > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> > +{
> > +    int ret;
> > +    Error *local_err = NULL;
> > +    vdev->migration = g_new0(VFIOMigration, 1);
> > +
> > +    if (vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_CTL,
> > +              "device-state-ctl")) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_check_devstate_version(vdev)) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_get_device_data_caps(vdev)) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_CONFIG,
> > +              "device-state-data-device-config")) {
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_data_cap_device_memory(vdev)) {
> > +        error_report("No suppport of data cap device memory Yet");
> > +        goto error;
> > +    }
> > +
> > +    if (vfio_device_data_cap_system_memory(vdev) &&
> > +            vfio_device_state_region_setup(vdev,
> > +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_BITMAP],
> > +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP,
> > +              "device-state-data-dirtybitmap")) {
> > +        goto error;
> > +    }
> > +
> > +    vdev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> > +
> > +    register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
> > +            VFIO_DEVICE_STATE_INTERFACE_VERSION,
> > +            &savevm_vfio_handlers,
> > +            vdev);
> > +
> > +    vdev->migration->vm_state =
> > +        qemu_add_vm_change_state_handler(vfio_vm_change_state_handler, vdev);
> > +
> > +    return 0;
> > +error:
> > +    error_setg(&vdev->migration_blocker,
> > +            "VFIO device doesn't support migration");
> > +    ret = migrate_add_blocker(vdev->migration_blocker, &local_err);
> > +    if (local_err) {
> > +        error_propagate(errp, local_err);
> > +        error_free(vdev->migration_blocker);
> > +    }
> > +
> > +    g_free(vdev->migration);
> > +    vdev->migration = NULL;
> > +
> > +    return ret;
> > +}
> > +
> > +void vfio_migration_finalize(VFIOPCIDevice *vdev)
> > +{
> > +    if (vdev->migration) {
> > +        int i;
> > +        qemu_del_vm_change_state_handler(vdev->migration->vm_state);
> > +        unregister_savevm(NULL, TYPE_VFIO_PCI, vdev);
> > +        for (i = 0; i < VFIO_DEVSTATE_REGION_NUM; i++) {
> > +            vfio_region_finalize(&vdev->migration->region[i]);
> > +        }
> > +        g_free(vdev->migration);
> > +        vdev->migration = NULL;
> > +    } else if (vdev->migration_blocker) {
> > +        migrate_del_blocker(vdev->migration_blocker);
> > +        error_free(vdev->migration_blocker);
> > +    }
> > +}
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c0cb1ec..b8e006b 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -37,7 +37,6 @@
> >  
> >  #define MSIX_CAP_LENGTH 12
> >  
> > -#define TYPE_VFIO_PCI "vfio-pci"
> 
> Why do you need to move this? That looks like a sign that the layering
> needs work.
>
I moved it so it could be passed as an idstr to register_savevm_live.
It seems unnecessary if we make it PCI-independent.
Thanks :)

> >  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> >  
> >  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index b1ae4c0..4b7b1bb 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -19,6 +19,7 @@
> >  #include "qemu/event_notifier.h"
> >  #include "qemu/queue.h"
> >  #include "qemu/timer.h"
> > +#include "sysemu/sysemu.h"
> >  
> >  #define PCI_ANY_ID (~0)
> >  
> > @@ -56,6 +57,21 @@ typedef struct VFIOBAR {
> >      QLIST_HEAD(, VFIOQuirk) quirks;
> >  } VFIOBAR;
> >  
> > +enum {
> > +    VFIO_DEVSTATE_REGION_CTL = 0,
> > +    VFIO_DEVSTATE_REGION_DATA_CONFIG,
> > +    VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY,
> > +    VFIO_DEVSTATE_REGION_DATA_BITMAP,
> > +    VFIO_DEVSTATE_REGION_NUM,
> > +};
> > +typedef struct VFIOMigration {
> > +    VFIORegion region[VFIO_DEVSTATE_REGION_NUM];
> > +    uint32_t data_caps;
> > +    uint32_t device_state;
> > +    uint64_t devconfig_size;
> > +    VMChangeStateEntry *vm_state;
> > +} VFIOMigration;
> > +
> >  typedef struct VFIOVGARegion {
> >      MemoryRegion mem;
> >      off_t offset;
> > @@ -132,6 +148,8 @@ typedef struct VFIOPCIDevice {
> >      VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
> >      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
> >      void *igd_opregion;
> > +    VFIOMigration *migration;
> 
> As said, it would probably be better to hang this off VFIODevice.
> 
ok.

> > +    Error *migration_blocker;
> >      PCIHostDeviceAddress host;
> >      EventNotifier err_notifier;
> >      EventNotifier req_notifier;
> > @@ -198,5 +216,10 @@ int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> >  void vfio_display_reset(VFIOPCIDevice *vdev);
> >  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> >  void vfio_display_finalize(VFIOPCIDevice *vdev);
> > -
> > +bool vfio_device_data_cap_system_memory(VFIOPCIDevice *vdev);
> > +bool vfio_device_data_cap_device_memory(VFIOPCIDevice *vdev);
> > +int vfio_set_dirty_page_bitmap(VFIOPCIDevice *vdev,
> > +         uint64_t start_addr, uint64_t page_nr);
> > +int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp);
> > +void vfio_migration_finalize(VFIOPCIDevice *vdev);
> 
> And the interfaces should be in vfio-common.
>
ok.

> >  #endif /* HW_VFIO_VFIO_PCI_H */
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index 1b434d0..ed43613 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -32,6 +32,7 @@
> >  #endif
> >  
> >  #define VFIO_MSG_PREFIX "vfio %s: "
> > +#define TYPE_VFIO_PCI "vfio-pci"
> >  
> >  enum {
> >      VFIO_DEVICE_TYPE_PCI = 0,
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 5/5] vfio/migration: support device memory capability
  2019-02-20 10:14         ` [Qemu-devel] " Christophe de Dinechin
@ 2019-02-21  0:07           ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  0:07 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: cjia, KVM list, Alexey Kardashevskiy, Zhengxiao.zx,
	shuangtai.tst, qemu-devel, Kirti Wankhede, eauger, yi.l.liu,
	Erik Skultety, ziye.yang, mlevitsk, Halil Pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	Alex Williamson, intel-gvt-dev, changpeng.liu, Cornelia Huck,
	Zhi Wang, jonathan.davies

On Wed, Feb 20, 2019 at 11:14:24AM +0100, Christophe de Dinechin wrote:
> 
> 
> > On 20 Feb 2019, at 08:58, Zhao Yan <yan.y.zhao@intel.com> wrote:
> > 
> > On Tue, Feb 19, 2019 at 03:42:36PM +0100, Christophe de Dinechin wrote:
> >> 
> >> 
> >>> On 19 Feb 2019, at 09:53, Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>> 
> >>> If a device has device memory capability, save/load data from device memory
> >>> in pre-copy and stop-and-copy phases.
> >>> 
> >>> LOGGING state is set for device memory for dirty page logging:
> >>> in LOGGING state, get device memory returns whole device memory snapshot;
> >>> outside LOGGING state, get device memory returns dirty data since last get
> >>> operation.
> >>> 
> >>> Usually, device memory is very big, qemu needs to chunk it into several
> >>> pieces each with size of device memory region.
> >>> 
> >>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>> ---
> >>> hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >>> hw/vfio/pci.h       |   1 +
> >>> 2 files changed, 231 insertions(+), 5 deletions(-)
> >>> 
> >>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>> index 16d6395..f1e9309 100644
> >>> --- a/hw/vfio/migration.c
> >>> +++ b/hw/vfio/migration.c
> >>> @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> >>>    return 0;
> >>> }
> >>> 
> >>> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    uint64_t len;
> >>> +    int sz;
> >>> +
> >>> +    sz = sizeof(len);
> >>> +    if (pread(vbasedev->fd, &len, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to get length of device memory");
> >> 
> >> s/length/size/ ? (to be consistent with function name)
> > 
> > ok. thanks
> >>> +        return -1;
> >>> +    }
> >>> +    vdev->migration->devmem_size = len;
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    int sz;
> >>> +
> >>> +    sz = sizeof(size);
> >>> +    if (pwrite(vbasedev->fd, &size, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set length of device comemory");
> >> 
> >> What is comemory? Typo?
> > 
> > Right, typo. should be "memory" :)
> >> 
> >> Same comment about length vs size
> >> 
> > got it. thanks
> > 
> >>> +        return -1;
> >>> +    }
> >>> +    vdev->migration->devmem_size = size;
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static
> >>> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> >>> +                                    uint64_t pos, uint64_t len)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    VFIORegion *region_devmem =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +    void *dest;
> >>> +    uint32_t sz;
> >>> +    uint8_t *buf = NULL;
> >>> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> >>> +
> >>> +    if (len > region_devmem->size) {
> >> 
> >> Is it intentional that there is no error_report here?
> >> 
> > an error_report here may be better.
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    sz = sizeof(pos);
> >>> +    if (pwrite(vbasedev->fd, &pos, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set save buffer pos");
> >>> +        return -1;
> >>> +    }
> >>> +    sz = sizeof(action);
> >>> +    if (pwrite(vbasedev->fd, &action, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set save buffer action");
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> >>> +        buf = g_malloc(len);
> >>> +        if (buf == NULL) {
> >>> +            error_report("vfio: Failed to allocate memory for migrate");
> >> s/migrate/migration/ ?
> > 
> > yes, thanks
> >>> +            return -1;
> >>> +        }
> >>> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> >>> +            error_report("vfio: error load device memory buffer");
> >> s/load/loading/ ?
> > error to load? :)
> 
> I’d check with a native speaker, but I believe it’s “error loading”.
> 
> To me (to be checked), the two sentences don’t have the same meaning:
> 
> “It is an error to load device memory buffer” -> “You are not allowed to do that”
> “I had an error loading device memory buffer” -> “I tried, but it failed"
>
Haha, OK, I'll change it to "loading", thanks :)
> > 
> >>> +            return -1;
> >>> +        }
> >>> +        qemu_put_be64(f, len);
> >>> +        qemu_put_be64(f, pos);
> >>> +        qemu_put_buffer(f, buf, len);
> >>> +        g_free(buf);
> >>> +    } else {
> >>> +        dest = region_devmem->mmaps[0].mmap;
> >>> +        qemu_put_be64(f, len);
> >>> +        qemu_put_be64(f, pos);
> >>> +        qemu_put_buffer(f, dest, len);
> >>> +    }
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> >>> +{
> >>> +    VFIORegion *region_devmem =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +    uint64_t total_len = vdev->migration->devmem_size;
> >>> +    uint64_t pos = 0;
> >>> +
> >>> +    qemu_put_be64(f, total_len);
> >>> +    while (pos < total_len) {
> >>> +        uint64_t len = region_devmem->size;
> >>> +
> >>> +        if (pos + len >= total_len) {
> >>> +            len = total_len - pos;
> >>> +        }
> >>> +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> >>> +            return -1;
> >>> +        }
> >> 
> >> I don’t see where pos is incremented in this loop
> >> 
> > Yes, one line is missing: "pos += len;"
> > Currently, the code has not been verified on hardware with the device
> > memory cap on.
> > Thanks :)
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static
> >>> +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> >>> +                                uint64_t pos, uint64_t len)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    VFIORegion *region_devmem =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +
> >>> +    void *dest;
> >>> +    uint32_t sz;
> >>> +    uint8_t *buf = NULL;
> >>> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> >>> +
> >>> +    if (len > region_devmem->size) {
> >> 
> >> error_report?
> > 
> > It seems better to add an error_report.
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    sz = sizeof(pos);
> >>> +    if (pwrite(vbasedev->fd, &pos, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set device memory buffer pos");
> >>> +        return -1;
> >>> +    }
> >>> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> >>> +        buf = g_malloc(len);
> >>> +        if (buf == NULL) {
> >>> +            error_report("vfio: Failed to allocate memory for migrate");
> >>> +            return -1;
> >>> +        }
> >>> +        qemu_get_buffer(f, buf, len);
> >>> +        if (pwrite(vbasedev->fd, buf, len,
> >>> +                    region_devmem->fd_offset) != len) {
> >>> +            error_report("vfio: Failed to load device memory buffer");
> >>> +            return -1;
> >>> +        }
> >>> +        g_free(buf);
> >>> +    } else {
> >>> +        dest = region_devmem->mmaps[0].mmap;
> >>> +        qemu_get_buffer(f, dest, len);
> >>> +    }
> >>> +
> >>> +    sz = sizeof(action);
> >>> +    if (pwrite(vbasedev->fd, &action, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set load device memory buffer action");
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +
> >>> +}
> >>> +
> >>> +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> >>> +                        QEMUFile *f, uint64_t total_len)
> >>> +{
> >>> +    uint64_t pos = 0, len = 0;
> >>> +
> >>> +    vfio_set_device_memory_size(vdev, total_len);
> >>> +
> >>> +    while (pos + len < total_len) {
> >>> +        len = qemu_get_be64(f);
> >>> +        pos = qemu_get_be64(f);
> >> 
> >> Nit: load reads len/pos in the loop, whereas save does it in the
> >> inner function (vfio_save_data_device_memory_chunk)
> > Right, load has to read len/pos in the loop.
> >> 
> >>> +
> >>> +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +
> >>> static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> >>>        uint64_t start_addr, uint64_t page_nr)
> >>> {
> >>> @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> >>>        return;
> >>>    }
> >>> 
> >>> +    /* get dirty data size of device memory */
> >>> +    vfio_get_device_memory_size(vdev);
> >>> +
> >>> +    *res_precopy_only += vdev->migration->devmem_size;
> >>>    return;
> >>> }
> >>> 
> >>> @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >>>        return 0;
> >>>    }
> >>> 
> >>> -    return 0;
> >>> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> >>> +    /* get dirty data of device memory */
> >>> +    return vfio_save_data_device_memory(vdev, f);
> >>> }
> >>> 
> >>> static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> >>> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >>>            len = qemu_get_be64(f);
> >>>            vfio_load_data_device_config(vdev, f, len);
> >>>            break;
> >>> +        case VFIO_SAVE_FLAG_DEVMEMORY:
> >>> +            len = qemu_get_be64(f);
> >>> +            vfio_load_data_device_memory(vdev, f, len);
> >>> +            break;
> >>>        default:
> >>>            ret = -EINVAL;
> >>>        }
> >>> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >>>    VFIOPCIDevice *vdev = opaque;
> >>>    int rc = 0;
> >>> 
> >>> +    if (vfio_device_data_cap_device_memory(vdev)) {
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> >>> +        /* get dirty data of device memory */
> >>> +        vfio_get_device_memory_size(vdev);
> >>> +        rc = vfio_save_data_device_memory(vdev, f);
> >>> +    }
> >>> +
> >>>    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> >>>    vfio_pci_save_config(vdev, f);
> >>> 
> >>> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >>> 
> >>> static int vfio_save_setup(QEMUFile *f, void *opaque)
> >>> {
> >>> +    int rc = 0;
> >>>    VFIOPCIDevice *vdev = opaque;
> >>> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> >>> +
> >>> +    if (vfio_device_data_cap_device_memory(vdev)) {
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> >>> +        /* get whole snapshot of device memory */
> >>> +        vfio_get_device_memory_size(vdev);
> >>> +        rc = vfio_save_data_device_memory(vdev, f);
> >>> +    } else {
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> >>> +    }
> >>> 
> >>>    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> >>>            VFIO_DEVICE_STATE_LOGGING);
> >>> -    return 0;
> >>> +    return rc;
> >>> }
> >>> 
> >>> static int vfio_load_setup(QEMUFile *f, void *opaque)
> >>> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> >>>        goto error;
> >>>    }
> >>> 
> >>> -    if (vfio_device_data_cap_device_memory(vdev)) {
> >>> -        error_report("No suppport of data cap device memory Yet");
> >>> +    if (vfio_device_data_cap_device_memory(vdev) &&
> >>> +            vfio_device_state_region_setup(vdev,
> >>> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> >>> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> >>> +              "device-state-data-device-memory")) {
> >>>        goto error;
> >>>    }
> >>> 
> >>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> >>> index 4b7b1bb..a2cc64b 100644
> >>> --- a/hw/vfio/pci.h
> >>> +++ b/hw/vfio/pci.h
> >>> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
> >>>    uint32_t data_caps;
> >>>    uint32_t device_state;
> >>>    uint64_t devconfig_size;
> >>> +    uint64_t devmem_size;
> >>>    VMChangeStateEntry *vm_state;
> >>> } VFIOMigration;
> >>> 
> >>> -- 
> >>> 2.7.4
> >>> 
> >>> _______________________________________________
> >>> intel-gvt-dev mailing list
> >>> intel-gvt-dev@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> >> 
> >> _______________________________________________
> >> intel-gvt-dev mailing list
> >> intel-gvt-dev@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 5/5] vfio/migration: support device memory capability
@ 2019-02-21  0:07           ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  0:07 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: cjia, KVM list, Alexey Kardashevskiy, Zhengxiao.zx,
	shuangtai.tst, qemu-devel, Kirti Wankhede, eauger, yi.l.liu,
	Erik Skultety, ziye.yang, mlevitsk, Halil Pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	Alex Williamson, intel-gvt-dev, changpeng.liu, Cornelia Huck,
	Zhi Wang, jonathan.davies

On Wed, Feb 20, 2019 at 11:14:24AM +0100, Christophe de Dinechin wrote:
> 
> 
> > On 20 Feb 2019, at 08:58, Zhao Yan <yan.y.zhao@intel.com> wrote:
> > 
> > On Tue, Feb 19, 2019 at 03:42:36PM +0100, Christophe de Dinechin wrote:
> >> 
> >> 
> >>> On 19 Feb 2019, at 09:53, Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>> 
> >>> If a device has device memory capability, save/load data from device memory
> >>> in pre-copy and stop-and-copy phases.
> >>> 
> >>> LOGGING state is set for device memory for dirty page logging:
> >>> in LOGGING state, get device memory returns whole device memory snapshot;
> >>> outside LOGGING state, get device memory returns dirty data since last get
> >>> operation.
> >>> 
> >>> Usually, device memory is very big, qemu needs to chunk it into several
> >>> pieces each with size of device memory region.
> >>> 
> >>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>> ---
> >>> hw/vfio/migration.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >>> hw/vfio/pci.h       |   1 +
> >>> 2 files changed, 231 insertions(+), 5 deletions(-)
> >>> 
> >>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>> index 16d6395..f1e9309 100644
> >>> --- a/hw/vfio/migration.c
> >>> +++ b/hw/vfio/migration.c
> >>> @@ -203,6 +203,201 @@ static int vfio_load_data_device_config(VFIOPCIDevice *vdev,
> >>>    return 0;
> >>> }
> >>> 
> >>> +static int vfio_get_device_memory_size(VFIOPCIDevice *vdev)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    uint64_t len;
> >>> +    int sz;
> >>> +
> >>> +    sz = sizeof(len);
> >>> +    if (pread(vbasedev->fd, &len, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to get length of device memory");
> >> 
> >> s/length/size/ ? (to be consistent with function name)
> > 
> > ok. thanks
> >>> +        return -1;
> >>> +    }
> >>> +    vdev->migration->devmem_size = len;
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static int vfio_set_device_memory_size(VFIOPCIDevice *vdev, uint64_t size)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    int sz;
> >>> +
> >>> +    sz = sizeof(size);
> >>> +    if (pwrite(vbasedev->fd, &size, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.size))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set length of device comemory");
> >> 
> >> What is comemory? Typo?
> > 
> > Right, typo. should be "memory" :)
> >> 
> >> Same comment about length vs size
> >> 
> > got it. thanks
> > 
> >>> +        return -1;
> >>> +    }
> >>> +    vdev->migration->devmem_size = size;
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static
> >>> +int vfio_save_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> >>> +                                    uint64_t pos, uint64_t len)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    VFIORegion *region_devmem =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +    void *dest;
> >>> +    uint32_t sz;
> >>> +    uint8_t *buf = NULL;
> >>> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> >>> +
> >>> +    if (len > region_devmem->size) {
> >> 
> >> Is it intentional that there is no error_report here?
> >> 
> > an error_report here may be better.
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    sz = sizeof(pos);
> >>> +    if (pwrite(vbasedev->fd, &pos, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set save buffer pos");
> >>> +        return -1;
> >>> +    }
> >>> +    sz = sizeof(action);
> >>> +    if (pwrite(vbasedev->fd, &action, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set save buffer action");
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> >>> +        buf = g_malloc(len);
> >>> +        if (buf == NULL) {
> >>> +            error_report("vfio: Failed to allocate memory for migrate");
> >> s/migrate/migration/ ?
> > 
> > yes, thanks
> >>> +            return -1;
> >>> +        }
> >>> +        if (pread(vbasedev->fd, buf, len, region_devmem->fd_offset) != len) {
> >>> +            error_report("vfio: error load device memory buffer");
> >> s/load/loading/ ?
> > error to load? :)
> 
> I’d check with a native speaker, but I believe it’s “error loading”.
> 
> To me (to be checked), the two sentences don’t have the same meaning:
> 
> “It is an error to load device memory buffer” -> “You are not allowed to do that”
> “I had an error loading device memory buffer” -> “I tried, but it failed"
>
haha, ok, I'll change it to loading, thanks :)
> > 
> >>> +            return -1;
> >>> +        }
> >>> +        qemu_put_be64(f, len);
> >>> +        qemu_put_be64(f, pos);
> >>> +        qemu_put_buffer(f, buf, len);
> >>> +        g_free(buf);
> >>> +    } else {
> >>> +        dest = region_devmem->mmaps[0].mmap;
> >>> +        qemu_put_be64(f, len);
> >>> +        qemu_put_be64(f, pos);
> >>> +        qemu_put_buffer(f, dest, len);
> >>> +    }
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static int vfio_save_data_device_memory(VFIOPCIDevice *vdev, QEMUFile *f)
> >>> +{
> >>> +    VFIORegion *region_devmem =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +    uint64_t total_len = vdev->migration->devmem_size;
> >>> +    uint64_t pos = 0;
> >>> +
> >>> +    qemu_put_be64(f, total_len);
> >>> +    while (pos < total_len) {
> >>> +        uint64_t len = region_devmem->size;
> >>> +
> >>> +        if (pos + len >= total_len) {
> >>> +            len = total_len - pos;
> >>> +        }
> >>> +        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
> >>> +            return -1;
> >>> +        }
> >> 
> >> I don’t see where pos is incremented in this loop
> >> 
> > yes, missing one line "pos += len;"
> > Currently, code is not verified in hardware with device memory cap on.
> > Thanks:)
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static
> >>> +int vfio_load_data_device_memory_chunk(VFIOPCIDevice *vdev, QEMUFile *f,
> >>> +                                uint64_t pos, uint64_t len)
> >>> +{
> >>> +    VFIODevice *vbasedev = &vdev->vbasedev;
> >>> +    VFIORegion *region_ctl =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> >>> +    VFIORegion *region_devmem =
> >>> +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY];
> >>> +
> >>> +    void *dest;
> >>> +    uint32_t sz;
> >>> +    uint8_t *buf = NULL;
> >>> +    uint32_t action = VFIO_DEVICE_DATA_ACTION_SET_BUFFER;
> >>> +
> >>> +    if (len > region_devmem->size) {
> >> 
> >> error_report?
> > 
> > seems better to add error_report.
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    sz = sizeof(pos);
> >>> +    if (pwrite(vbasedev->fd, &pos, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.pos))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set device memory buffer pos");
> >>> +        return -1;
> >>> +    }
> >>> +    if (!vfio_device_state_region_mmaped(region_devmem)) {
> >>> +        buf = g_malloc(len);
> >>> +        if (buf == NULL) {
> >>> +            error_report("vfio: Failed to allocate memory for migrate");
> >>> +            return -1;
> >>> +        }
> >>> +        qemu_get_buffer(f, buf, len);
> >>> +        if (pwrite(vbasedev->fd, buf, len,
> >>> +                    region_devmem->fd_offset) != len) {
> >>> +            error_report("vfio: Failed to load devie memory buffer");
> >>> +            return -1;
> >>> +        }
> >>> +        g_free(buf);
> >>> +    } else {
> >>> +        dest = region_devmem->mmaps[0].mmap;
> >>> +        qemu_get_buffer(f, dest, len);
> >>> +    }
> >>> +
> >>> +    sz = sizeof(action);
> >>> +    if (pwrite(vbasedev->fd, &action, sz,
> >>> +                region_ctl->fd_offset +
> >>> +                offsetof(struct vfio_device_state_ctl, device_memory.action))
> >>> +            != sz) {
> >>> +        error_report("vfio: Failed to set load device memory buffer action");
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +
> >>> +}
> >>> +
> >>> +static int vfio_load_data_device_memory(VFIOPCIDevice *vdev,
> >>> +                        QEMUFile *f, uint64_t total_len)
> >>> +{
> >>> +    uint64_t pos = 0, len = 0;
> >>> +
> >>> +    vfio_set_device_memory_size(vdev, total_len);
> >>> +
> >>> +    while (pos + len < total_len) {
> >>> +        len = qemu_get_be64(f);
> >>> +        pos = qemu_get_be64(f);
> >> 
> >> Nit: load reads len/pos in the loop, whereas save does it in the
> >> inner function (vfio_save_data_device_memory_chunk)
> > right, load has to read len/pos in the loop.
> >> 
> >>> +
> >>> +        vfio_load_data_device_memory_chunk(vdev, f, pos, len);
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +
> >>> static int vfio_set_dirty_page_bitmap_chunk(VFIOPCIDevice *vdev,
> >>>        uint64_t start_addr, uint64_t page_nr)
> >>> {
> >>> @@ -377,6 +572,10 @@ static void vfio_save_live_pending(QEMUFile *f, void *opaque,
> >>>        return;
> >>>    }
> >>> 
> >>> +    /* get dirty data size of device memory */
> >>> +    vfio_get_device_memory_size(vdev);
> >>> +
> >>> +    *res_precopy_only += vdev->migration->devmem_size;
> >>>    return;
> >>> }
> >>> 
> >>> @@ -388,7 +587,9 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> >>>        return 0;
> >>>    }
> >>> 
> >>> -    return 0;
> >>> +    qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> >>> +    /* get dirty data of device memory */
> >>> +    return vfio_save_data_device_memory(vdev, f);
> >>> }
> >>> 
> >>> static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> >>> @@ -458,6 +659,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >>>            len = qemu_get_be64(f);
> >>>            vfio_load_data_device_config(vdev, f, len);
> >>>            break;
> >>> +        case VFIO_SAVE_FLAG_DEVMEMORY:
> >>> +            len = qemu_get_be64(f);
> >>> +            vfio_load_data_device_memory(vdev, f, len);
> >>> +            break;
> >>>        default:
> >>>            ret = -EINVAL;
> >>>        }
> >>> @@ -503,6 +708,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >>>    VFIOPCIDevice *vdev = opaque;
> >>>    int rc = 0;
> >>> 
> >>> +    if (vfio_device_data_cap_device_memory(vdev)) {
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY | VFIO_SAVE_FLAG_CONTINUE);
> >>> +        /* get dirty data of device memory */
> >>> +        vfio_get_device_memory_size(vdev);
> >>> +        rc = vfio_save_data_device_memory(vdev, f);
> >>> +    }
> >>> +
> >>>    qemu_put_byte(f, VFIO_SAVE_FLAG_PCI | VFIO_SAVE_FLAG_CONTINUE);
> >>>    vfio_pci_save_config(vdev, f);
> >>> 
> >>> @@ -515,12 +727,22 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >>> 
> >>> static int vfio_save_setup(QEMUFile *f, void *opaque)
> >>> {
> >>> +    int rc = 0;
> >>>    VFIOPCIDevice *vdev = opaque;
> >>> -    qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> >>> +
> >>> +    if (vfio_device_data_cap_device_memory(vdev)) {
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP | VFIO_SAVE_FLAG_CONTINUE);
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_DEVMEMORY);
> >>> +        /* get whole snapshot of device memory */
> >>> +        vfio_get_device_memory_size(vdev);
> >>> +        rc = vfio_save_data_device_memory(vdev, f);
> >>> +    } else {
> >>> +        qemu_put_byte(f, VFIO_SAVE_FLAG_SETUP);
> >>> +    }
> >>> 
> >>>    vfio_set_device_state(vdev, VFIO_DEVICE_STATE_RUNNING |
> >>>            VFIO_DEVICE_STATE_LOGGING);
> >>> -    return 0;
> >>> +    return rc;
> >>> }
> >>> 
> >>> static int vfio_load_setup(QEMUFile *f, void *opaque)
> >>> @@ -576,8 +798,11 @@ int vfio_migration_init(VFIOPCIDevice *vdev, Error **errp)
> >>>        goto error;
> >>>    }
> >>> 
> >>> -    if (vfio_device_data_cap_device_memory(vdev)) {
> >>> -        error_report("No suppport of data cap device memory Yet");
> >>> +    if (vfio_device_data_cap_device_memory(vdev) &&
> >>> +            vfio_device_state_region_setup(vdev,
> >>> +              &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_DEVICE_MEMORY],
> >>> +              VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_MEMORY,
> >>> +              "device-state-data-device-memory")) {
> >>>        goto error;
> >>>    }
> >>> 
> >>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> >>> index 4b7b1bb..a2cc64b 100644
> >>> --- a/hw/vfio/pci.h
> >>> +++ b/hw/vfio/pci.h
> >>> @@ -69,6 +69,7 @@ typedef struct VFIOMigration {
> >>>    uint32_t data_caps;
> >>>    uint32_t device_state;
> >>>    uint64_t devconfig_size;
> >>> +    uint64_t devmem_size;
> >>>    VMChangeStateEntry *vm_state;
> >>> } VFIOMigration;
> >>> 
> >>> -- 
> >>> 2.7.4
> >>> 
> >>> _______________________________________________
> >>> intel-gvt-dev mailing list
> >>> intel-gvt-dev@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> >> 
> >> _______________________________________________
> >> intel-gvt-dev mailing list
> >> intel-gvt-dev@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-20 11:56   ` [Qemu-devel] " Gonglei (Arei)
@ 2019-02-21  0:24     ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  0:24 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev@lists.freedesktop.org

On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> Hi yan,
> 
> Thanks for your work.
> 
> I have some suggestions or questions:
> 
> 1) Would you add MSI-X mode support? If not, please add a check in vfio_pci_save_config(), like Nvidia's solution.
ok.

> 2) We should start vfio devices before vcpu resumes, so we can't rely on vm start change handler completely.
VFIO devices are set to running state by default.
On the target machine, the state transition flow is running->stop->running,
so maybe you can ignore the stop notification in the kernel?
> 3) We'd better support live migration rollback since there are many failure scenarios;
>  registering a migration notifier is a good choice.
I think this patchset can also handle the failure cases well.
If migration failure or cancelling happens,
the cleanup handler clears the LOGGING state, and the device state
(running or stopped) is kept as it is.
Then,
if the VM switches back to running, the device state will be set to running;
if the VM stays stopped, the device state also stays stopped (there is no
point in leaving it in running state).
Do you think so?

> 4) Four memory regions for live migration are too complicated IMHO.
One big region requires the sub-regions to be well padded:
e.g. the leading control fields have to be padded to 4K,
and the same goes for the other data fields.
Otherwise, mmap simply fails, because the start offset and size for mmap
both need to be page aligned.

Also, four regions are clearer in my view :)

> 5) About log sync, why not register log_global_start/stop in vfio_memory_listener?
> 
> 
It seems log_global_start/stop cannot be called iteratively in the pre-copy
phase? For dirty pages in system memory, it's better to transfer dirty data
iteratively to reduce downtime, right?


> Regards,
> -Gonglei
> 
> 
> > -----Original Message-----
> > From: Yan Zhao [mailto:yan.y.zhao@intel.com]
> > Sent: Tuesday, February 19, 2019 4:51 PM
> > To: alex.williamson@redhat.com; qemu-devel@nongnu.org
> > Cc: intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com; Gonglei (Arei)
> > <arei.gonglei@huawei.com>; kvm@vger.kernel.org; Yan Zhao
> > <yan.y.zhao@intel.com>
> > Subject: [PATCH 0/5] QEMU VFIO live migration
> > 
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> > 
> > Device Data
> > -----------
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> >         Every device is supposed to possess device config data.
> >         Usually device config's size is small (no bigger than 10M), and it
> >         needs to be loaded in certain strict order.
> >         Therefore, device config only needs to be saved/loaded in
> >         stop-and-copy phase.
> >         The data of device config is held in device config region.
> >         Size of device config data is smaller than or equal to that of
> >         device config region.
> > 
> > Device Memory: device's internal memory, standalone and outside system
> >         memory. It is usually very big.
> >         This kind of data needs to be saved / loaded in pre-copy and
> >         stop-and-copy phase.
> >         The data of device memory is held in device memory region.
> >         Size of device memory is usually larger than that of device
> >         memory region. qemu needs to save/load it in chunks of size of
> >         device memory region.
> >         Not all devices have device memory, e.g. IGD only uses system memory.
> > 
> > System memory dirty pages: If a device produces dirty pages in system
> >         memory, it is able to get dirty bitmap for certain range of system
> >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> >         callback, dirty pages in system memory will be save/loaded by ram's
> >         live migration code.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> >         If system memory range is larger than that dirty bitmap region can
> >         hold, qemu will cut it into several chunks and get dirty bitmap in
> >         succession.
> > 
> > 
> > Device State Regions
> > --------------------
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support device state management.
> > 
> > So, there are up to four regions in total.
> > One control region: mandatory.
> >         Get access via read/write system call.
> >         Its layout is defined in struct vfio_device_state_ctl
> > Three data regions: mmaped into qemu.
> >         device config region: mandatory, holding data of device config
> >         device memory region: optional, holding data of device memory
> >         dirty bitmap region: optional, holding bitmap of system memory
> >                             dirty pages
> > 
> > (The reason why four separate regions are defined is that the unit of mmap
> > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > control and three mmaped regions for data seems better than one big region
> > padded and sparse mmaped).
> > 
> > 
> > kernel device state interface [1]
> > --------------------------------------
> > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > 
> > #define VFIO_DEVICE_STATE_RUNNING 0
> > #define VFIO_DEVICE_STATE_STOP 1
> > #define VFIO_DEVICE_STATE_LOGGING 2
> > 
> > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > 
> > struct vfio_device_state_ctl {
> > 	__u32 version;		  /* ro */
> > 	__u32 device_state;       /* VFIO device state, wo */
> > 	__u32 caps;		 /* ro */
> >         struct {
> > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> > 		__u64 size;    /*rw*/
> > 	} device_config;
> > 	struct {
> > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> > 		__u64 size;     /* rw */
> >                 __u64 pos; /*the offset in total buffer of device memory*/
> > 	} device_memory;
> > 	struct {
> > 		__u64 start_addr; /* wo */
> > 		__u64 page_nr;   /* wo */
> > 	} system_memory;
> > };
> > 
> > Device States
> > -------------
> > After migration is initialized, it will set the device state by writing to
> > device_state field of control region.
> > 
> > Four states are defined for a VFIO device:
> >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP
> > 
> > RUNNING: In this state, a VFIO device is in active state ready to receive
> >         commands from device driver.
> >         It is the default state that a VFIO device enters initially.
> > 
> > STOP:  In this state, a VFIO device is deactivated to interact with
> >        device driver.
> > 
> > LOGGING: a special state that it CANNOT exist independently. It must be
> >        set alongside with state RUNNING or STOP (i.e. RUNNING &
> > LOGGING,
> >        STOP & LOGGING).
> >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> >        driver can start dirty data logging for device memory and system
> >        memory.
> >        LOGGING only impacts device/system memory. They return whole
> >        snapshot outside LOGGING and dirty data since last get operation
> >        inside LOGGING.
> >        Device config should be always accessible and return whole config
> >        snapshot regardless of LOGGING state.
> > 
> > Note:
> > The reason why RUNNING is the default state is that device's active state
> > must not depend on device state interface.
> > It is possible that region vfio_device_state_ctl fails to get registered.
> > In that condition, a device needs to be in active state by default.
> > 
> > Get Version & Get Caps
> > ----------------------
> > On migration init phase, qemu will probe the existence of device state
> > regions of vendor driver, then get version of the device state interface
> > from the r/w control region.
> > 
> > Then it will probe VFIO device's data capability by reading caps field of
> > control region.
> >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> >         device memory in pre-copy and stop-and-copy phase. The data of
> >         device memory is held in device memory region.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty
> > pages
> >         produced by VFIO device during pre-copy and stop-and-copy phase.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> > 
> > If failing to find two mandatory regions and optional data regions
> > corresponding to data caps or version mismatching, it will setup a
> > migration blocker and disable live migration for VFIO device.
> > 
> > 
> > Flows to call device state interface for VFIO live migration
> > ------------------------------------------------------------
> > 
> > Live migration save path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_SAVE_SETUP
> >  |
> >  .save_setup callback -->
> >  get device memory size (whole snapshot size)
> >  get device memory buffer (whole snapshot data)
> >  set device state --> VFIO_DEVICE_STATE_RUNNING &
> > VFIO_DEVICE_STATE_LOGGING
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> >  .save_live_pending callback --> get device memory size (dirty data)
> >  .save_live_iteration callback --> get device memory buffer (dirty data)
> >  .log_sync callback --> get system memory dirty bitmap
> >  |
> > (vcpu stops) --> set device state -->
> >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> >  |
> > .save_live_complete_precopy callback -->
> >  get device memory size (dirty data)
> >  get device memory buffer (dirty data)
> >  get device config size (whole snapshot size)
> >  get device config buffer (whole snapshot data)
> >  |
> > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > MIGRATION_STATUS_COMPLETED
> > 
> > MIGRATION_STATUS_CANCELLED or
> > MIGRATION_STATUS_FAILED
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > 
> > 
> > Live migration load path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> > .load state callback -->
> >  set device memory size, set device memory buffer, set device config size,
> >  set device config buffer
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_COMPLETED
> > 
> > 
> > 
> > In source VM side,
> > In precopy phase,
> > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > qemu will first get whole snapshot of device memory in .save_setup
> > callback, and then it will get total size of dirty data in device memory in
> > .save_live_pending callback by reading device_memory.size field of control
> > region.
> > Then in .save_live_iteration callback, it will get buffer of device memory's
> > dirty data chunk by chunk from device memory region by writing pos &
> > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > control region. (size of each chunk is the size of device memory data
> > region).
> > .save_live_pending and .save_live_iteration may be called several times in
> > precopy phase to get dirty data in device memory.
> > 
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy
> > phase
> > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > vendor driver's device state interface to get data from device memory.
> > 
> > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY
> > on,
> > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > region by writing system memory's start address, page count and action
> > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr",
> > and
> > "system_memory.action" fields of control region.
> > If page count passed in .log_sync callback is larger than the bitmap size
> > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > vendor driver's get system memory dirty bitmap interface.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> > returns without calling the vendor driver.
> > 
> > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > In the .save_live_complete_precopy callback,
> > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > get device memory size and get device memory buffer will be called again.
> > After that,
> > device config data is read from the device config region by reading
> > device_config.size of the control region and writing action (GET_BUFFER) to
> > device_config.action of the control region.
> > Then after migration completes, in the cleanup handler, LOGGING state will be
> > cleared (i.e. device state is set to STOP).
> > Clearing LOGGING state in the cleanup handler also covers the cases of
> > "migration failed" and "migration cancelled", which can likewise leverage
> > the cleanup handler to unset LOGGING state.
> > 
> > 
> > References
> > ----------
> > 1. kernel side implementation of Device state interfaces:
> > https://patchwork.freedesktop.org/series/56876/
> > 
> > 
> > Yan Zhao (5):
> >   vfio/migration: define kernel interfaces
> >   vfio/migration: support device of device config capability
> >   vfio/migration: tracking of dirty page in system memory
> >   vfio/migration: turn on migration
> >   vfio/migration: support device memory capability
> > 
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/common.c              |  26 ++
> >  hw/vfio/migration.c           | 858
> > ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |  10 +-
> >  hw/vfio/pci.h                 |  26 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> >  7 files changed, 1174 insertions(+), 9 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > --
> > 2.7.4
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-02-21  0:24     ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  0:24 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: alex.williamson, qemu-devel, intel-gvt-dev, Zhengxiao.zx,
	yi.l.liu, eskultet, ziye.yang, cohuck, shuangtai.tst, dgilbert,
	zhi.a.wang, mlevitsk, pasic, aik, eauger, felipe,
	jonathan.davies, changpeng.liu, Ken.Xue, kwankhede, kevin.tian,
	cjia, kvm

On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> Hi yan,
> 
> Thanks for your work.
> 
> I have some suggestions or questions:
> 
> 1) Would you add msix mode support,? if not, pls add a check in vfio_pci_save_config(), likes Nvidia's solution.
ok.

> 2) We should start vfio devices before vcpu resumes, so we can't rely on vm start change handler completely.
VFIO devices are by default set to the running state.
In the target machine, the state transition flow is running->stop->running,
so maybe you can ignore the stop notification in the kernel?
> 3) We'd better support live migration rollback since have many failure scenarios,
>  register a migration notifier is a good choice.
I think this patchset can also handle the failure case well.
If migration failure or cancellation happens,
the LOGGING state is cleared in the cleanup handler; the device state
(running or stopped) keeps as it is.
Then,
if the VM switches back to running, the device state will be set to running;
if the VM stays in the stopped state, the device state also stays stopped
(there is no point in keeping it in the running state).
Do you think so?

> 4) Four memory region for live migration is too complicated IMHO. 
One big region requires the sub-regions to be well padded.
For example, the control fields at the start have to be padded to 4K,
and the same goes for the other data fields.
Otherwise, mmap simply fails, because the start offset and size for mmap
both need to be PAGE aligned.

Also, 4 regions is clearer in my view :)

> 5) About log sync, why not register log_global_start/stop in vfio_memory_listener?
> 
> 
It seems log_global_start/stop cannot be called iteratively in the pre-copy
phase? For dirty pages in system memory, it's better to transfer dirty data
iteratively to reduce downtime, right?


> Regards,
> -Gonglei
> 
> 
> > -----Original Message-----
> > From: Yan Zhao [mailto:yan.y.zhao@intel.com]
> > Sent: Tuesday, February 19, 2019 4:51 PM
> > To: alex.williamson@redhat.com; qemu-devel@nongnu.org
> > Cc: intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com; Gonglei (Arei)
> > <arei.gonglei@huawei.com>; kvm@vger.kernel.org; Yan Zhao
> > <yan.y.zhao@intel.com>
> > Subject: [PATCH 0/5] QEMU VFIO live migration
> > 
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on the last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> > 
> > Device Data
> > -----------
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> >         Every device is supposed to possess device config data.
> >         Usually device config's size is small (no bigger than 10 MB), and it
> >         needs to be loaded in certain strict order.
> >         Therefore, device config only needs to be saved/loaded in
> >         stop-and-copy phase.
> >         The data of device config is held in device config region.
> >         Size of device config data is smaller than or equal to that of
> >         device config region.
> > 
> > Device Memory: device's internal memory, standalone and outside system
> >         memory. It is usually very big.
> >         This kind of data needs to be saved / loaded in pre-copy and
> >         stop-and-copy phase.
> >         The data of device memory is held in device memory region.
> >         Size of device memory is usually larger than that of the device
> >         memory region. QEMU needs to save/load it in chunks of the size of
> >         device memory region.
> >         Not all devices have device memory. For example, IGD only uses system memory.
> > 
> > System memory dirty pages: If a device produces dirty pages in system
> >         memory, it is able to get dirty bitmap for certain range of system
> >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> >         callback, dirty pages in system memory will be saved/loaded by RAM's
> >         live migration code.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> >         If system memory range is larger than that dirty bitmap region can
> >         hold, qemu will cut it into several chunks and get dirty bitmap in
> >         succession.
> > 
> > 
> > Device State Regions
> > --------------------
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support device state management.
> > 
> > So, there are up to four regions in total.
> > One control region: mandatory.
> >         Get access via read/write system call.
> >         Its layout is defined in struct vfio_device_state_ctl
> > Three data regions: mmaped into qemu.
> >         device config region: mandatory, holding data of device config
> >         device memory region: optional, holding data of device memory
> >         dirty bitmap region: optional, holding bitmap of system memory
> >                             dirty pages
> > 
> > (The reason why four separate regions are defined is that the unit of mmap
> > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > control and three mmaped regions for data seems better than one big region
> > padded and sparse mmaped).
> > 
> > 
> > kernel device state interface [1]
> > --------------------------------------
> > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > 
> > #define VFIO_DEVICE_STATE_RUNNING 0
> > #define VFIO_DEVICE_STATE_STOP 1
> > #define VFIO_DEVICE_STATE_LOGGING 2
> > 
> > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > 
> > struct vfio_device_state_ctl {
> > 	__u32 version;		  /* ro */
> > 	__u32 device_state;       /* VFIO device state, wo */
> > 	__u32 caps;		 /* ro */
> >         struct {
> > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */
> > 		__u64 size;    /*rw*/
> > 	} device_config;
> > 	struct {
> > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */
> > 		__u64 size;     /* rw */
> >                 __u64 pos; /*the offset in total buffer of device memory*/
> > 	} device_memory;
> > 	struct {
> > 		__u64 start_addr; /* wo */
> > 		__u64 page_nr;   /* wo */
> > 	} system_memory;
> > };
> > 
> > Device States
> > -------------
> > After migration is initialized, it will set the device state via writing to
> > device_state field of control region.
> > 
> > Four states are defined for a VFIO device:
> >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP
> > 
> > RUNNING: In this state, a VFIO device is in active state ready to receive
> >         commands from device driver.
> >         It is the default state that a VFIO device enters initially.
> > 
> > STOP:  In this state, a VFIO device is deactivated and does not interact
> >        with the device driver.
> > 
> > LOGGING: a special state that CANNOT exist independently. It must be
> >        set alongside state RUNNING or STOP (i.e. RUNNING &
> > LOGGING,
> >        STOP & LOGGING).
> >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> >        driver can start dirty data logging for device memory and system
> >        memory.
> >        LOGGING only impacts device/system memory. They return whole
> >        snapshot outside LOGGING and dirty data since last get operation
> >        inside LOGGING.
> >        Device config should be always accessible and return whole config
> >        snapshot regardless of LOGGING state.
> > 
> > Note:
> > The reason why RUNNING is the default state is that device's active state
> > must not depend on device state interface.
> > It is possible that the region vfio_device_state_ctl fails to get registered.
> > In that condition, a device needs to be in the active state by default.
> > 
> > Get Version & Get Caps
> > ----------------------
> > On migration init phase, qemu will probe the existence of device state
> > regions of vendor driver, then get version of the device state interface
> > from the r/w control region.
> > 
> > Then it will probe VFIO device's data capability by reading caps field of
> > control region.
> >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> >         device memory in pre-copy and stop-and-copy phase. The data of
> >         device memory is held in device memory region.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty
> > pages
> >         produced by the VFIO device during pre-copy and stop-and-copy phase.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> > 
> > If failing to find two mandatory regions and optional data regions
> > corresponding to data caps or version mismatching, it will setup a
> > migration blocker and disable live migration for VFIO device.
> > 
> > 
> > Flows to call device state interface for VFIO live migration
> > ------------------------------------------------------------
> > 
> > Live migration save path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_SAVE_SETUP
> >  |
> >  .save_setup callback -->
> >  get device memory size (whole snapshot size)
> >  get device memory buffer (whole snapshot data)
> >  set device state --> VFIO_DEVICE_STATE_RUNNING &
> > VFIO_DEVICE_STATE_LOGGING
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> >  .save_live_pending callback --> get device memory size (dirty data)
> >  .save_live_iteration callback --> get device memory buffer (dirty data)
> >  .log_sync callback --> get system memory dirty bitmap
> >  |
> > (vcpu stops) --> set device state -->
> >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> >  |
> > .save_live_complete_precopy callback -->
> >  get device memory size (dirty data)
> >  get device memory buffer (dirty data)
> >  get device config size (whole snapshot size)
> >  get device config buffer (whole snapshot data)
> >  |
> > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > MIGRATION_STATUS_COMPLETED
> > 
> > MIGRATION_STATUS_CANCELLED or
> > MIGRATION_STATUS_FAILED
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > 
> > 
> > Live migration load path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> > .load_state callback -->
> >  set device memory size, set device memory buffer, set device config size,
> >  set device config buffer
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_COMPLETED
> > 
> > 
> > 
> > In source VM side,
> > In precopy phase,
> > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > QEMU will first get a whole snapshot of device memory in the .save_setup
> > callback, and then it will get the total size of dirty data in device memory in
> > .save_live_pending callback by reading device_memory.size field of control
> > region.
> > Then in .save_live_iteration callback, it will get buffer of device memory's
> > dirty data chunk by chunk from device memory region by writing pos &
> > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > control region. (size of each chunk is the size of device memory data
> > region).
> > .save_live_pending and .save_live_iteration may be called several times in
> > precopy phase to get dirty data in device memory.
> > 
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in the precopy
> > phase
> > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > the vendor driver's device state interface to get data from device memory.
> > 
> > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY
> > on,
> > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > region by writing system memory's start address, page count and action
> > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr",
> > and
> > "system_memory.action" fields of control region.
> > If the page count passed in the .log_sync callback is larger than the
> > bitmap size the dirty bitmap region supports, QEMU will cut it into
> > chunks and call the vendor driver's get system memory dirty bitmap interface.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> > returns without calling the vendor driver.
> > 
> > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > In the save_live_complete_precopy callback,
> > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > get device memory size and get device memory buffer will be called again.
> > After that,
> > device config data is read from the device config region by reading
> > device_config.size of the control region and writing action (GET_BUFFER) to
> > device_config.action of the control region.
> > Then after migration completes, in the cleanup handler, the LOGGING state
> > will be cleared (i.e. device state is set to STOP).
> > Clearing LOGGING state in the cleanup handler covers the cases of
> > "migration failed" and "migration cancelled", which also leverage
> > the cleanup handler to unset the LOGGING state.
> > 
> > 
> > References
> > ----------
> > 1. kernel side implementation of Device state interfaces:
> > https://patchwork.freedesktop.org/series/56876/
> > 
> > 
> > Yan Zhao (5):
> >   vfio/migration: define kernel interfaces
> >   vfio/migration: support device of device config capability
> >   vfio/migration: tracking of dirty page in system memory
> >   vfio/migration: turn on migration
> >   vfio/migration: support device memory capability
> > 
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/common.c              |  26 ++
> >  hw/vfio/migration.c           | 858
> > ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |  10 +-
> >  hw/vfio/pci.h                 |  26 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> >  7 files changed, 1174 insertions(+), 9 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> > --
> > 2.7.4
> 


* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-20 11:01       ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2019-02-21  0:31         ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  0:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck,
	zhi.a.wang, jonathan.davies

On Wed, Feb 20, 2019 at 11:01:43AM +0000, Dr. David Alan Gilbert wrote:
> * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > > 
> > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > > 
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > > 
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > >     so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> > 
> > 
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open question for VFIO migration. We need to think
> > about whether it's better to check that in libvirt or QEMU (like a device
> > magic along with a version?).
> > This patchset is intended to settle down the main device state interfaces
> > for VFIO migration, so that we can work on them and improve things.
> > 
> > 
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > > 
> > > Dave
> > >
> > Got it. many thanks~~
> > 
> > 
> > > (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> > > 'migrtion'
> > 
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
Yes, I tested it on Intel's graphics devices, which do not have device
memory, so the device-memory cap is off.
I believe this patchset can work well on VFIO network cards as well,
because Gonglei once said their NIC worked well on our previous code
(i.e. with the device-memory cap off).


> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
>
Maybe Kirti can merge their implementation into the code for the
device-memory cap (like in my patch 5 for device memory).

> Dave
> 
> > > 
> > > > Device Data
> > > > -----------
> > > > Device data is divided into three types: device memory, device config,
> > > > and system memory dirty pages produced by device.
> > > > 
> > > > Device config: data like MMIOs, page tables...
> > > >         Every device is supposed to possess device config data.
> > > >         Usually device config's size is small (no bigger than 10 MB), and it
> > > >         needs to be loaded in certain strict order.
> > > >         Therefore, device config only needs to be saved/loaded in
> > > >         stop-and-copy phase.
> > > >         The data of device config is held in device config region.
> > > >         Size of device config data is smaller than or equal to that of
> > > >         device config region.
> > > > 
> > > > Device Memory: device's internal memory, standalone and outside system
> > > >         memory. It is usually very big.
> > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > >         stop-and-copy phase.
> > > >         The data of device memory is held in device memory region.
> > > >         Size of device memory is usually larger than that of the device
> > > >         memory region. QEMU needs to save/load it in chunks of the size of
> > > >         device memory region.
> > > >         Not all devices have device memory. For example, IGD only uses system memory.
> > > > 
> > > > System memory dirty pages: If a device produces dirty pages in system
> > > >         memory, it is able to get dirty bitmap for certain range of system
> > > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > > >         callback, dirty pages in system memory will be saved/loaded by RAM's
> > > >         live migration code.
> > > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > >         If system memory range is larger than that dirty bitmap region can
> > > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > > >         succession.
> > > > 
> > > > 
> > > > Device State Regions
> > > > --------------------
> > > > Vendor driver is required to expose two mandatory regions and another two
> > > > optional regions if it plans to support device state management.
> > > > 
> > > > So, there are up to four regions in total.
> > > > One control region: mandatory.
> > > >         Get access via read/write system call.
> > > >         Its layout is defined in struct vfio_device_state_ctl
> > > > Three data regions: mmaped into qemu.
> > > >         device config region: mandatory, holding data of device config
> > > >         device memory region: optional, holding data of device memory
> > > >         dirty bitmap region: optional, holding bitmap of system memory
> > > >                             dirty pages
> > > > 
> > > > (The reason why four separate regions are defined is that the unit of mmap
> > > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > > control and three mmaped regions for data seems better than one big region
> > > > padded and sparse mmaped).
> > > > 
> > > > 
> > > > kernel device state interface [1]
> > > > --------------------------------------
> > > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > 
> > > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > > #define VFIO_DEVICE_STATE_STOP 1
> > > > #define VFIO_DEVICE_STATE_LOGGING 2
> > > > 
> > > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > > 
> > > > struct vfio_device_state_ctl {
> > > > 	__u32 version;		  /* ro */
> > > > 	__u32 device_state;       /* VFIO device state, wo */
> > > > 	__u32 caps;		 /* ro */
> > > >         struct {
> > > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > > 		__u64 size;    /*rw*/
> > > > 	} device_config;
> > > > 	struct {
> > > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > > 		__u64 size;     /* rw */  
> > > >                 __u64 pos; /*the offset in total buffer of device memory*/
> > > > 	} device_memory;
> > > > 	struct {
> > > > 		__u64 start_addr; /* wo */
> > > > 		__u64 page_nr;   /* wo */
> > > > 	} system_memory;
> > > > };
> > > > 
> > > > Device States
> > > > -------------
> > > > After migration is initialized, it will set the device state via writing to
> > > > device_state field of control region.
> > > > 
> > > > Four states are defined for a VFIO device:
> > > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > > 
> > > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > > >         commands from device driver.
> > > >         It is the default state that a VFIO device enters initially.
> > > > 
> > > > STOP:  In this state, a VFIO device is deactivated and does not interact
> > > >        with the device driver.
> > > > 
> > > > LOGGING: a special state that CANNOT exist independently. It must be
> > > >        set alongside state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > > >        STOP & LOGGING).
> > > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > > >        driver can start dirty data logging for device memory and system
> > > >        memory.
> > > >        LOGGING only impacts device/system memory. They return whole
> > > >        snapshot outside LOGGING and dirty data since last get operation
> > > >        inside LOGGING.
> > > >        Device config should be always accessible and return whole config
> > > >        snapshot regardless of LOGGING state.
> > > >        
> > > > Note:
> > > > The reason why RUNNING is the default state is that device's active state
> > > > must not depend on device state interface.
> > > > It is possible that the region vfio_device_state_ctl fails to get registered.
> > > > In that condition, a device needs to be in the active state by default.
> > > > 
> > > > Get Version & Get Caps
> > > > ----------------------
> > > > On migration init phase, qemu will probe the existence of device state
> > > > regions of vendor driver, then get version of the device state interface
> > > > from the r/w control region.
> > > > 
> > > > Then it will probe VFIO device's data capability by reading caps field of
> > > > control region.
> > > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > > >         device memory in pre-copy and stop-and-copy phase. The data of
> > > >         device memory is held in device memory region.
> > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > > >         produced by the VFIO device during pre-copy and stop-and-copy phase.
> > > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > > 
> > > > If failing to find two mandatory regions and optional data regions
> > > > corresponding to data caps or version mismatching, it will setup a
> > > > migration blocker and disable live migration for VFIO device.
> > > > 
> > > > 
> > > > Flows to call device state interface for VFIO live migration
> > > > ------------------------------------------------------------
> > > > 
> > > > Live migration save path:
> > > > 
> > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > 
> > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > MIGRATION_STATUS_SAVE_SETUP
> > > >  |
> > > >  .save_setup callback -->
> > > >  get device memory size (whole snapshot size)
> > > >  get device memory buffer (whole snapshot data)
> > > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > > >  |
> > > > MIGRATION_STATUS_ACTIVE
> > > >  |
> > > >  .save_live_pending callback --> get device memory size (dirty data)
> > > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > > >  .log_sync callback --> get system memory dirty bitmap
> > > >  |
> > > > (vcpu stops) --> set device state -->
> > > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > > >  |
> > > > .save_live_complete_precopy callback -->
> > > >  get device memory size (dirty data)
> > > >  get device memory buffer (dirty data)
> > > >  get device config size (whole snapshot size)
> > > >  get device config buffer (whole snapshot data)
> > > >  |
> > > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > > MIGRATION_STATUS_COMPLETED
> > > > 
> > > > MIGRATION_STATUS_CANCELLED or
> > > > MIGRATION_STATUS_FAILED
> > > >  |
> > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > > 
> > > > 
> > > > Live migration load path:
> > > > 
> > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > 
> > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > > >  |
> > > > MIGRATION_STATUS_ACTIVE
> > > >  |
> > > > .load_state callback -->
> > > >  set device memory size, set device memory buffer, set device config size,
> > > >  set device config buffer
> > > >  |
> > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > MIGRATION_STATUS_COMPLETED
> > > > 
> > > > 
> > > > 
> > > > In source VM side,
> > > > In precopy phase,
> > > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > > QEMU will first get a whole snapshot of device memory in the .save_setup
> > > > callback, and then it will get the total size of dirty data in device memory in
> > > > .save_live_pending callback by reading device_memory.size field of control
> > > > region.
> > > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > > dirty data chunk by chunk from device memory region by writing pos &
> > > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > > control region. (size of each chunk is the size of device memory data
> > > > region).
> > > > .save_live_pending and .save_live_iteration may be called several times in
> > > > precopy phase to get dirty data in device memory.
> > > > 
> > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in the precopy phase
> > > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > > the vendor driver's device state interface to get data from device memory.
> > > > 
> > > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > > region by writing system memory's start address, page count and action 
> > > > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > > > "system_memory.action" fields of control region.
> > > > If the page count passed in the .log_sync callback is larger than the bitmap
> > > > size the dirty bitmap region supports, QEMU will cut it into chunks and call
> > > > the vendor driver's get system memory dirty bitmap interface.
> > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> > > > returns without calling the vendor driver.
> > > > 
> > > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > > In the save_live_complete_precopy callback,
> > > > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > > get device memory size and get device memory buffer will be called again.
> > > > After that,
> > > > device config data is read from the device config region by reading
> > > > device_config.size of the control region and writing action (GET_BUFFER) to
> > > > device_config.action of the control region.
> > > > Then after migration completes, in the cleanup handler, the LOGGING state
> > > > will be cleared (i.e. device state is set to STOP).
> > > > Clearing LOGGING state in the cleanup handler covers the cases of
> > > > "migration failed" and "migration cancelled", which also leverage
> > > > the cleanup handler to unset the LOGGING state.
> > > > 
> > > > 
> > > > References
> > > > ----------
> > > > 1. kernel side implementation of Device state interfaces:
> > > > https://patchwork.freedesktop.org/series/56876/
> > > > 
> > > > 
> > > > Yan Zhao (5):
> > > >   vfio/migration: define kernel interfaces
> > > >   vfio/migration: support device of device config capability
> > > >   vfio/migration: tracking of dirty page in system memory
> > > >   vfio/migration: turn on migration
> > > >   vfio/migration: support device memory capability
> > > > 
> > > >  hw/vfio/Makefile.objs         |   2 +-
> > > >  hw/vfio/common.c              |  26 ++
> > > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > > >  hw/vfio/pci.c                 |  10 +-
> > > >  hw/vfio/pci.h                 |  26 +-
> > > >  include/hw/vfio/vfio-common.h |   1 +
> > > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > > >  create mode 100644 hw/vfio/migration.c
> > > > 
> > > > -- 
> > > > 2.7.4
> > > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > _______________________________________________
> > > intel-gvt-dev mailing list
> > > intel-gvt-dev@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-02-21  0:31         ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  0:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, zhi.a.wang, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck, Ken.Xue,
	jonathan.davies

On Wed, Feb 20, 2019 at 11:01:43AM +0000, Dr. David Alan Gilbert wrote:
> * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > > 
> > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > > 
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > > 
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > >     so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> > 
> > 
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open for VFIO migration. Need to think about
> > whether it's better to check that in libvirt or qemu (like a device magic
> > along with verion ?).
> > This patchset is intended to settle down the main device state interfaces
> > for VFIO migration. So that we can work on that and improve it.
> > 
> > 
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > > 
> > > Dave
> > >
> > Got it. many thanks~~
> > 
> > 
> > > (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> > > 'migrtion'
> > 
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
Yes, I tested it on Intel's graphics devices, which do not have device
memory, so the device-memory cap is off.
I believe this patchset can work well on VFIO network cards as well,
because Gonglei once said their NIC worked well on our previous code
(i.e. with the device-memory cap off).


> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
>
Maybe Kirti can merge their implementation into the code for the
device-memory cap (like in my patch 5 for device memory).

> Dave
> 
> > > 
> > > > Device Data
> > > > -----------
> > > > Device data is divided into three types: device memory, device config,
> > > > and system memory dirty pages produced by device.
> > > > 
> > > > Device config: data like MMIOs, page tables...
> > > >         Every device is supposed to possess device config data.
> > > >         Usually device config's size is small (no bigger than 10 MB), and
> > > >         it needs to be loaded in a certain strict order.
> > > >         Therefore, device config only needs to be saved/loaded in
> > > >         stop-and-copy phase.
> > > >         The data of device config is held in device config region.
> > > >         Size of device config data is smaller than or equal to that of
> > > >         device config region.
> > > > 
> > > > Device Memory: device's internal memory, standalone and outside system
> > > >         memory. It is usually very big.
> > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > >         stop-and-copy phase.
> > > >         The data of device memory is held in device memory region.
> > > >         Size of device memory is usually larger than that of the device
> > > >         memory region. QEMU needs to save/load it in chunks the size of
> > > >         the device memory region.
> > > >         Not all devices have device memory; e.g., IGD only uses system
> > > >         memory.
> > > > 
> > > > System memory dirty pages: If a device produces dirty pages in system
> > > >         memory, it is able to get dirty bitmap for certain range of system
> > > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > > >         callback, dirty pages in system memory will be saved/loaded by
> > > >         RAM's live migration code.
> > > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > >         If the system memory range is larger than the dirty bitmap
> > > >         region can hold, QEMU will cut it into several chunks and get
> > > >         the dirty bitmap in succession.
> > > > 
> > > > 
> > > > Device State Regions
> > > > --------------------
> > > > Vendor driver is required to expose two mandatory regions and another two
> > > > optional regions if it plans to support device state management.
> > > > 
> > > > So, there are up to four regions in total.
> > > > One control region: mandatory.
> > > >         Get access via read/write system call.
> > > >         Its layout is defined in struct vfio_device_state_ctl
> > > > Three data regions: mmaped into qemu.
> > > >         device config region: mandatory, holding data of device config
> > > >         device memory region: optional, holding data of device memory
> > > >         dirty bitmap region: optional, holding bitmap of system memory
> > > >                             dirty pages
> > > > 
> > > > (The reason why four separate regions are defined is that the unit of mmap
> > > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > > control and three mmaped regions for data seems better than one big region
> > > > padded and sparse mmaped).
> > > > 
> > > > 
> > > > kernel device state interface [1]
> > > > --------------------------------------
> > > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > 
> > > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > > #define VFIO_DEVICE_STATE_STOP 1
> > > > #define VFIO_DEVICE_STATE_LOGGING 2
> > > > 
> > > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > > 
> > > > struct vfio_device_state_ctl {
> > > > 	__u32 version;		/* ro */
> > > > 	__u32 device_state;	/* VFIO device state, wo */
> > > > 	__u32 caps;		/* ro */
> > > > 	struct {
> > > > 		__u32 action;	/* wo, GET_BUFFER or SET_BUFFER */
> > > > 		__u64 size;	/* rw */
> > > > 	} device_config;
> > > > 	struct {
> > > > 		__u32 action;	/* wo, GET_BUFFER or SET_BUFFER */
> > > > 		__u64 size;	/* rw */
> > > > 		__u64 pos;	/* rw, offset in total buffer of device memory */
> > > > 	} device_memory;
> > > > 	struct {
> > > > 		__u64 start_addr;	/* wo */
> > > > 		__u64 page_nr;		/* wo */
> > > > 	} system_memory;
> > > > };
> > > > 
> > > > Device States
> > > > -------------
> > > > After migration is initialized, QEMU will set the device state via
> > > > writing to the device_state field of the control region.
> > > > 
> > > > Four states are defined for a VFIO device:
> > > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > > 
> > > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > > >         commands from device driver.
> > > >         It is the default state that a VFIO device enters initially.
> > > > 
> > > > STOP:  In this state, a VFIO device is deactivated and does not interact
> > > >        with the device driver.
> > > > 
> > > > LOGGING: a special state that CANNOT exist independently. It must be
> > > >        set alongside state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > > >        STOP & LOGGING).
> > > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > > >        driver can start dirty data logging for device memory and system
> > > >        memory.
> > > >        LOGGING only impacts device/system memory. Get operations return
> > > >        a whole snapshot outside LOGGING, and only dirty data since the
> > > >        last get operation inside LOGGING.
> > > >        Device config should be always accessible and return whole config
> > > >        snapshot regardless of LOGGING state.
> > > >        
> > > > Note:
> > > > The reason why RUNNING is the default state is that device's active state
> > > > must not depend on device state interface.
> > > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > > In that condition, a device needs to be in active state by default.
> > > > 
> > > > Get Version & Get Caps
> > > > ----------------------
> > > > In the migration init phase, QEMU will probe the existence of the device
> > > > state regions of the vendor driver, then get the version of the device
> > > > state interface from the r/w control region.
> > > > 
> > > > Then it will probe VFIO device's data capability by reading caps field of
> > > > control region.
> > > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > > >         device memory in pre-copy and stop-and-copy phase. The data of
> > > >         device memory is held in device memory region.
> > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > > 
> > > > If QEMU fails to find the two mandatory regions or the optional data
> > > > regions corresponding to the data caps, or if the version mismatches,
> > > > it will set up a migration blocker and disable live migration for the
> > > > VFIO device.
> > > > 
> > > > 
> > > > Flows to call device state interface for VFIO live migration
> > > > ------------------------------------------------------------
> > > > 
> > > > Live migration save path:
> > > > 
> > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > 
> > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > MIGRATION_STATUS_SAVE_SETUP
> > > >  |
> > > >  .save_setup callback -->
> > > >  get device memory size (whole snapshot size)
> > > >  get device memory buffer (whole snapshot data)
> > > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > > >  |
> > > > MIGRATION_STATUS_ACTIVE
> > > >  |
> > > >  .save_live_pending callback --> get device memory size (dirty data)
> > > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > > >  .log_sync callback --> get system memory dirty bitmap
> > > >  |
> > > > (vcpu stops) --> set device state -->
> > > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > > >  |
> > > > .save_live_complete_precopy callback -->
> > > >  get device memory size (dirty data)
> > > >  get device memory buffer (dirty data)
> > > >  get device config size (whole snapshot size)
> > > >  get device config buffer (whole snapshot data)
> > > >  |
> > > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > > MIGRATION_STATUS_COMPLETED
> > > > 
> > > > MIGRATION_STATUS_CANCELLED or
> > > > MIGRATION_STATUS_FAILED
> > > >  |
> > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > > 
> > > > 
> > > > Live migration load path:
> > > > 
> > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > 
> > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > > >  |
> > > > MIGRATION_STATUS_ACTIVE
> > > >  |
> > > > .load state callback -->
> > > >  set device memory size, set device memory buffer, set device config size,
> > > >  set device config buffer
> > > >  |
> > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > >  |
> > > > MIGRATION_STATUS_COMPLETED
> > > > 
> > > > 
> > > > 
> > > > In source VM side,
> > > > In precopy phase,
> > > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > > qemu will first get whole snapshot of device memory in .save_setup
> > > > callback, and then it will get total size of dirty data in device memory in
> > > > .save_live_pending callback by reading device_memory.size field of control
> > > > region.
> > > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > > dirty data chunk by chunk from device memory region by writing pos &
> > > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > > control region. (size of each chunk is the size of device memory data
> > > > region).
> > > > .save_live_pending and .save_live_iteration may be called several times in
> > > > precopy phase to get dirty data in device memory.
> > > > 
> > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > > vendor driver's device state interface to get data from device memory.
> > > > 
> > > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > > region by writing system memory's start address, page count and action 
> > > > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > > > "system_memory.action" fields of control region.
> > > > If the page count passed in the .log_sync callback is larger than the
> > > > bitmap size the dirty bitmap region supports, QEMU will cut it into
> > > > chunks and call the vendor driver's get-system-memory-dirty-bitmap
> > > > interface for each chunk.
> > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> > > > returns without calling the vendor driver.
> > > > 
> > > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > > In the save_live_complete_precopy callback,
> > > > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > > get device memory size and get device memory buffer will be called again.
> > > > After that, device config data is read from the device config region by
> > > > reading device_config.size of the control region and writing action
> > > > (GET_BUFFER) to device_config.action of the control region.
> > > > Then, after migration completes, the cleanup handler clears the LOGGING
> > > > state (i.e. device state is set to STOP).
> > > > Clearing LOGGING state in the cleanup handler also covers the "migration
> > > > failed" and "migration cancelled" cases, since they go through the same
> > > > cleanup handler.
> > > > 
> > > > 
> > > > References
> > > > ----------
> > > > 1. kernel side implementation of Device state interfaces:
> > > > https://patchwork.freedesktop.org/series/56876/
> > > > 
> > > > 
> > > > Yan Zhao (5):
> > > >   vfio/migration: define kernel interfaces
> > > >   vfio/migration: support device of device config capability
> > > >   vfio/migration: tracking of dirty page in system memory
> > > >   vfio/migration: turn on migration
> > > >   vfio/migration: support device memory capability
> > > > 
> > > >  hw/vfio/Makefile.objs         |   2 +-
> > > >  hw/vfio/common.c              |  26 ++
> > > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > > >  hw/vfio/pci.c                 |  10 +-
> > > >  hw/vfio/pci.h                 |  26 +-
> > > >  include/hw/vfio/vfio-common.h |   1 +
> > > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > > >  create mode 100644 hw/vfio/migration.c
> > > > 
> > > > -- 
> > > > 2.7.4
> > > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > _______________________________________________
> > > intel-gvt-dev mailing list
> > > intel-gvt-dev@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  0:24     ` [Qemu-devel] " Zhao Yan
@ 2019-02-21  1:35       ` Gonglei (Arei)
  -1 siblings, 0 replies; 133+ messages in thread
From: Gonglei (Arei) @ 2019-02-21  1:35 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev@lists.freedesktop.org



> -----Original Message-----
> From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> Sent: Thursday, February 21, 2019 8:25 AM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> kvm@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > Hi yan,
> >
> > Thanks for your work.
> >
> > I have some suggestions or questions:
> >
> > 1) Would you add MSI-X mode support? If not, please add a check in
> vfio_pci_save_config(), like Nvidia's solution.
> ok.
> 
> > 2) We should start vfio devices before vcpu resumes, so we can't rely on vm
> start change handler completely.
> VFIO devices are by default set to running state.
> In the target machine, its state transition flow is running->stop->running.

That's confusing. We should start VFIO devices after vfio_load_state; otherwise,
how can you ensure the devices' information is the same between the source side
and the destination side?

> So, maybe you can ignore the stop notification in the kernel?
> > 3) We'd better support live migration rollback since there are many failure
> scenarios;
> >  registering a migration notifier is a good choice.
> I think this patchset can also handle the failure case well.
> If migration failure or cancelling happens, the cleanup handler clears the
> LOGGING state, and the device state (running or stopped) stays as it is.

IIRC there are many failure paths that don't call the cleanup handler.

> Then,
> if the VM switches back to running, the device state will be set to running;
> if the VM stays in stopped state, the device state is also stopped (there is
> no point in leaving it in running state).
> Do you think so?
> 
If the underlying state machine is complicated,
we should tell the vendor driver about the cancelling state proactively.

> > 4) Four memory regions for live migration is too complicated IMHO.
> One big region requires the sub-regions to be well padded:
> e.g. the first control fields have to be padded to 4K,
> and the same goes for the other data fields.
> Otherwise, mmap simply fails, because the start offset and size for mmap
> both need to be page aligned.
> 
But we don't need to use mmap for the control field and device state; they
are basically small. pread/pwrite performance is sufficient.

> Also, 4 regions is clearer in my view :)
> 
> > 5) About log sync, why not register log_global_start/stop in
> vfio_memory_listener?
> >
> >
> It seems log_global_start/stop cannot be called iteratively in pre-copy phase?
> For dirty pages in system memory, it's better to transfer dirty data
> iteratively to reduce downtime, right?
> 

We just need to invoke start and stop logging once. Why do we need to call
them iteratively? See the memory_listener of vhost.

Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 1/5] vfio/migration: define kernel interfaces
  2019-02-20 17:08         ` [Qemu-devel] " Cornelia Huck
@ 2019-02-21  1:47           ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  1:47 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, zhi.a.wang, kevin.tian, dgilbert,
	alex.williamson, intel-gvt-dev, changpeng.liu, Ken.Xue,
	jonathan.davies

On Wed, Feb 20, 2019 at 06:08:13PM +0100, Cornelia Huck wrote:
> On Wed, 20 Feb 2019 02:36:36 -0500
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, Feb 19, 2019 at 02:09:18PM +0100, Cornelia Huck wrote:
> > > On Tue, 19 Feb 2019 16:52:14 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> (...)
> > > > + *          Size of device config data is smaller than or equal to that of
> > > > + *          device config region.  
> > > 
> > > Not sure if I understand that sentence correctly... but what if a
> > > device has more config state than fits into this region? Is that
> > > supposed to be covered by the device memory region? Or is this assumed
> > > to be something so exotic that we don't need to plan for it?
> > >   
> > Device config data and the device config region are both provided by the
> > vendor driver, so the vendor driver is always able to create a device
> > config region large enough to hold the device config data.
> > So, if a device has data that are better to be saved after device stop and
> > saved/loaded in strict order, the data needs to be in device config region.
> > This kind of data is supposed to be small.
> > If the device data can be saved/loaded several times, it can also be put
> > into device memory region.
> 
> So, it is the vendor driver's decision which device information should
> go via which region? With the device config data supposed to be
> saved/loaded in one go?
Right, exactly.


> (...)
> > > > +/* version number of the device state interface */
> > > > +#define VFIO_DEVICE_STATE_INTERFACE_VERSION 1  
> > > 
> > > Hm. Is this supposed to be backwards-compatible, should we need to bump
> > > this?
> > >  
> > Currently it is not backwards-compatible; we can discuss that.
> 
> It might be useful if we discover that we need some extensions. But I'm
> not sure how much work it would be.
> 
> (...)
> > > > +/*
> > > > + * DEVICE STATES
> > > > + *
> > > > + * Four states are defined for a VFIO device:
> > > > + * RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP.
> > > > + * They can be set by writing to device_state field of
> > > > + * vfio_device_state_ctl region.  
> > > 
> > > Who controls this? Userspace?  
> > 
> > Yes. Userspace notifies vendor driver to do the state switching.
> 
> Might be good to mention this (just to make it obvious).
>
Got it. thanks

> > > > + * LOGGING state is a special state that it CANNOT exist
> > > > + * independently.  
> > > 
> > > So it's not a state, but rather a modifier?
> > >   
> > Yes. Or think of LOGGING/not LOGGING as bit 1 of a device state,
> > whereas RUNNING/STOPPED is bit 0 of a device state.
> > They have to be read as a whole.
> 
> So it is (on a bit level):
> RUNNING -> 00
> STOPPED -> 01
> LOGGING/RUNNING -> 10
> LOGGING/STOPPED -> 11
> 

Yes.

> > > > + * It must be set alongside with state RUNNING or STOP, i.e,
> > > > + * RUNNING & LOGGING, STOP & LOGGING.
> > > > + * It is used for dirty data logging both for device memory
> > > > + * and system memory.
> > > > + *
> > > > + * LOGGING only impacts device/system memory. In LOGGING state, get buffer
> > > > + * of device memory returns dirty pages since last call; outside LOGGING
> > > > + * state, get buffer of device memory returns whole snapshot of device
> > > > + * memory. system memory's dirty page is only available in LOGGING state.
> > > > + *
> > > > + * Device config should be always accessible and return whole config snapshot
> > > > + * regardless of LOGGING state.
> > > > + * */
> > > > +#define VFIO_DEVICE_STATE_RUNNING 0
> > > > +#define VFIO_DEVICE_STATE_STOP 1
> > > > +#define VFIO_DEVICE_STATE_LOGGING 2
> 
> This makes it look a bit like LOGGING were an individual state, while 2
> is in reality LOGGING/RUNNING... not sure how to make that more
> obvious. Maybe (as we are dealing with a u32):
> 
> #define VFIO_DEVICE_STATE_RUNNING 0x00000000
> #define VFIO_DEVICE_STATE_STOPPED 0x00000001
> #define VFIO_DEVICE_STATE_LOGGING_RUNNING 0x00000002
> #define VFIO_DEVICE_STATE_LOGGING_STOPPED 0x00000003
> #define VFIO_DEVICE_STATE_LOGGING_MASK 0x00000002
>
Yes, yours are better, thanks:)

> > > > +
> > > > +/* action to get data from device memory or device config
> > > > + * The action is written to the device state's control region, and data
> > > > + * is read from the device memory region or device config region.
> > > > + * Each time before reading the device memory region or device config
> > > > + * region, action VFIO_DEVICE_DATA_ACTION_GET_BUFFER is required to be
> > > > + * written to the action field in the control region. That is because the
> > > > + * device memory and device config regions are mmapped into user space;
> > > > + * the vendor driver has to be notified of the GET_BUFFER action in advance.
> > > > + */
> > > > +#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > > +
> > > > +/* action to set data to device memory or device config
> > > > + * The action is written to the device state's control region, and data
> > > > + * is written to the device memory region or device config region.
> > > > + * Each time after writing to the device memory region or device config
> > > > + * region, action VFIO_DEVICE_DATA_ACTION_SET_BUFFER is required to be
> > > > + * written to the action field in the control region. That is because the
> > > > + * device memory and device config regions are mmapped into user space;
> > > > + * the vendor driver has to be notified of the SET_BUFFER action after the
> > > > + * data is written.
> > > > + */
> > > > +#define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2  
> > > 
> > > Let me describe this in my own words to make sure that I understand
> > > this correctly.
> > > 
> > > - The actions are set by userspace to notify the kernel that it is
> > >   going to get data or that it has just written data.
> > > - This is needed as a notification that the mmapped data should not be
> > >   changed resp. just has changed.  
> > We need this notification because when userspace reads the mmapped data,
> > it reads through the pointer returned from mmap(). When userspace reads
> > that pointer, there is no page fault or read/write system call, so the
> > vendor driver does not know whether a read/write operation happens or not.
> > Therefore, before userspace reads the pointer from mmap, it first writes
> > the action field in the control region (through a write system call), and
> > the vendor driver does not return from the write system call until the
> > data is prepared.
> > 
> > When userspace writes through that mmap pointer, it writes data to the
> > data region first, then writes the action field in the control region
> > (through a write system call) to notify the vendor driver. The vendor
> > driver returns from the system call after it has copied the buffer
> > completely.
> > > 
> > > So, how does the kernel know whether the read action has finished resp.
> > > whether the write action has started? Even if userspace reads/writes it
> > > as a whole.
> > >   
> > The kernel does not touch the data region except in response to the
> > "action" write system call.
> 
> Thanks for the explanation, that makes sense.
> (...)
welcome:)
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  1:35       ` [Qemu-devel] " Gonglei (Arei)
@ 2019-02-21  1:58         ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  1:58 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev@lists.freedesktop.org

On Thu, Feb 21, 2019 at 01:35:43AM +0000, Gonglei (Arei) wrote:
> 
> 
> > -----Original Message-----
> > From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> > Sent: Thursday, February 21, 2019 8:25 AM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> > intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> > kvm@vger.kernel.org
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > > Hi yan,
> > >
> > > Thanks for your work.
> > >
> > > I have some suggestions or questions:
> > >
> > > 1) Would you add MSI-X mode support? If not, please add a check in
> > vfio_pci_save_config(), like Nvidia's solution.
> > ok.
> > 
> > > 2) We should start vfio devices before the vcpu resumes, so we can't rely on
> > the vm state change handler completely.
> > vfio devices are by default set to the running state.
> > In the target machine, the state transition flow is running->stop->running.
> 
> That's confusing. We should start vfio devices after vfio_load_state, otherwise
> how can you ensure that the devices' information is the same between the
> source side and the destination side?
>
So you mean to set the device state to running in the first call to
vfio_load_state?

> > so, maybe you can ignore the stop notification in kernel?
> > > 3) We'd better support live migration rollback since there are many failure
> > scenarios,
> > >  registering a migration notifier is a good choice.
> > I think this patchset can also handle the failure cases well.
> > If a migration failure or cancellation happens,
> > in the cleanup handler the LOGGING state is cleared, and the device state
> > (running or stopped) is kept as it is.
> 
> IIRC there are many failure paths that don't call the cleanup handler.
>
Could you give an example?
> > then,
> > if the vm switches back to running, the device state will be set to running;
> > if the vm stays in the stopped state, the device state is also stopped (there
> > is no point in leaving it in the running state).
> > Do you think so?
> > 
> If the underlying state machine is complicated,
> we should tell the cancelling state to the vendor driver proactively.
> 
That makes sense.

> > > 4) Four memory regions for live migration is too complicated IMHO.
> > One big region requires the sub-regions to be well padded.
> > For example, the leading control fields have to be padded to 4K,
> > and the same goes for the other data fields.
> > Otherwise, mmap simply fails, because the start offset and size for mmap
> > both need to be PAGE aligned.
> > 
> But we don't need to use mmap for the control fields and device state; they are basically small.
> pread/pwrite performance is enough for them. 
> 
We don't mmap the control fields. But if the data fields come immediately
after the control fields (e.g. at offset 64), we can't mmap the data fields
successfully, because their start offset is 64. Therefore the control fields
have to be padded to 4K so that the data fields start at 4K.
That's the drawback of one big region holding both control and data fields.

> > Also, 4 regions is clearer in my view :)
> > 
> > > 5) About log sync, why not register log_global_start/stop in
> > vfio_memory_listener?
> > >
> > >
> > It seems log_global_start/stop cannot be called iteratively in the pre-copy
> > phase? For the dirty pages in system memory, it's better to transfer the
> > dirty data iteratively to reduce downtime, right?
> > 
> 
> We just need to invoke start and stop logging once. Why do we need to call
> them iteratively? See the memory_listener of vhost.
> 



> Regards,
> -Gonglei

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  1:35       ` [Qemu-devel] " Gonglei (Arei)
@ 2019-02-21  2:04         ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  2:04 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev@lists.freedesktop.org

> > 
> > > 5) About log sync, why not register log_global_start/stop in
> > vfio_memory_listener?
> > >
> > >
> > It seems log_global_start/stop cannot be called iteratively in the pre-copy
> > phase? For the dirty pages in system memory, it's better to transfer the
> > dirty data iteratively to reduce downtime, right?
> > 
> 
> We just need to invoke start and stop logging once. Why do we need to call
> them iteratively? See the memory_listener of vhost.
>
The dirty pages in system memory produced by the device are incremental.
If they can be fetched iteratively, the dirty pages left for the
stop-and-copy phase can be minimal.
:)

> Regards,
> -Gonglei

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  2:04         ` [Qemu-devel] " Zhao Yan
@ 2019-02-21  3:16           ` Gonglei (Arei)
  -1 siblings, 0 replies; 133+ messages in thread
From: Gonglei (Arei) @ 2019-02-21  3:16 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev@lists.freedesktop.org




> -----Original Message-----
> From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> Sent: Thursday, February 21, 2019 10:05 AM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> kvm@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> > >
> > > > 5) About log sync, why not register log_global_start/stop in
> > > vfio_memory_listener?
> > > >
> > > >
> > > It seems log_global_start/stop cannot be called iteratively in the pre-copy
> > > phase? For the dirty pages in system memory, it's better to transfer the
> > > dirty data iteratively to reduce downtime, right?
> > >
> >
> > We just need to invoke start and stop logging once. Why do we need to
> call
> > them iteratively? See the memory_listener of vhost.
> >
> The dirty pages in system memory produced by the device are incremental.
> If they can be fetched iteratively, the dirty pages left for the
> stop-and-copy phase can be minimal.
> :)
> 
I mean starting or stopping the logging capability, not log sync.

We register the below callbacks:

.log_sync = vfio_log_sync,
.log_global_start = vfio_log_global_start,
.log_global_stop = vfio_log_global_stop,

Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  1:58         ` [Qemu-devel] " Zhao Yan
@ 2019-02-21  3:33           ` Gonglei (Arei)
  -1 siblings, 0 replies; 133+ messages in thread
From: Gonglei (Arei) @ 2019-02-21  3:33 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev@lists.freedesktop.org


> -----Original Message-----
> From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> Sent: Thursday, February 21, 2019 9:59 AM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> kvm@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Thu, Feb 21, 2019 at 01:35:43AM +0000, Gonglei (Arei) wrote:
> >
> >
> > > -----Original Message-----
> > > From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> > > Sent: Thursday, February 21, 2019 8:25 AM
> > > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > > Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
> dgilbert@redhat.com;
> > > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > > jonathan.davies@nutanix.com; changpeng.liu@intel.com;
> Ken.Xue@amd.com;
> > > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> > > kvm@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > > > Hi yan,
> > > >
> > > > Thanks for your work.
> > > >
> > > > I have some suggestions or questions:
> > > >
> > > > 1) Would you add msix mode support,? if not, pls add a check in
> > > vfio_pci_save_config(), likes Nvidia's solution.
> > > ok.
> > >
> > > > 2) We should start vfio devices before vcpu resumes, so we can't rely on
> vm
> > > start change handler completely.
> > > vfio devices is by default set to running state.
> > > In the target machine, its state transition flow is running->stop->running.
> >
> > That's confusing. We should start vfio devices after vfio_load_state,
> otherwise
> > how can you keep the devices' information are the same between source side
> > and destination side?
> >
> so, your meaning is to set device state to running in the first call to
> vfio_load_state?
> 
No, it should start the devices after vfio_load_state and before the vcpu resumes.

> > > so, maybe you can ignore the stop notification in kernel?
> > > > 3) We'd better support live migration rollback since have many failure
> > > scenarios,
> > > >  register a migration notifier is a good choice.
> > > I think this patchset can also handle the failure case well.
> > > if migration failure or cancelling happens,
> > > in cleanup handler, LOGGING state is cleared. device state(running or
> > > stopped) keeps as it is).
> >
> > IIRC there're many failure paths don't calling cleanup handler.
> >
> could you take an example?

Never mind, that's another bug I think. 

> > > then,
> > > if vm switches back to running, device state will be set to running;
> > > if vm stayes at stopped state, device state is also stopped (it has no
> > > meaning to let it in running state).
> > > Do you think so ?
> > >
> > IF the underlying state machine is complicated,
> > We should tell the canceling state to vendor driver proactively.
> >
> That makes sense.
> 
> > > > 4) Four memory region for live migration is too complicated IMHO.
> > > one big region requires the sub-regions well padded.
> > > like for the first control fields, they have to be padded to 4K.
> > > the same for other data fields.
> > > Otherwise, mmap simply fails, because the start-offset and size for mmap
> > > both need to be PAGE aligned.
> > >
> > But if we don't need use mmap for control filed and device state, they are
> small basically.
> > The performance is enough using pread/pwrite.
> >
> we don't mmap control fields. but if data fields going immedately after
> control fields (e.g. just 64 bytes), we can't mmap data fields
> successfully because its start offset is 64. Therefore control fields have
> to be padded to 4k to let data fields start from 4k.
> That's the drawback of one big region holding both control and data fields.
> 
> > > Also, 4 regions is clearer in my view :)
> > >
> > > > 5) About log sync, why not register log_global_start/stop in
> > > vfio_memory_listener?
> > > >
> > > >
> > > seems log_global_start/stop cannot be iterately called in pre-copy phase?
> > > for dirty pages in system memory, it's better to transfer dirty data
> > > iteratively to reduce down time, right?
> > >
> >
> > We just need invoking only once for start and stop logging. Why we need to
> call
> > them literately? See memory_listener of vhost.
> >
> 
> 
> 
> > Regards,
> > -Gonglei

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  3:33           ` [Qemu-devel] " Gonglei (Arei)
@ 2019-02-21  4:08             ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  4:08 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev, changpeng.liu, cohuck, zhi.a.wang,
	jonathan.davies

On Thu, Feb 21, 2019 at 03:33:24AM +0000, Gonglei (Arei) wrote:
> 
> > -----Original Message-----
> > From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> > Sent: Thursday, February 21, 2019 9:59 AM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> > intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> > kvm@vger.kernel.org
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > On Thu, Feb 21, 2019 at 01:35:43AM +0000, Gonglei (Arei) wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> > > > Sent: Thursday, February 21, 2019 8:25 AM
> > > > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > > > Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> > > > intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > > > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > > > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
> > dgilbert@redhat.com;
> > > > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > > > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > > > jonathan.davies@nutanix.com; changpeng.liu@intel.com;
> > Ken.Xue@amd.com;
> > > > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> > > > kvm@vger.kernel.org
> > > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > > >
> > > > On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > > > > Hi yan,
> > > > >
> > > > > Thanks for your work.
> > > > >
> > > > > I have some suggestions or questions:
> > > > >
> > > > > 1) Would you add MSI-X mode support? If not, please add a check in
> > > > vfio_pci_save_config(), like Nvidia's solution.
> > > > ok.
> > > >
> > > > > 2) We should start vfio devices before the vcpu resumes, so we can't
> > > > > rely on the vm start change handler completely.
> > > > vfio devices are by default set to the running state.
> > > > In the target machine, the state transition flow is running->stop->running.
> > >
> > > That's confusing. We should start vfio devices after vfio_load_state;
> > > otherwise, how can you ensure the devices' information is the same between
> > > the source side and the destination side?
> > >
> > so, do you mean to set the device state to running in the first call to
> > vfio_load_state?
> > 
No, the devices should be started after vfio_load_state and before the vcpus resume.
>

What about setting the device state to running in the load_cleanup handler?

> > > > so, maybe you can ignore the stop notification in kernel?
> > > > > 3) We'd better support live migration rollback since there are many
> > > > > failure scenarios;
> > > > > registering a migration notifier is a good choice.
> > > > I think this patchset can also handle the failure case well.
> > > > if a migration failure or cancellation happens, the LOGGING state is
> > > > cleared in the cleanup handler; the device state (running or
> > > > stopped) is kept as it is.
> > >
> > > IIRC there are many failure paths that don't call the cleanup handler.
> > >
> > could you give an example?
> 
> Never mind, that's another bug, I think.
> 
> > > > then,
> > > > if the vm switches back to running, the device state will be set to running;
> > > > if the vm stays in the stopped state, the device state is also stopped
> > > > (there is no point in leaving it in the running state).
> > > > Do you think so?
> > > >
> > > If the underlying state machine is complicated,
> > > we should tell the cancelling state to the vendor driver proactively.
> > >
> > That makes sense.
> > 
> > > > > 4) Four memory regions for live migration are too complicated IMHO.
> > > > one big region requires the sub-regions to be well padded.
> > > > e.g. the control fields at the start have to be padded to 4K,
> > > > and the same for the other data fields.
> > > > Otherwise, mmap simply fails, because the start-offset and size for mmap
> > > > both need to be PAGE aligned.
> > > >
> > > But we don't need to use mmap for the control field and device state; they
> > > are basically small.
> > > The performance of pread/pwrite is enough.
> > >
> > we don't mmap the control fields. But if the data fields come immediately
> > after the control fields (e.g. at just 64 bytes), we can't mmap the data
> > fields successfully because their start offset is 64. Therefore the control
> > fields have to be padded to 4k so that the data fields start at 4k.
> > That's the drawback of one big region holding both control and data fields.
> > 
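[The padding requirement described above can be checked with a small sketch. The helper names `align_up` and `mmappable` are illustrative, not QEMU or kernel API, and a 4 KiB page size is assumed:]

```python
PAGE_SIZE = 4096  # assumed page size

def align_up(offset, alignment=PAGE_SIZE):
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)

def mmappable(offset, size, page=PAGE_SIZE):
    """mmap requires both the start offset and the size to be page aligned."""
    return offset % page == 0 and size % page == 0

# 64 bytes of control fields placed directly before the data fields:
assert not mmappable(64, 1 << 20)          # offset 64 is not page aligned
# padding the control fields to 4 KiB makes the data fields mappable:
assert mmappable(align_up(64), 1 << 20)    # data now starts at offset 4096
```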
> > > > Also, 4 regions is clearer in my view :)
> > > >
> > > > > 5) About log sync, why not register log_global_start/stop in
> > > > vfio_memory_listener?
> > > > >
> > > > >
> > > > seems log_global_start/stop cannot be called iteratively in the pre-copy phase?
> > > > for dirty pages in system memory, it's better to transfer the dirty data
> > > > iteratively to reduce downtime, right?
> > > >
> > >
> > > We just need to invoke start and stop logging once. Why do we need to
> > > call them iteratively? See the memory_listener of vhost.
> > >
> > 
> > 
> > 
> > > Regards,
> > > -Gonglei
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  3:16           ` [Qemu-devel] " Gonglei (Arei)
@ 2019-02-21  4:21             ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-21  4:21 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev, changpeng.liu, cohuck, zhi.a.wang,
	jonathan.davies

On Thu, Feb 21, 2019 at 03:16:45AM +0000, Gonglei (Arei) wrote:
> 
> 
> 
> > -----Original Message-----
> > From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> > Sent: Thursday, February 21, 2019 10:05 AM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> > intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com; dgilbert@redhat.com;
> > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > jonathan.davies@nutanix.com; changpeng.liu@intel.com; Ken.Xue@amd.com;
> > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> > kvm@vger.kernel.org
> > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > 
> > > >
> > > > > 5) About log sync, why not register log_global_start/stop in
> > > > vfio_memory_listener?
> > > > >
> > > > >
> > > > seems log_global_start/stop cannot be called iteratively in the pre-copy phase?
> > > > for dirty pages in system memory, it's better to transfer the dirty data
> > > > iteratively to reduce downtime, right?
> > > >
> > >
> > > We just need to invoke start and stop logging once. Why do we need to
> > > call them iteratively? See the memory_listener of vhost.
> > >
> > the dirty pages in system memory produced by the device accumulate incrementally.
> > if they can be collected iteratively, the dirty pages left for the
> > stop-and-copy phase can be minimal.
> > :)
> > 
> I mean starting or stopping the logging capability, not log sync.
> 
> We register the below callbacks:
> 
> .log_sync = vfio_log_sync,
> .log_global_start = vfio_log_global_start,
> .log_global_stop = vfio_log_global_stop,
>
.log_global_start is also a good point at which to notify the logging state.
But if we notify in the .save_setup handler, we can do fine-grained control
of when to send the logging-start notification together with the get_buffer
operation.
Is there any special benefit to registering .log_global_start/stop?
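[A toy model of the trade-off under discussion: log_global_start/stop toggle dirty tracking once, while log_sync is invoked repeatedly during pre-copy to drain the incremental dirty set. The class below is an illustrative sketch in Python; the method names mirror QEMU's MemoryListener callbacks but this is not the real (C) API:]

```python
class DirtyLogModel:
    """Minimal model of dirty-page logging for pre-copy migration."""

    def __init__(self):
        self.logging = False
        self.dirty = set()

    def log_global_start(self):
        # invoked once when migration begins: turn dirty tracking on
        self.logging = True

    def log_global_stop(self):
        # invoked once on completion or cancel: turn dirty tracking off
        self.logging = False

    def touch(self, page):
        # the device writes a system-memory page; record it only while logging
        if self.logging:
            self.dirty.add(page)

    def log_sync(self):
        # invoked iteratively in the pre-copy phase: drain and return the
        # pages dirtied since the previous sync
        batch, self.dirty = self.dirty, set()
        return batch
```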


> Regards,
> -Gonglei

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  4:08             ` [Qemu-devel] " Zhao Yan
@ 2019-02-21  5:46               ` Gonglei (Arei)
  -1 siblings, 0 replies; 133+ messages in thread
From: Gonglei (Arei) @ 2019-02-21  5:46 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev, changpeng.liu, cohuck, zhi.a.wang,
	jonathan.davies







> -----Original Message-----
> From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> Sent: Thursday, February 21, 2019 12:08 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: cjia@nvidia.com; kvm@vger.kernel.org; aik@ozlabs.ru;
> Zhengxiao.zx@Alibaba-inc.com; shuangtai.tst@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankhede@nvidia.com; eauger@redhat.com;
> yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> mlevitsk@redhat.com; pasic@linux.ibm.com; felipe@nutanix.com;
> Ken.Xue@amd.com; kevin.tian@intel.com; dgilbert@redhat.com;
> alex.williamson@redhat.com; intel-gvt-dev@lists.freedesktop.org;
> changpeng.liu@intel.com; cohuck@redhat.com; zhi.a.wang@intel.com;
> jonathan.davies@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Thu, Feb 21, 2019 at 03:33:24AM +0000, Gonglei (Arei) wrote:
> >
> > > -----Original Message-----
> > > From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> > > Sent: Thursday, February 21, 2019 9:59 AM
> > > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > > Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
> dgilbert@redhat.com;
> > > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > > jonathan.davies@nutanix.com; changpeng.liu@intel.com;
> Ken.Xue@amd.com;
> > > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> > > kvm@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > On Thu, Feb 21, 2019 at 01:35:43AM +0000, Gonglei (Arei) wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> > > > > Sent: Thursday, February 21, 2019 8:25 AM
> > > > > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > > > > Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> > > > > intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > > > > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > > > > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
> > > dgilbert@redhat.com;
> > > > > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > > > > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > > > > jonathan.davies@nutanix.com; changpeng.liu@intel.com;
> > > Ken.Xue@amd.com;
> > > > > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> > > > > kvm@vger.kernel.org
> > > > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > > > >
> > > > > On Wed, Feb 20, 2019 at 11:56:01AM +0000, Gonglei (Arei) wrote:
> > > > > > Hi yan,
> > > > > >
> > > > > > Thanks for your work.
> > > > > >
> > > > > > I have some suggestions or questions:
> > > > > >
> > > > > > 1) Would you add MSI-X mode support? If not, please add a check in
> > > > > vfio_pci_save_config(), like Nvidia's solution.
> > > > > ok.
> > > > >
> > > > > > 2) We should start vfio devices before the vcpu resumes, so we can't
> > > > > > rely on the vm start change handler completely.
> > > > > vfio devices are by default set to the running state.
> > > > > In the target machine, the state transition flow is
> > > > > running->stop->running.
> > > >
> > > > That's confusing. We should start vfio devices after vfio_load_state;
> > > > otherwise, how can you ensure the devices' information is the same
> > > > between the source side and the destination side?
> > > >
> > > so, do you mean to set the device state to running in the first call to
> > > vfio_load_state?
> > >
> > No, the devices should be started after vfio_load_state and before the vcpus resume.
> >
> 
> What about setting the device state to running in the load_cleanup handler?
> 

The timing is fine, but you should also think about whether the device state
should be set to running in the failure branches when calling the load_cleanup handler.
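[The destination-side ordering being agreed on here can be sketched as: restart the device only after vfio_load_state succeeds and before the vCPUs resume, and leave it stopped on a load failure. A Python sketch with hypothetical function names (the real handlers are C code in QEMU):]

```python
def resume_destination(load_device_state, start_device, resume_vcpus):
    """Destination-side resume sequence after migration data is received.

    load_device_state, start_device, and resume_vcpus are callables standing
    in for the vfio_load_state handler, the device-start ioctl, and the vCPU
    resume path; they are illustrative, not QEMU APIs.
    """
    try:
        load_device_state()
    except RuntimeError:
        return "stopped"    # failure branch: do not set the device to running
    start_device()          # after vfio_load_state, before vCPU resume
    resume_vcpus()
    return "running"
```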

Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  4:21             ` [Qemu-devel] " Zhao Yan
@ 2019-02-21  5:56               ` Gonglei (Arei)
  -1 siblings, 0 replies; 133+ messages in thread
From: Gonglei (Arei) @ 2019-02-21  5:56 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, dgilbert, alex.williamson,
	intel-gvt-dev, changpeng.liu, cohuck, zhi.a.wang,
	jonathan.davies

> >
> > > -----Original Message-----
> > > From: Zhao Yan [mailto:yan.y.zhao@intel.com]
> > > Sent: Thursday, February 21, 2019 10:05 AM
> > > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > > Cc: alex.williamson@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-dev@lists.freedesktop.org; Zhengxiao.zx@Alibaba-inc.com;
> > > yi.l.liu@intel.com; eskultet@redhat.com; ziye.yang@intel.com;
> > > cohuck@redhat.com; shuangtai.tst@alibaba-inc.com;
> dgilbert@redhat.com;
> > > zhi.a.wang@intel.com; mlevitsk@redhat.com; pasic@linux.ibm.com;
> > > aik@ozlabs.ru; eauger@redhat.com; felipe@nutanix.com;
> > > jonathan.davies@nutanix.com; changpeng.liu@intel.com;
> Ken.Xue@amd.com;
> > > kwankhede@nvidia.com; kevin.tian@intel.com; cjia@nvidia.com;
> > > kvm@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > > >
> > > > > > 5) About log sync, why not register log_global_start/stop in
> > > > > vfio_memory_listener?
> > > > > >
> > > > > >
> > > > > seems log_global_start/stop cannot be called iteratively in the
> > > > > pre-copy phase?
> > > > > for dirty pages in system memory, it's better to transfer the dirty data
> > > > > iteratively to reduce downtime, right?
> > > > >
> > > >
> > > > We just need to invoke start and stop logging once. Why do we need to
> > > > call them iteratively? See the memory_listener of vhost.
> > > >
> > > The dirty pages produced by the device in system memory accumulate over time.
> > > If they can be fetched iteratively, the dirty pages left for the stop-and-copy
> > > phase can be minimal.
> > > :)
> > >
> > I mean starting or stopping the capability of logging, not log sync.
> >
> > We register the below callbacks:
> >
> > .log_sync = vfio_log_sync,
> > .log_global_start = vfio_log_global_start,
> > .log_global_stop = vfio_log_global_stop,
> >
> .log_global_start is also a good point at which to notify the logging state.
> But by notifying in the .save_setup handler, we can do fine-grained control
> of when to notify of logging starting, together with the get_buffer
> operation.
> Is there any special benefit to registering .log_global_start/stop?
> 

There is a performance benefit when one VM has multiple identical VFIO devices.


Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21  0:31         ` [Qemu-devel] " Zhao Yan
@ 2019-02-21  9:15           ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-21  9:15 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck,
	zhi.a.wang, jonathan.davies

* Zhao Yan (yan.y.zhao@intel.com) wrote:
> On Wed, Feb 20, 2019 at 11:01:43AM +0000, Dr. David Alan Gilbert wrote:
> > * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > > This patchset enables VFIO devices to have live migration capability.
> > > > > Currently it does not support post-copy phase.
> > > > > 
> > > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > > query.
> > > > 
> > > > Hi,
> > > >   I've sent minor comments to later patches; but some minor general
> > > > comments:
> > > > 
> > > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > > >     so check when you can.
> > > hi Dave
> > > Thanks for this suggestion. I'll add more checks for migration streams.
> > > 
> > > 
> > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > version of device?  Or say to a device with older firmware or perhaps
> > > > a device that has less device memory ?
> > > Actually it's still an open question for VFIO migration. We need to think
> > > about whether it's better to check that in libvirt or QEMU (e.g. a device
> > > magic along with a version?).
> > > This patchset is intended to settle down the main device state interfaces
> > > for VFIO migration. So that we can work on that and improve it.
> > > 
> > > 
> > > >   c) Consider using the trace_ mechanism - it's really useful to
> > > > add to loops writing/reading data so that you can see when it fails.
> > > > 
> > > > Dave
> > > >
> > > Got it. many thanks~~
> > > 
> > > 
> > > > (P.S. You have a few typos; grep your code for 'devcie', 'devie' and
> > > > 'migrtion'.)
> > > 
> > > sorry :)
> > 
> > No problem.
> > 
> > Given the mails, I'm guessing you've mostly tested this on graphics
> > devices?  Have you also checked with VFIO network cards?
> > 
> Yes, I tested it on Intel's graphics devices, which do not have device
> memory, so the device-memory cap is off.
> I believe this patchset can work well on VFIO network cards as well,
> because Gonglei once said their NIC worked well on our previous code
> (i.e. with the device-memory cap off).

It would be great if you could find some Intel NIC people to help test
it out.

> 
> > Also see the mail I sent in reply to Kirti's series; we need to boil
> > these down to one solution.
> >
> Maybe Kirti can merge their implementation into the code for the
> device-memory cap (like in my patch 5 for device memory).

It would be great to come up with one patchset between yourself and
Kirti that was tested for Intel and Nvidia GPUs and Intel NICs
(and anyone else who wants to jump on board!).

Dave

> > Dave
> > 
> > > > 
> > > > > Device Data
> > > > > -----------
> > > > > Device data is divided into three types: device memory, device config,
> > > > > and system memory dirty pages produced by device.
> > > > > 
> > > > > Device config: data like MMIOs, page tables...
> > > > >         Every device is supposed to possess device config data.
> > > > >         Usually device config's size is small (no bigger than 10M), and it
> > > > >         needs to be loaded in certain strict order.
> > > > >         Therefore, device config only needs to be saved/loaded in
> > > > >         stop-and-copy phase.
> > > > >         The data of device config is held in device config region.
> > > > >         Size of device config data is smaller than or equal to that of
> > > > >         device config region.
> > > > > 
> > > > > Device Memory: device's internal memory, standalone and outside system
> > > > >         memory. It is usually very big.
> > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > >         stop-and-copy phase.
> > > > >         The data of device memory is held in device memory region.
> > > > >         Size of device memory is usually larger than that of device
> > > > >         memory region. qemu needs to save/load it in chunks of size of
> > > > >         device memory region.
> > > > >         Not all devices have device memory; IGD, for example, only uses system memory.
> > > > > 
> > > > > System memory dirty pages: If a device produces dirty pages in system
> > > > >         memory, it is able to get dirty bitmap for certain range of system
> > > > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > > > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > > > >         callback, dirty pages in system memory will be save/loaded by ram's
> > > > >         live migration code.
> > > > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > > >         If system memory range is larger than that dirty bitmap region can
> > > > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > > > >         succession.
> > > > > 
> > > > > 
> > > > > Device State Regions
> > > > > --------------------
> > > > > Vendor driver is required to expose two mandatory regions and another two
> > > > > optional regions if it plans to support device state management.
> > > > > 
> > > > > So, there are up to four regions in total.
> > > > > One control region: mandatory.
> > > > >         Get access via read/write system call.
> > > > >         Its layout is defined in struct vfio_device_state_ctl
> > > > > Three data regions: mmaped into qemu.
> > > > >         device config region: mandatory, holding data of device config
> > > > >         device memory region: optional, holding data of device memory
> > > > >         dirty bitmap region: optional, holding bitmap of system memory
> > > > >                             dirty pages
> > > > > 
> > > > > (The reason why four separate regions are defined is that the unit of mmap
> > > > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > > > control and three mmaped regions for data seems better than one big region
> > > > > padded and sparse mmaped).
> > > > > 
> > > > > 
> > > > > kernel device state interface [1]
> > > > > --------------------------------------
> > > > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > > 
> > > > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > > > #define VFIO_DEVICE_STATE_STOP 1
> > > > > #define VFIO_DEVICE_STATE_LOGGING 2
> > > > > 
> > > > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > > > 
> > > > > struct vfio_device_state_ctl {
> > > > > 	__u32 version;		  /* ro */
> > > > > 	__u32 device_state;       /* VFIO device state, wo */
> > > > > 	__u32 caps;		 /* ro */
> > > > >         struct {
> > > > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > > > 		__u64 size;    /*rw*/
> > > > > 	} device_config;
> > > > > 	struct {
> > > > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > > > 		__u64 size;     /* rw */  
> > > > >                 __u64 pos; /*the offset in total buffer of device memory*/
> > > > > 	} device_memory;
> > > > > 	struct {
> > > > > 		__u64 start_addr; /* wo */
> > > > > 		__u64 page_nr;   /* wo */
> > > > > 	} system_memory;
> > > > > };
> > > > > 
> > > > > Device States
> > > > > -------------
> > > > > After migration is initialized, QEMU will set the device state by writing to
> > > > > device_state field of control region.
> > > > > 
> > > > > Four states are defined for a VFIO device:
> > > > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > > > 
> > > > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > > > >         commands from device driver.
> > > > >         It is the default state that a VFIO device enters initially.
> > > > > 
> > > > > STOP:  In this state, a VFIO device is deactivated and stops interacting
> > > > >        with the device driver.
> > > > > 
> > > > > LOGGING: a special state that CANNOT exist independently. It must be
> > > > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > > > >        STOP & LOGGING).
> > > > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > > > >        driver can start dirty data logging for device memory and system
> > > > >        memory.
> > > > >        LOGGING only impacts device/system memory: outside LOGGING they
> > > > >        return a whole snapshot, and inside LOGGING they return the dirty
> > > > >        data since the last get operation.
> > > > >        Device config should be always accessible and return whole config
> > > > >        snapshot regardless of LOGGING state.
> > > > >        
> > > > > Note:
> > > > > The reason why RUNNING is the default state is that device's active state
> > > > > must not depend on device state interface.
> > > > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > > > In that condition, a device needs to be in the active state by default.
> > > > > 
> > > > > Get Version & Get Caps
> > > > > ----------------------
> > > > > On migration init phase, qemu will probe the existence of device state
> > > > > regions of vendor driver, then get version of the device state interface
> > > > > from the r/w control region.
> > > > > 
> > > > > Then it will probe VFIO device's data capability by reading caps field of
> > > > > control region.
> > > > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > > > >         device memory in pre-copy and stop-and-copy phase. The data of
> > > > >         device memory is held in device memory region.
> > > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > > > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > > > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > > > 
> > > > > If QEMU fails to find the two mandatory regions or the optional data
> > > > > regions corresponding to the data caps, or if the version mismatches,
> > > > > it will set up a migration blocker and disable live migration for the
> > > > > VFIO device.
> > > > > 
> > > > > 
> > > > > Flows to call device state interface for VFIO live migration
> > > > > ------------------------------------------------------------
> > > > > 
> > > > > Live migration save path:
> > > > > 
> > > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > > 
> > > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > > >  |
> > > > > MIGRATION_STATUS_SAVE_SETUP
> > > > >  |
> > > > >  .save_setup callback -->
> > > > >  get device memory size (whole snapshot size)
> > > > >  get device memory buffer (whole snapshot data)
> > > > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > > > >  |
> > > > > MIGRATION_STATUS_ACTIVE
> > > > >  |
> > > > >  .save_live_pending callback --> get device memory size (dirty data)
> > > > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > > > >  .log_sync callback --> get system memory dirty bitmap
> > > > >  |
> > > > > (vcpu stops) --> set device state -->
> > > > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > > > >  |
> > > > > .save_live_complete_precopy callback -->
> > > > >  get device memory size (dirty data)
> > > > >  get device memory buffer (dirty data)
> > > > >  get device config size (whole snapshot size)
> > > > >  get device config buffer (whole snapshot data)
> > > > >  |
> > > > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > > > MIGRATION_STATUS_COMPLETED
> > > > > 
> > > > > MIGRATION_STATUS_CANCELLED or
> > > > > MIGRATION_STATUS_FAILED
> > > > >  |
> > > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > > > 
> > > > > 
> > > > > Live migration load path:
> > > > > 
> > > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > > 
> > > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > > >  |
> > > > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > > > >  |
> > > > > MIGRATION_STATUS_ACTIVE
> > > > >  |
> > > > > .load state callback -->
> > > > >  set device memory size, set device memory buffer, set device config size,
> > > > >  set device config buffer
> > > > >  |
> > > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > > >  |
> > > > > MIGRATION_STATUS_COMPLETED
> > > > > 
> > > > > 
> > > > > 
> > > > > In source VM side,
> > > > > In precopy phase,
> > > > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > > > qemu will first get whole snapshot of device memory in .save_setup
> > > > > callback, and then it will get total size of dirty data in device memory in
> > > > > .save_live_pending callback by reading device_memory.size field of control
> > > > > region.
> > > > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > > > dirty data chunk by chunk from device memory region by writing pos &
> > > > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > > > control region. (size of each chunk is the size of device memory data
> > > > > region).
> > > > > .save_live_pending and .save_live_iteration may be called several times in
> > > > > precopy phase to get dirty data in device memory.
> > > > > 
> > > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > > > vendor driver's device state interface to get data from device memory.
> > > > > 
> > > > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > > > region by writing system memory's start address, page count and action 
> > > > > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > > > > "system_memory.action" fields of control region.
> > > > > If page count passed in .log_sync callback is larger than the bitmap size
> > > > > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > > > > vendor driver's get system memory dirty bitmap interface.
> > > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> > > > > returns without call to vendor driver.
> > > > > 
> > > > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > > > in save_live_complete_precopy callback,
> > > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > > > get device memory size and get device memory buffer will be called again.
> > > > > After that,
> > > > > device config data is fetched from the device config region by reading
> > > > > device_config.size of the control region and writing action (GET_BUFFER)
> > > > > to device_config.action of the control region.
> > > > > Then after migration completes, in the cleanup handler, the LOGGING state
> > > > > will be cleared (i.e. device state is set to STOP).
> > > > > Clearing LOGGING state in cleanup handler is in consideration of the case
> > > > > of "migration failed" and "migration cancelled". They can also leverage
> > > > > the cleanup handler to unset LOGGING state.
> > > > > 
> > > > > 
> > > > > References
> > > > > ----------
> > > > > 1. kernel side implementation of Device state interfaces:
> > > > > https://patchwork.freedesktop.org/series/56876/
> > > > > 
> > > > > 
> > > > > Yan Zhao (5):
> > > > >   vfio/migration: define kernel interfaces
> > > > >   vfio/migration: support device of device config capability
> > > > >   vfio/migration: tracking of dirty page in system memory
> > > > >   vfio/migration: turn on migration
> > > > >   vfio/migration: support device memory capability
> > > > > 
> > > > >  hw/vfio/Makefile.objs         |   2 +-
> > > > >  hw/vfio/common.c              |  26 ++
> > > > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > > > >  hw/vfio/pci.c                 |  10 +-
> > > > >  hw/vfio/pci.h                 |  26 +-
> > > > >  include/hw/vfio/vfio-common.h |   1 +
> > > > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > > > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > > > >  create mode 100644 hw/vfio/migration.c
> > > > > 
> > > > > -- 
> > > > > 2.7.4
> > > > > 
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > _______________________________________________
> > > > intel-gvt-dev mailing list
> > > > intel-gvt-dev@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > _______________________________________________
> > intel-gvt-dev mailing list
> > intel-gvt-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-02-21  9:15           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-02-21  9:15 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, zhi.a.wang, kevin.tian,
	alex.williamson, intel-gvt-dev, changpeng.liu, cohuck, Ken.Xue,
	jonathan.davies

* Zhao Yan (yan.y.zhao@intel.com) wrote:
> On Wed, Feb 20, 2019 at 11:01:43AM +0000, Dr. David Alan Gilbert wrote:
> > * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > > On Tue, Feb 19, 2019 at 11:32:13AM +0000, Dr. David Alan Gilbert wrote:
> > > > * Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > > > This patchset enables VFIO devices to have live migration capability.
> > > > > Currently it does not support post-copy phase.
> > > > > 
> > > > > It follows Alex's comments on last version of VFIO live migration patches,
> > > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > > query.
> > > > 
> > > > Hi,
> > > >   I've sent minor comments to later patches; but some minor general
> > > > comments:
> > > > 
> > > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > > >     so check when you can.
> > > hi Dave
> > > Thanks for this suggestion. I'll add more checks for migration streams.
> > > 
> > > 
> > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > version of device?  Or say to a device with older firmware or perhaps
> > > > a device that has less device memory ?
> > > Actually it's still an open for VFIO migration. Need to think about
> > > whether it's better to check that in libvirt or qemu (like a device magic
> > > along with verion ?).
> > > This patchset is intended to settle down the main device state interfaces
> > > for VFIO migration. So that we can work on that and improve it.
> > > 
> > > 
> > > >   c) Consider using the trace_ mechanism - it's really useful to
> > > > add to loops writing/reading data so that you can see when it fails.
> > > > 
> > > > Dave
> > > >
> > > Got it. many thanks~~
> > > 
> > > 
> > > > (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> > > > 'migrtion'
> > > 
> > > sorry :)
> > 
> > No problem.
> > 
> > Given the mails, I'm guessing you've mostly tested this on graphics
> > devices?  Have you also checked with VFIO network cards?
> > 
> yes, I tested it on Intel's graphics devices which do not have device
> memory. so the cap of device-memory is off.
> I believe this patchset can work well on VFIO network cards as well,
> because Gonglei once said their NIC can work well on our previous code
> (i.e. device-memory cap off).

It would be great if you could find some Intel NIC people to help test
it out.

> 
> > Also see the mail I sent in reply to Kirti's series; we need to boil
> > these down to one solution.
> >
> Maybe Kirti can merge their implementaion into the code for device-memory
> cap (like in my patch 5 for device-memory).

It would be great to come up with one patchset between yourself and
Kirti that was tested for Intel and Nvidia GPUs and Intel NICs
(and anyone else who wants to jump on board!).

Dave

> > Dave
> > 
> > > > 
> > > > > Device Data
> > > > > -----------
> > > > > Device data is divided into three types: device memory, device config,
> > > > > and system memory dirty pages produced by device.
> > > > > 
> > > > > Device config: data like MMIOs, page tables...
> > > > >         Every device is supposed to possess device config data.
> > > > >     	Usually device config's size is small (no big than 10M), and it
> > > > >         needs to be loaded in certain strict order.
> > > > >         Therefore, device config only needs to be saved/loaded in
> > > > >         stop-and-copy phase.
> > > > >         The data of device config is held in device config region.
> > > > >         Size of device config data is smaller than or equal to that of
> > > > >         device config region.
> > > > > 
> > > > > Device Memory: device's internal memory, standalone and outside system
> > > > >         memory. It is usually very big.
> > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > >         stop-and-copy phase.
> > > > >         The data of device memory is held in device memory region.
> > > > >         Size of devie memory is usually larger than that of device
> > > > >         memory region. qemu needs to save/load it in chunks of size of
> > > > >         device memory region.
> > > > >         Not all device has device memory. Like IGD only uses system memory.
> > > > > 
> > > > > System memory dirty pages: If a device produces dirty pages in system
> > > > >         memory, it is able to get dirty bitmap for certain range of system
> > > > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > > > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > > > >         callback, dirty pages in system memory will be save/loaded by ram's
> > > > >         live migration code.
> > > > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > > > >         If system memory range is larger than that dirty bitmap region can
> > > > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > > > >         succession.
> > > > > 
> > > > > 
> > > > > Device State Regions
> > > > > --------------------
> > > > > Vendor driver is required to expose two mandatory regions and another two
> > > > > optional regions if it plans to support device state management.
> > > > > 
> > > > > So, there are up to four regions in total.
> > > > > One control region: mandatory.
> > > > >         Get access via read/write system call.
> > > > >         Its layout is defined in struct vfio_device_state_ctl
> > > > > Three data regions: mmaped into qemu.
> > > > >         device config region: mandatory, holding data of device config
> > > > >         device memory region: optional, holding data of device memory
> > > > >         dirty bitmap region: optional, holding bitmap of system memory
> > > > >                             dirty pages
> > > > > 
> > > > > (The reason why four seperate regions are defined is that the unit of mmap
> > > > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > > > control and three mmaped regions for data seems better than one big region
> > > > > padded and sparse mmaped).
> > > > > 
> > > > > 
> > > > > kernel device state interface [1]
> > > > > --------------------------------------
> > > > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > > 
> > > > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > > > #define VFIO_DEVICE_STATE_STOP 1
> > > > > #define VFIO_DEVICE_STATE_LOGGING 2
> > > > > 
> > > > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > > > 
> > > > > struct vfio_device_state_ctl {
> > > > > 	__u32 version;		  /* ro */
> > > > > 	__u32 device_state;       /* VFIO device state, wo */
> > > > > 	__u32 caps;		 /* ro */
> > > > >         struct {
> > > > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > > > 		__u64 size;    /*rw*/
> > > > > 	} device_config;
> > > > > 	struct {
> > > > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > > > 		__u64 size;     /* rw */  
> > > > >                 __u64 pos; /*the offset in total buffer of device memory*/
> > > > > 	} device_memory;
> > > > > 	struct {
> > > > > 		__u64 start_addr; /* wo */
> > > > > 		__u64 page_nr;   /* wo */
> > > > > 	} system_memory;
> > > > > };
> > > > > 
> > > > > Devcie States
> > > > > ------------- 
> > > > > After migration is initialzed, it will set device state via writing to
> > > > > device_state field of control region.
> > > > > 
> > > > > Four states are defined for a VFIO device:
> > > > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > > > 
> > > > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > > > >         commands from device driver.
> > > > >         It is the default state that a VFIO device enters initially.
> > > > > 
> > > > > STOP:  In this state, a VFIO device is deactivated to interact with
> > > > >        device driver.
> > > > > 
> > > > > LOGGING: a special state that it CANNOT exist independently. It must be
> > > > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > > > >        STOP & LOGGING).
> > > > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > > > >        driver can start dirty data logging for device memory and system
> > > > >        memory.
> > > > >        LOGGING only impacts device/system memory. They return whole
> > > > >        snapshot outside LOGGING and dirty data since last get operation
> > > > >        inside LOGGING.
> > > > >        Device config should be always accessible and return whole config
> > > > >        snapshot regardless of LOGGING state.
> > > > >        
> > > > > Note:
> > > > > The reason why RUNNING is the default state is that device's active state
> > > > > must not depend on device state interface.
> > > > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > > > In that condition, a device needs be in active state by default. 
> > > > > 
> > > > > Get Version & Get Caps
> > > > > ----------------------
> > > > > On migration init phase, qemu will probe the existence of device state
> > > > > regions of vendor driver, then get version of the device state interface
> > > > > from the r/w control region.
> > > > > 
> > > > > Then it will probe VFIO device's data capability by reading caps field of
> > > > > control region.
> > > > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > > > >         device memory in the pre-copy and stop-and-copy phases. The
> > > > >         data of device memory is held in the device memory region.
> > > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > > > >         produced by the VFIO device during the pre-copy and
> > > > >         stop-and-copy phases. The dirty bitmap of system memory is
> > > > >         held in the dirty bitmap region.
> > > > > 
> > > > > If it fails to find the two mandatory regions or the optional data
> > > > > regions corresponding to the data caps, or if the version mismatches,
> > > > > qemu will set up a migration blocker and disable live migration for
> > > > > the VFIO device.
> > > > > 
> > > > > 
> > > > > Flows to call device state interface for VFIO live migration
> > > > > ------------------------------------------------------------
> > > > > 
> > > > > Live migration save path:
> > > > > 
> > > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > > 
> > > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > > >  |
> > > > > MIGRATION_STATUS_SAVE_SETUP
> > > > >  |
> > > > >  .save_setup callback -->
> > > > >  get device memory size (whole snapshot size)
> > > > >  get device memory buffer (whole snapshot data)
> > > > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > > > >  |
> > > > > MIGRATION_STATUS_ACTIVE
> > > > >  |
> > > > >  .save_live_pending callback --> get device memory size (dirty data)
> > > > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > > > >  .log_sync callback --> get system memory dirty bitmap
> > > > >  |
> > > > > (vcpu stops) --> set device state -->
> > > > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > > > >  |
> > > > > .save_live_complete_precopy callback -->
> > > > >  get device memory size (dirty data)
> > > > >  get device memory buffer (dirty data)
> > > > >  get device config size (whole snapshot size)
> > > > >  get device config buffer (whole snapshot data)
> > > > >  |
> > > > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > > > MIGRATION_STATUS_COMPLETED
> > > > > 
> > > > > MIGRATION_STATUS_CANCELLED or
> > > > > MIGRATION_STATUS_FAILED
> > > > >  |
> > > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > > > 
> > > > > 
> > > > > Live migration load path:
> > > > > 
> > > > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > > > 
> > > > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > > > >  |
> > > > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > > > >  |
> > > > > MIGRATION_STATUS_ACTIVE
> > > > >  |
> > > > > .load_state callback -->
> > > > >  set device memory size, set device memory buffer, set device config size,
> > > > >  set device config buffer
> > > > >  |
> > > > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > > >  |
> > > > > MIGRATION_STATUS_COMPLETED
> > > > > 
> > > > > 
> > > > > 
> > > > > On the source VM side, in the precopy phase, if a device has
> > > > > VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on, qemu will first get a whole
> > > > > snapshot of device memory in the .save_setup callback, and then get
> > > > > the total size of dirty data in device memory in the
> > > > > .save_live_pending callback by reading the device_memory.size field
> > > > > of the control region.
> > > > > Then in the .save_live_iteration callback, it will get the buffer of
> > > > > device memory's dirty data chunk by chunk from the device memory
> > > > > region, by writing pos & action (GET_BUFFER) to the device_memory.pos
> > > > > & device_memory.action fields of the control region. (The size of
> > > > > each chunk is the size of the device memory data region.)
> > > > > .save_live_pending and .save_live_iteration may be called several
> > > > > times in the precopy phase to get dirty data in device memory.
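[Editor's note] The chunked save loop described above might look roughly like this. It is a sketch of the chunking logic only; the function name and parameters are hypothetical, and the actual GET_BUFFER round trip through the control region is left as a comment.

```c
#include <assert.h>
#include <stdint.h>

/* Save `total` bytes of device memory in chunks no larger than the
 * device memory region (`region_size`).  Returns the number of
 * GET_BUFFER round trips this would take. */
static uint64_t save_in_chunks(uint64_t total, uint64_t region_size)
{
    uint64_t pos = 0, chunks = 0;

    while (pos < total) {
        uint64_t len = total - pos;
        if (len > region_size)
            len = region_size;
        /* Real code would write pos and GET_BUFFER to the control
         * region here, then copy `len` bytes out of the device memory
         * region into the migration stream. */
        pos += len;
        chunks++;
    }
    return chunks;
}
```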
> > > > > 
> > > > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, precopy-phase callbacks
> > > > > like .save_setup, .save_live_pending and .save_live_iteration will not
> > > > > call the vendor driver's device state interface to get data from
> > > > > device memory.
> > > > > 
> > > > > In the precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY
> > > > > on, the .log_sync callback will get the system memory dirty bitmap from
> > > > > the dirty bitmap region by writing system memory's start address, page
> > > > > count and action (GET_BITMAP) to the "system_memory.start_addr",
> > > > > "system_memory.page_nr" and "system_memory.action" fields of the
> > > > > control region.
> > > > > If the page count passed to the .log_sync callback is larger than the
> > > > > bitmap size the dirty bitmap region supports, qemu will cut it into
> > > > > chunks and call the vendor driver's get-system-memory-dirty-bitmap
> > > > > interface once per chunk.
> > > > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback
> > > > > just returns without calling the vendor driver.
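[Editor's note] The .log_sync chunking could be sketched similarly. This is a hypothetical helper; `bitmap_pages` stands for the number of pages the dirty bitmap region can cover per query, and the GET_BITMAP round trip is again left as a comment.

```c
#include <assert.h>
#include <stdint.h>

/* Split a GET_BITMAP query of `page_nr` pages starting at `start_pfn`
 * into chunks of at most `bitmap_pages` pages each.  Returns the
 * number of queries issued to the vendor driver. */
static uint64_t query_bitmap_in_chunks(uint64_t start_pfn, uint64_t page_nr,
                                       uint64_t bitmap_pages)
{
    uint64_t queries = 0;

    while (page_nr > 0) {
        uint64_t nr = page_nr < bitmap_pages ? page_nr : bitmap_pages;
        /* Real code would write start_pfn, nr and GET_BITMAP to the
         * control region, then read nr bits from the bitmap region
         * and merge them into qemu's dirty bitmap. */
        start_pfn += nr;
        page_nr -= nr;
        queries++;
    }
    return queries;
}
```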
> > > > > 
> > > > > In the stop-and-copy phase, the device state will first be set to
> > > > > STOP & LOGGING.
> > > > > In the .save_live_complete_precopy callback, if
> > > > > VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > > > get device memory size and get device memory buffer will be called
> > > > > again.
> > > > > After that, device config data is read from the device config region
> > > > > by reading device_config.size of the control region and writing
> > > > > action (GET_BUFFER) to device_config.action of the control region.
> > > > > Then, after migration completes, the LOGGING state will be cleared in
> > > > > the cleanup handler (i.e. the device state is set to STOP).
> > > > > Clearing the LOGGING state in the cleanup handler also covers the
> > > > > "migration failed" and "migration cancelled" cases, since they go
> > > > > through the same cleanup handler to unset the LOGGING state.
> > > > > 
> > > > > 
> > > > > References
> > > > > ----------
> > > > > 1. kernel side implementation of Device state interfaces:
> > > > > https://patchwork.freedesktop.org/series/56876/
> > > > > 
> > > > > 
> > > > > Yan Zhao (5):
> > > > >   vfio/migration: define kernel interfaces
> > > > >   vfio/migration: support device of device config capability
> > > > >   vfio/migration: tracking of dirty page in system memory
> > > > >   vfio/migration: turn on migration
> > > > >   vfio/migration: support device memory capability
> > > > > 
> > > > >  hw/vfio/Makefile.objs         |   2 +-
> > > > >  hw/vfio/common.c              |  26 ++
> > > > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > > > >  hw/vfio/pci.c                 |  10 +-
> > > > >  hw/vfio/pci.h                 |  26 +-
> > > > >  include/hw/vfio/vfio-common.h |   1 +
> > > > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > > > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > > > >  create mode 100644 hw/vfio/migration.c
> > > > > 
> > > > > -- 
> > > > > 2.7.4
> > > > > 
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > > _______________________________________________
> > > > intel-gvt-dev mailing list
> > > > intel-gvt-dev@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 2/5] vfio/migration: support device of device config capability
  2019-02-20 22:54       ` [Qemu-devel] " Zhao Yan
@ 2019-02-21 10:56         ` Cornelia Huck
  -1 siblings, 0 replies; 133+ messages in thread
From: Cornelia Huck @ 2019-02-21 10:56 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian, dgilbert,
	alex.williamson, intel-gvt-dev, changpeng.liu, zhi.a.wang,
	jonathan.davies

On Wed, 20 Feb 2019 17:54:00 -0500
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Tue, Feb 19, 2019 at 03:37:24PM +0100, Cornelia Huck wrote:
> > On Tue, 19 Feb 2019 16:52:27 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > Device config is the default data that every device should have, so
> > > the device config capability is on by default and need not be set.
> > > 
> > > - Currently two types of resources are saved/loaded for a device with
> > >   the device config capability:
> > >   general PCI config data, and device config data.
> > >   They are copied as a whole when precopy is stopped.
> > > 
> > > Migration setup flow:
> > > - Setup device state regions, check its device state version and capabilities.
> > >   Mmap Device Config Region and Dirty Bitmap Region, if available.
> > > - If the device state regions fail to be set up, a migration blocker
> > >   is registered instead.
> > > - Added SaveVMHandlers to register device state save/load handlers.
> > > - Register VM state change handler to set device's running/stop states.
> > > - On migration startup on source machine, set device's state to
> > >   VFIO_DEVICE_STATE_LOGGING
> > > 
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > Signed-off-by: Yulei Zhang <yulei.zhang@intel.com>
> > > ---
> > >  hw/vfio/Makefile.objs         |   2 +-
> > >  hw/vfio/migration.c           | 633 ++++++++++++++++++++++++++++++++++++++++++
> > >  hw/vfio/pci.c                 |   1 -
> > >  hw/vfio/pci.h                 |  25 +-
> > >  include/hw/vfio/vfio-common.h |   1 +
> > >  5 files changed, 659 insertions(+), 3 deletions(-)
> > >  create mode 100644 hw/vfio/migration.c
> > > 
> > > diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> > > index 8b3f664..f32ff19 100644
> > > --- a/hw/vfio/Makefile.objs
> > > +++ b/hw/vfio/Makefile.objs
> > > @@ -1,6 +1,6 @@
> > >  ifeq ($(CONFIG_LINUX), y)
> > >  obj-$(CONFIG_SOFTMMU) += common.o
> > > -obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o
> > > +obj-$(CONFIG_PCI) += pci.o pci-quirks.o display.o migration.o  
> > 
> > I think you want to split the migration code: The type-independent
> > code, and the pci-specific code.
> >  
> OK. Actually, only the saving/loading of generic PCI config data is
> pci-specific now; the data getting/setting through the device state
> interfaces is type-independent.

Yes. If it has capability chains, it can probably be made to work.

> 
> > >  obj-$(CONFIG_VFIO_CCW) += ccw.o
> > >  obj-$(CONFIG_SOFTMMU) += platform.o
> > >  obj-$(CONFIG_VFIO_XGMAC) += calxeda-xgmac.o
(...)
> > > +static int vfio_save_data_device_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > > +{
> > > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > > +    VFIORegion *region_ctl =
> > > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > > +    VFIORegion *region_config =
> > > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_DATA_CONFIG];
> > > +    void *dest;
> > > +    uint32_t sz;
> > > +    uint8_t *buf = NULL;
> > > +    uint32_t action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
> > > +    uint64_t len = vdev->migration->devconfig_size;
> > > +
> > > +    qemu_put_be64(f, len);  
> > 
> > Why big endian? (Generally, do we need any endianness considerations?)
> >   
> I think big endian is the byte order qemu uses for the saved file.
> As long as qemu_put and qemu_get use the same endianness, there will
> be no problem. Do you agree?

Yes, as long as we are explicit about the endianness. I'm not sure whether
e.g. power even has the ability to mix endianness.

(...)
> > > +static int vfio_set_device_state(VFIOPCIDevice *vdev,
> > > +        uint32_t dev_state)
> > > +{
> > > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > > +    VFIORegion *region =
> > > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > > +    uint32_t sz = sizeof(dev_state);
> > > +
> > > +    if (!vdev->migration) {
> > > +        return -1;
> > > +    }
> > > +
> > > +    if (pwrite(vbasedev->fd, &dev_state, sz,
> > > +              region->fd_offset +
> > > +              offsetof(struct vfio_device_state_ctl, device_state))
> > > +            != sz) {
> > > +        error_report("vfio: Failed to set device state %d", dev_state);  
> > 
> > Can the kernel reject this if a state transition is not allowed (or are
> > all transitions allowed?)
> >   
> Yes, the kernel can reject a state transition if it's not allowed,
> but currently all transitions are allowed.
> Maybe a check for self-to-self transitions is needed in the kernel.

Self-to-self looks benign enough to simply return success early.
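[Editor's note] The early return for a self-to-self transition could be as simple as the following kernel-side sketch; the struct and function names are hypothetical, not from the posted patches.

```c
#include <assert.h>
#include <stdint.h>

struct dev_migration {
    uint32_t device_state;
};

/* Handle a write to the device_state field of the control region.
 * A self-to-self transition is a benign no-op that simply succeeds;
 * everything else would run the vendor-specific transition work. */
static int set_device_state(struct dev_migration *m, uint32_t new_state)
{
    if (new_state == m->device_state)
        return 0; /* nothing to do */
    /* ...vendor-specific transition handling would go here... */
    m->device_state = new_state;
    return 0;
}
```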

> 
> > > +        return -1;
> > > +    }
> > > +    vdev->migration->device_state = dev_state;
> > > +    return 0;
> > > +}
(...)
> > > +static int vfio_check_devstate_version(VFIOPCIDevice *vdev)
> > > +{
> > > +    VFIODevice *vbasedev = &vdev->vbasedev;
> > > +    VFIORegion *region =
> > > +        &vdev->migration->region[VFIO_DEVSTATE_REGION_CTL];
> > > +
> > > +    uint32_t version;
> > > +    uint32_t size = sizeof(version);
> > > +
> > > +    if (pread(vbasedev->fd, &version, size,
> > > +                region->fd_offset +
> > > +                offsetof(struct vfio_device_state_ctl, version))
> > > +            != size) {
> > > +        error_report("%s Failed to read version of device state interfaces",
> > > +                vbasedev->name);
> > > +        return -1;
> > > +    }
> > > +
> > > +    if (version != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
> > > +        error_report("%s migration version mismatch, right version is %d",
> > > +                vbasedev->name, VFIO_DEVICE_STATE_INTERFACE_VERSION);  
> > 
> > So, we require an exact match... or should we allow extending the
> > interface in a backwards-compatible way, in which case we'd require
> > (QEMU interface version) <= (kernel interface version)?
> >  
> Currently, yes. We can discuss that.

If we want to allow that, we need a strictly monotonic progression of
versions here (which means dragging along old compatibility code
basically forever). Maintaining a table of which version is compatible
with which other version would get insane pretty quickly.

Can we somehow accommodate more optional regions via capabilities?
Maybe via optional vmstates?
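[Editor's note] The backwards-compatible check discussed above might look like this sketch; the constant name and value are illustrative, not from the patches (which require an exact match).

```c
#include <assert.h>
#include <stdint.h>

#define QEMU_SUPPORTED_VERSION 1

/* Accept any kernel interface version at least as new as what QEMU
 * understands, under the assumption that newer kernel versions only
 * extend the interface in backwards-compatible ways. */
static int version_compatible(uint32_t kernel_version)
{
    return kernel_version >= QEMU_SUPPORTED_VERSION;
}
```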

> > > +        return -1;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
(...)
> > > +static void vfio_pci_load_config(VFIOPCIDevice *vdev, QEMUFile *f)
> > > +{
> > > +    PCIDevice *pdev = &vdev->pdev;
> > > +    uint32_t ctl, msi_lo, msi_hi, msi_data, bar_cfg, i;
> > > +    bool msi_64bit;
> > > +
> > > +    /* restore PCI BAR configuration */
> > > +    ctl = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > > +            ctl & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> > > +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > +        bar_cfg = qemu_get_be32(f);
> > > +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar_cfg, 4);
> > > +    }
> > > +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > > +            ctl | PCI_COMMAND_IO | PCI_COMMAND_MEMORY, 2);
> > > +
> > > +    /* restore msi configuration */
> > > +    ctl = pci_default_read_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > > +    msi_64bit = !!(ctl & PCI_MSI_FLAGS_64BIT);
> > > +
> > > +    vfio_pci_write_config(&vdev->pdev,
> > > +            pdev->msi_cap + PCI_MSI_FLAGS,
> > > +            ctl & ~PCI_MSI_FLAGS_ENABLE, 2);
> > > +
> > > +    msi_lo = qemu_get_be32(f);
> > > +    vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO, msi_lo, 4);
> > > +
> > > +    if (msi_64bit) {
> > > +        msi_hi = qemu_get_be32(f);
> > > +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > +                msi_hi, 4);
> > > +    }
> > > +    msi_data = qemu_get_be32(f);
> > > +    vfio_pci_write_config(pdev,
> > > +            pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > +            msi_data, 2);
> > > +
> > > +    vfio_pci_write_config(&vdev->pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > +            ctl | PCI_MSI_FLAGS_ENABLE, 2);
> > > +  
> > 
> > Ok, this function is indeed pci-specific and probably should be moved
> > to the vfio-pci code (other types could hook themselves up in the same
> > place, then).
> >   
> Yes, this is the only pci-specific code.
> Maybe use VFIO_DEVICE_TYPE_PCI as a flag to decide whether to
> save/load pci config data?
> Or, as Dave said, put the saving/loading of pci config data behind the
> VMStateDescription interface?

I prefer an interface like vmstate, where other types can register
themselves, over introducing conditional handling based on hard-coded
type values.


* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-19  8:50 ` [Qemu-devel] " Yan Zhao
@ 2019-02-21 20:40   ` Alex Williamson
  -1 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-02-21 20:40 UTC (permalink / raw)
  To: Yan Zhao
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian, dgilbert,
	intel-gvt-dev, changpeng.liu, cohuck, zhi.a.wang,
	jonathan.davies

Hi Yan,

Thanks for working on this!

On Tue, 19 Feb 2019 16:50:54 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.
> 
> Device Data
> -----------
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
>         Every device is supposed to possess device config data.
>         Usually device config's size is small (no bigger than 10M), and it

I'm not sure how we can really impose a limit here, it is what it is
for a device.  A smaller state is obviously desirable to reduce
downtime, but some devices could have very large states.

>         needs to be loaded in certain strict order.
>         Therefore, device config only needs to be saved/loaded in
>         stop-and-copy phase.
>         The data of device config is held in device config region.
>         Size of device config data is smaller than or equal to that of
>         device config region.

So the intention here is that this is the last data read from the
device and it's done in one pass, so the region needs to be large
enough to expose all config data at once.  On restore it's the last
data written before switching the device to the run state.

> 
> Device Memory: device's internal memory, standalone and outside system

s/system/VM/

>         memory. It is usually very big.

Or it doesn't exist.  Not sure we should be setting expectations since
it will vary per device.

>         This kind of data needs to be saved / loaded in pre-copy and
>         stop-and-copy phase.
>         The data of device memory is held in device memory region.
>         Size of device memory is usually larger than that of device
>         memory region. qemu needs to save/load it in chunks of size of
>         device memory region.
>         Not all device has device memory. Like IGD only uses system memory.

It seems a little gratuitous to me that this is a separate region or
that this data is handled separately.  All of this data is opaque to
QEMU, so why do we need to separate it?

> System memory dirty pages: If a device produces dirty pages in system
>         memory, it is able to get dirty bitmap for certain range of system
>         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
>         phase in .log_sync callback. By setting dirty bitmap in .log_sync
>         callback, dirty pages in system memory will be save/loaded by ram's
>         live migration code.
>         The dirty bitmap of system memory is held in dirty bitmap region.
>         If system memory range is larger than that dirty bitmap region can
>         hold, qemu will cut it into several chunks and get dirty bitmap in
>         succession.
> 
> 
> Device State Regions
> --------------------
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
>         Get access via read/write system call.
>         Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.

Is mmap mandatory?  I would think this would be defined by the mdev
device what access they want to support per region.  We don't want to
impose a more complicated interface if the device doesn't require it.

>         device config region: mandatory, holding data of device config
>         device memory region: optional, holding data of device memory
>         dirty bitmap region: optional, holding bitmap of system memory
>                             dirty pages
> 
> (The reason why four separate regions are defined is that the unit of mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped).

It's not obvious to me how this is better, a big region isn't padded,
there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
gap in a file really of any consequence?  Each region beyond the header
is more than likely larger than PAGE_SIZE, therefore they can be nicely
aligned together.  We still need fields to tell us how much data is
available in each area, so another to tell us the start of each area is
a minor detail.  And I think we still want to allow drivers to specify
which parts of which areas support mmap, so I don't think we're getting
away from sparse mmap support.

> kernel device state interface [1]
> --------------------------------------
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

If we were to go with this multi-region solution, isn't it evident from
the regions exposed that device memory and a dirty bitmap are
provided?  Alternatively, I believe Kirti's proposal doesn't require
this distinction between device memory and device config, a device not
requiring runtime migration data would simply report no data until the
device moved to the stopped state, making it a consistent path for
userspace.  Likewise, the dirty bitmap could report a zero page count
in the bitmap rather than branching based on device support.

Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
consistency in the naming.

> #define VFIO_DEVICE_STATE_RUNNING 0 
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2

It looks like these are being defined as bits, since patch 1 talks
about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
posted some comments about this.  I'm not sure anything prevents us
from defining RUNNING a 1 and STOPPED as 0 so we don't have the
polarity flip vs LOGGING though.

The state "STOP & LOGGING" also feels like a strange "device state", if
the device is stopped, it's not logging any new state, so I think this
is more that the device state is STOP, but the LOGGING feature is
active.  Maybe we should consider these independent bits.  LOGGING is
active as we stop a device so that we can fetch the last dirtied pages,
but disabled as we load the state of the device into the target.

> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
> 	__u32 version;		  /* ro */
> 	__u32 device_state;       /* VFIO device state, wo */
> 	__u32 caps;		 /* ro */
>         struct {
> 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;    /*rw*/
> 	} device_config;

Patch 1 indicates that to get the config buffer we write GET_BUFFER to
action and read from the config region.  The size is previously read
and apparently constant.  To set the config buffer, the config region
is written followed by writing SET_BUFFER to action.  Why is size
listed as read-write?

Doesn't this protocol also require that the mdev driver consume each
full region's worth of host kernel memory for backing pages in
anticipation of a rare event like migration?  This might be a strike
against separate regions if the driver needs to provide backing pages
for 3 separate regions vs 1.  To avoid this runtime overhead, would it
be expected that the user only mmap the regions during migration and
the mdev driver allocate backing pages on mmap?  Should the mmap be
restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
how the mdev driver would back these mmap'd pages.

> 	struct {
> 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;     /* rw */  
>                 __u64 pos; /*the offset in total buffer of device memory*/

Patch 1 outlines the protocol here that getting device memory begins
with writing the position field, followed by reading from the device
memory region.  Setting device memory begins with writing the data to
the device memory region, followed by writing the position field.  Why
does the user need to have visibility of data position?  This is opaque
data to the user, the device should manage how the chunks fit together.

How does the user know when they reach the end?

Bullets 8 and 9 in patch 1 also discuss setting and getting the device
memory size, but these aren't well integrated into the protocol for
getting and setting the memory buffer.  Is getting the device memory
really started by reading the size, which triggers the vendor driver to
snapshot the state in an internal buffer which the user then iterates
through using GET_BUFFER?  Therefore re-reading the size field could
corrupt the data stream?  Wouldn't it be easier to report bytes
available and countdown as the user marks them read?  What does
position mean when we switch from snapshot to dirty data?

> 	} device_memory;
> 	struct {
> 		__u64 start_addr; /* wo */
> 		__u64 page_nr;   /* wo */
> 	} system_memory;
> };

Why is one specified as an address and the other as pages?  Note that
Kirti's implementation has an optimization to know how many pages are
set within a range to avoid unnecessary buffer reads.

> 
> Devcie States
> ------------- 
> After migration is initialzed, it will set device state via writing to
> device_state field of control region.
> 
> Four states are defined for a VFIO device:
>         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> 
> RUNNING: In this state, a VFIO device is in active state ready to receive
>         commands from device driver.
>         It is the default state that a VFIO device enters initially.
> 
> STOP:  In this state, a VFIO device is deactivated to interact with
>        device driver.
> 
> LOGGING: a special state that it CANNOT exist independently. It must be
>        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
>        STOP & LOGGING).
>        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
>        driver can start dirty data logging for device memory and system
>        memory.
>        LOGGING only impacts device/system memory. They return whole
>        snapshot outside LOGGING and dirty data since last get operation
>        inside LOGGING.
>        Device config should be always accessible and return whole config
>        snapshot regardless of LOGGING state.
>        
> Note:
> The reason why RUNNING is the default state is that device's active state
> must not depend on device state interface.
> It is possible that region vfio_device_state_ctl fails to get registered.
> In that condition, a device needs be in active state by default. 
> 
> Get Version & Get Caps
> ----------------------
> On migration init phase, qemu will probe the existence of device state
> regions of vendor driver, then get version of the device state interface
> from the r/w control region.
> 
> Then it will probe VFIO device's data capability by reading caps field of
> control region.
>         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
>         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
>         device memory in pre-copy and stop-and-copy phase. The data of
>         device memory is held in device memory region.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query of dirty pages
>         produced by VFIO device during pre-copy and stop-and-copy phase.
>         The dirty bitmap of system memory is held in dirty bitmap region.

As above, these capabilities seem redundant to the existence of the
device specific regions in this implementation.

> If failing to find two mandatory regions and optional data regions
> corresponding to data caps or version mismatching, it will setup a
> migration blocker and disable live migration for VFIO device.
> 
> 
> Flows to call device state interface for VFIO live migration
> ------------------------------------------------------------
> 
> Live migration save path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_SAVE_SETUP
>  |
>  .save_setup callback -->
>  get device memory size (whole snapshot size)
>  get device memory buffer (whole snapshot data)
>  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
>  |
> MIGRATION_STATUS_ACTIVE
>  |
>  .save_live_pending callback --> get device memory size (dirty data)
>  .save_live_iteration callback --> get device memory buffer (dirty data)
>  .log_sync callback --> get system memory dirty bitmap
>  |
> (vcpu stops) --> set device state -->
>  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
>  |
> .save_live_complete_precopy callback -->
>  get device memory size (dirty data)
>  get device memory buffer (dirty data)
>  get device config size (whole snapshot size)
>  get device config buffer (whole snapshot data)
>  |
> .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> MIGRATION_STATUS_COMPLETED
> 
> MIGRATION_STATUS_CANCELLED or
> MIGRATION_STATUS_FAILED
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> 
> 
> Live migration load path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
>  |
> MIGRATION_STATUS_ACTIVE
>  |
> .load state callback -->
>  set device memory size, set device memory buffer, set device config size,
>  set device config buffer
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_COMPLETED
> 
> 
> 
> In source VM side,
> In precopy phase,
> if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> qemu will first get whole snapshot of device memory in .save_setup
> callback, and then it will get total size of dirty data in device memory in
> .save_live_pending callback by reading device_memory.size field of control
> region.

This requires iterative reads of device memory buffer but the protocol
is unclear (to me) how the user knows how to do this or interact with
the position field. 

> Then in .save_live_iteration callback, it will get buffer of device memory's
> dirty data chunk by chunk from device memory region by writing pos &
> action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> control region. (size of each chunk is the size of device memory data
> region).

What if there's not enough dirty data to fill the region?  The data is
always padded to fill the region?

> .save_live_pending and .save_live_iteration may be called several times in
> precopy phase to get dirty data in device memory.
> 
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> like .save_setup, .save_live_pending, .save_live_iteration will not call
> vendor driver's device state interface to get data from devcie memory.

Therefore through the entire precopy phase we have no data from source
to target to begin a compatibility check :-\  I think both proposals
currently still lack any sort of device compatibility or data
versioning check between source and target.  Thanks,

Alex

> In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> .log_sync callback will get system memory dirty bitmap from dirty bitmap
> region by writing system memory's start address, page count and action 
> (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> "system_memory.action" fields of control region.
> If page count passed in .log_sync callback is larger than the bitmap size
> the dirty bitmap region supports, Qemu will cut it into chunks and call
> vendor driver's get system memory dirty bitmap interface.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> returns without call to vendor driver.
> 
> In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> in save_live_complete_precopy callback,
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on,
> get device memory size and get device memory buffer will be called again.
> After that,
> device config data is get from device config region by reading
> devcie_config.size of control region and writing action (GET_BITMAP) to
> device_config.action of control region.
> Then after migration completes, in cleanup handler, LOGGING state will be
> cleared (i.e. deivce state is set to STOP).
> Clearing LOGGING state in cleanup handler is in consideration of the case
> of "migration failed" and "migration cancelled". They can also leverage
> the cleanup handler to unset LOGGING state.
> 
> 
> References
> ----------
> 1. kernel side implementation of Device state interfaces:
> https://patchwork.freedesktop.org/series/56876/
> 
> 
> Yan Zhao (5):
>   vfio/migration: define kernel interfaces
>   vfio/migration: support device of device config capability
>   vfio/migration: tracking of dirty page in system memory
>   vfio/migration: turn on migration
>   vfio/migration: support device memory capability
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  26 ++
>  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/pci.h                 |  26 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  linux-headers/linux/vfio.h    | 260 +++++++++++++
>  7 files changed, 1174 insertions(+), 9 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-02-21 20:40   ` Alex Williamson
  0 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-02-21 20:40 UTC (permalink / raw)
  To: Yan Zhao
  Cc: qemu-devel, intel-gvt-dev, Zhengxiao.zx, yi.l.liu, eskultet,
	ziye.yang, cohuck, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, eauger, felipe, jonathan.davies, changpeng.liu,
	Ken.Xue, kwankhede, kevin.tian, cjia, arei.gonglei, kvm

Hi Yan,

Thanks for working on this!

On Tue, 19 Feb 2019 16:50:54 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.
> 
> Device Data
> -----------
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
>         Every device is supposed to possess device config data.
>     	Usually device config's size is small (no bigger than 10M), and it

I'm not sure how we can really impose a limit here, it is what it is
for a device.  A smaller state is obviously desirable to reduce
downtime, but some devices could have very large states.

>         needs to be loaded in certain strict order.
>         Therefore, device config only needs to be saved/loaded in
>         stop-and-copy phase.
>         The data of device config is held in device config region.
>         Size of device config data is smaller than or equal to that of
>         device config region.

So the intention here is that this is the last data read from the
device and it's done in one pass, so the region needs to be large
enough to expose all config data at once.  On restore it's the last
data written before switching the device to the run state.

> 
> Device Memory: device's internal memory, standalone and outside system

s/system/VM/

>         memory. It is usually very big.

Or it doesn't exist.  Not sure we should be setting expectations since
it will vary per device.

>         This kind of data needs to be saved / loaded in pre-copy and
>         stop-and-copy phase.
>         The data of device memory is held in device memory region.
>         Size of device memory is usually larger than that of device
>         memory region. qemu needs to save/load it in chunks of size of
>         device memory region.
>         Not all devices have device memory. IGD, for example, only uses system memory.

It seems a little gratuitous to me that this is a separate region or
that this data is handled separately.  All of this data is opaque to
QEMU, so why do we need to separate it?

> System memory dirty pages: If a device produces dirty pages in system
>         memory, it is able to get dirty bitmap for certain range of system
>         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
>         phase in .log_sync callback. By setting dirty bitmap in .log_sync
>         callback, dirty pages in system memory will be save/loaded by ram's
>         live migration code.
>         The dirty bitmap of system memory is held in dirty bitmap region.
>         If system memory range is larger than that dirty bitmap region can
>         hold, qemu will cut it into several chunks and get dirty bitmap in
>         succession.
> 
> 
> Device State Regions
> --------------------
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
>         Get access via read/write system call.
>         Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.

Is mmap mandatory?  I would think this would be defined by the mdev
device what access they want to support per region.  We don't want to
impose a more complicated interface if the device doesn't require it.

>         device config region: mandatory, holding data of device config
>         device memory region: optional, holding data of device memory
>         dirty bitmap region: optional, holding bitmap of system memory
>                             dirty pages
> 
> (The reason why four separate regions are defined is that the unit of mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped).

It's not obvious to me how this is better, a big region isn't padded,
there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
gap in a file really of any consequence?  Each region beyond the header
is more than likely larger than PAGE_SIZE, therefore they can be nicely
aligned together.  We still need fields to tell us how much data is
available in each area, so another to tell us the start of each area is
a minor detail.  And I think we still want to allow drivers to specify
which parts of which areas support mmap, so I don't think we're getting
away from sparse mmap support.

> kernel device state interface [1]
> --------------------------------------
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2

If we were to go with this multi-region solution, isn't it evident from
the regions exposed that device memory and a dirty bitmap are
provided?  Alternatively, I believe Kirti's proposal doesn't require
this distinction between device memory and device config, a device not
requiring runtime migration data would simply report no data until the
device moved to the stopped state, making it a consistent path for
userspace.  Likewise, the dirty bitmap could report a zero page count
in the bitmap rather than branching based on device support.

Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
consistency in the naming.

> #define VFIO_DEVICE_STATE_RUNNING 0 
> #define VFIO_DEVICE_STATE_STOP 1
> #define VFIO_DEVICE_STATE_LOGGING 2

It looks like these are being defined as bits, since patch 1 talks
about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
posted some comments about this.  I'm not sure anything prevents us
from defining RUNNING as 1 and STOPPED as 0 so we don't have the
polarity flip vs LOGGING though.

The state "STOP & LOGGING" also feels like a strange "device state", if
the device is stopped, it's not logging any new state, so I think this
is more that the device state is STOP, but the LOGGING feature is
active.  Maybe we should consider these independent bits.  LOGGING is
active as we stop a device so that we can fetch the last dirtied pages,
but disabled as we load the state of the device into the target.

> #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> 
> struct vfio_device_state_ctl {
> 	__u32 version;		  /* ro */
> 	__u32 device_state;       /* VFIO device state, wo */
> 	__u32 caps;		 /* ro */
>         struct {
> 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;    /*rw*/
> 	} device_config;

Patch 1 indicates that to get the config buffer we write GET_BUFFER to
action and read from the config region.  The size is previously read
and apparently constant.  To set the config buffer, the config region
is written followed by writing SET_BUFFER to action.  Why is size
listed as read-write?

Doesn't this protocol also require that the mdev driver consume each
full region's worth of host kernel memory for backing pages in
anticipation of a rare event like migration?  This might be a strike
against separate regions if the driver needs to provide backing pages
for 3 separate regions vs 1.  To avoid this runtime overhead, would it
be expected that the user only mmap the regions during migration and
the mdev driver allocate backing pages on mmap?  Should the mmap be
restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
how the mdev driver would back these mmap'd pages.

> 	struct {
> 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> 		__u64 size;     /* rw */  
>                 __u64 pos; /*the offset in total buffer of device memory*/

Patch 1 outlines the protocol here that getting device memory begins
with writing the position field, followed by reading from the device
memory region.  Setting device memory begins with writing the data to
the device memory region, followed by writing the position field.  Why
does the user need to have visibility of data position?  This is opaque
data to the user, the device should manage how the chunks fit together.

How does the user know when they reach the end?

Bullets 8 and 9 in patch 1 also discuss setting and getting the device
memory size, but these aren't well integrated into the protocol for
getting and setting the memory buffer.  Is getting the device memory
really started by reading the size, which triggers the vendor driver to
snapshot the state in an internal buffer which the user then iterates
through using GET_BUFFER?  Therefore re-reading the size field could
corrupt the data stream?  Wouldn't it be easier to report bytes
available and countdown as the user marks them read?  What does
position mean when we switch from snapshot to dirty data?

> 	} device_memory;
> 	struct {
> 		__u64 start_addr; /* wo */
> 		__u64 page_nr;   /* wo */
> 	} system_memory;
> };

Why is one specified as an address and the other as pages?  Note that
Kirti's implementation has an optimization to know how many pages are
set within a range to avoid unnecessary buffer reads.

> 
> Device States
> ------------- 
> After migration is initialized, qemu will set the device state by writing to
> device_state field of control region.
> 
> Four states are defined for a VFIO device:
>         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> 
> RUNNING: In this state, a VFIO device is active and ready to receive
>         commands from the device driver.
>         It is the default state that a VFIO device enters initially.
> 
> STOP:  In this state, a VFIO device is deactivated and no longer
>        interacts with the device driver.
> 
> LOGGING: a special state that CANNOT exist independently. It must be
>        set alongside state RUNNING or STOP (i.e. RUNNING & LOGGING,
>        STOP & LOGGING).
>        Qemu sets the LOGGING state in the .save_setup callback, so that the
>        vendor driver can start dirty data logging for device memory and
>        system memory.
>        LOGGING only affects device memory and system memory: outside
>        LOGGING they return a whole snapshot; inside LOGGING they return
>        dirty data since the last get operation.
>        Device config should always be accessible and return a whole config
>        snapshot regardless of LOGGING state.
>        
> Note:
> The reason why RUNNING is the default state is that a device's active state
> must not depend on the device state interface.
> It is possible that the vfio_device_state_ctl region fails to get registered;
> in that case, a device needs to be in the active state by default.
> 
> Get Version & Get Caps
> ----------------------
> On migration init phase, qemu will probe the existence of device state
> regions of vendor driver, then get version of the device state interface
> from the r/w control region.
> 
> Then it will probe VFIO device's data capability by reading caps field of
> control region.
>         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
>         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
>         device memory in pre-copy and stop-and-copy phase. The data of
>         device memory is held in device memory region.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
>         produced by VFIO device during pre-copy and stop-and-copy phase.
>         The dirty bitmap of system memory is held in dirty bitmap region.

As above, these capabilities seem redundant to the existence of the
device specific regions in this implementation.

> If qemu fails to find the two mandatory regions, or the optional data
> regions corresponding to the data caps, or if the version mismatches, it
> will set up a migration blocker and disable live migration for the VFIO
> device.
> 
> 
> Flows to call device state interface for VFIO live migration
> ------------------------------------------------------------
> 
> Live migration save path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_SAVE_SETUP
>  |
>  .save_setup callback -->
>  get device memory size (whole snapshot size)
>  get device memory buffer (whole snapshot data)
>  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
>  |
> MIGRATION_STATUS_ACTIVE
>  |
>  .save_live_pending callback --> get device memory size (dirty data)
>  .save_live_iteration callback --> get device memory buffer (dirty data)
>  .log_sync callback --> get system memory dirty bitmap
>  |
> (vcpu stops) --> set device state -->
>  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
>  |
> .save_live_complete_precopy callback -->
>  get device memory size (dirty data)
>  get device memory buffer (dirty data)
>  get device config size (whole snapshot size)
>  get device config buffer (whole snapshot data)
>  |
> .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> MIGRATION_STATUS_COMPLETED
> 
> MIGRATION_STATUS_CANCELLED or
> MIGRATION_STATUS_FAILED
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> 
> 
> Live migration load path:
> 
> (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> 
> MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
>  |
> (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
>  |
> MIGRATION_STATUS_ACTIVE
>  |
> .load_state callback -->
>  set device memory size, set device memory buffer, set device config size,
>  set device config buffer
>  |
> (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
>  |
> MIGRATION_STATUS_COMPLETED
> 
> 
> 
> On the source VM side, in the precopy phase,
> if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> qemu will first get whole snapshot of device memory in .save_setup
> callback, and then it will get total size of dirty data in device memory in
> .save_live_pending callback by reading device_memory.size field of control
> region.

This requires iterative reads of device memory buffer but the protocol
is unclear (to me) how the user knows how to do this or interact with
the position field. 

> Then in .save_live_iteration callback, it will get buffer of device memory's
> dirty data chunk by chunk from device memory region by writing pos &
> action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> control region. (size of each chunk is the size of device memory data
> region).

What if there's not enough dirty data to fill the region?  The data is
always padded to fill the region?

> .save_live_pending and .save_live_iteration may be called several times in
> precopy phase to get dirty data in device memory.
> 
> If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> like .save_setup, .save_live_pending, .save_live_iteration will not call
> vendor driver's device state interface to get data from device memory.

Therefore through the entire precopy phase we have no data from source
to target to begin a compatibility check :-\  I think both proposals
currently still lack any sort of device compatibility or data
versioning check between source and target.  Thanks,

Alex

> In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> .log_sync callback will get system memory dirty bitmap from dirty bitmap
> region by writing system memory's start address, page count and action 
> (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> "system_memory.action" fields of control region.
> If page count passed in .log_sync callback is larger than the bitmap size
> the dirty bitmap region supports, Qemu will cut it into chunks and call
> vendor driver's get system memory dirty bitmap interface.
> If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> returns without calling the vendor driver.
> 
> In the stop-and-copy phase, device state will be set to STOP & LOGGING first.
> In the save_live_complete_precopy callback,
> if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> get device memory size and get device memory buffer will be called again.
> After that,
> device config data is read from the device config region by reading
> device_config.size of the control region and writing action (GET_BUFFER) to
> device_config.action of the control region.
> Then after migration completes, in the cleanup handler, the LOGGING state
> will be cleared (i.e. device state is set to STOP).
> Clearing LOGGING state in the cleanup handler covers the cases of
> "migration failed" and "migration cancelled", which can also leverage
> the cleanup handler to unset LOGGING state.
> 
> 
> References
> ----------
> 1. kernel side implementation of Device state interfaces:
> https://patchwork.freedesktop.org/series/56876/
> 
> 
> Yan Zhao (5):
>   vfio/migration: define kernel interfaces
>   vfio/migration: support device of device config capability
>   vfio/migration: tracking of dirty page in system memory
>   vfio/migration: turn on migration
>   vfio/migration: support device memory capability
> 
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/common.c              |  26 ++
>  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/pci.h                 |  26 +-
>  include/hw/vfio/vfio-common.h |   1 +
>  linux-headers/linux/vfio.h    | 260 +++++++++++++
>  7 files changed, 1174 insertions(+), 9 deletions(-)
>  create mode 100644 hw/vfio/migration.c
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-21 20:40   ` [Qemu-devel] " Alex Williamson
@ 2019-02-25  2:22     ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-25  2:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian, dgilbert,
	intel-gvt-dev, changpeng.liu, cohuck, zhi.a.wang,
	jonathan.davies

On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> Hi Yan,
> 
> Thanks for working on this!
> 
> On Tue, 19 Feb 2019 16:50:54 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> > 
> > Device Data
> > -----------
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> >         Every device is supposed to possess device config data.
> >     	Usually device config's size is small (no bigger than 10M), and it
> 
> I'm not sure how we can really impose a limit here, it is what it is
> for a device.  A smaller state is obviously desirable to reduce
> downtime, but some devices could have very large states.
> 
> >         needs to be loaded in certain strict order.
> >         Therefore, device config only needs to be saved/loaded in
> >         stop-and-copy phase.
> >         The data of device config is held in device config region.
> >         Size of device config data is smaller than or equal to that of
> >         device config region.
> 
> So the intention here is that this is the last data read from the
> device and it's done in one pass, so the region needs to be large
> enough to expose all config data at once.  On restore it's the last
> data written before switching the device to the run state.
> 
> > 
> > Device Memory: device's internal memory, standalone and outside system
> 
> s/system/VM/
> 
> >         memory. It is usually very big.
> 
> Or it doesn't exist.  Not sure we should be setting expectations since
> it will vary per device.
> 
> >         This kind of data needs to be saved / loaded in pre-copy and
> >         stop-and-copy phase.
> >         The data of device memory is held in device memory region.
> >         Size of devie memory is usually larger than that of device
> >         memory region. qemu needs to save/load it in chunks of size of
> >         device memory region.
> >         Not all device has device memory. Like IGD only uses system memory.
> 
> It seems a little gratuitous to me that this is a separate region or
> that this data is handled separately.  All of this data is opaque to
> QEMU, so why do we need to separate it?
hi Alex,
Since the device state interfaces are provided by the kernel, they are
expected to meet needs as general as possible. So, do you think there are
use cases where user space knows the device well and wants the kernel to
return specific data back to it?
E.g. it just wants to get the whole device config data, including all
MMIOs, page tables, PCI config data...
Or it just wants to get a snapshot of the current device memory, not
including any dirty data.
Or it just needs the dirty pages in device memory or system memory.
With such accurate queries, quite a lot of useful features can be
developed in user space.

If all of this data is opaque to the user app, it seems the only use case
is live migration.

From another aspect, if the final solution is to make the data opaque to
user space, like what NVIDIA did, the kernel side's implementation will be
more complicated, and actually a bit of a challenge for the vendor driver.
In that case, in the pre-copy phase:
1. while not in LOGGING state, the vendor driver first returns full data,
including a full device memory snapshot
2. user space reads some data (you can't expect it to finish reading all
the data)
3. then user space sets the device state to LOGGING to start dirty data
logging
4. the vendor driver starts dirty data logging, appends the dirty data to
the tail of the remaining unread full data, and increases the pending data
size?
5. user space keeps reading data.
6. the vendor driver keeps appending new dirty data to the tail of the
remaining unread full/dirty data and increasing the pending data size?

In the stop-and-copy phase:
1. user space sets the device state to exit the LOGGING state,
2. the vendor driver stops data logging. It has to append the device
   config data at the tail of the remaining dirty data unread by user space.

During this flow, when should the vendor driver collect dirty data? Just
keep logging and appending to the tail? How can it ensure the dirty data
is fresh before the LOGGING state is exited? How does the vendor driver
know whether certain dirty data has been copied or not?

I've no idea how NVIDIA handles this problem, and they haven't opened
their kernel side code.
I just feel it's a bit hard for other vendor drivers to follow :)

> > System memory dirty pages: If a device produces dirty pages in system
> >         memory, it is able to get dirty bitmap for certain range of system
> >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> >         callback, dirty pages in system memory will be save/loaded by ram's
> >         live migration code.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> >         If system memory range is larger than that dirty bitmap region can
> >         hold, qemu will cut it into several chunks and get dirty bitmap in
> >         succession.
> > 
> > 
> > Device State Regions
> > --------------------
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support device state management.
> > 
> > So, there are up to four regions in total.
> > One control region: mandatory.
> >         Get access via read/write system call.
> >         Its layout is defined in struct vfio_device_state_ctl
> > Three data regions: mmaped into qemu.
> 
> Is mmap mandatory?  I would think this would be defined by the mdev
> device what access they want to support per region.  We don't want to
> impose a more complicated interface if the device doesn't require it.
I think it's "mmap is preferred, but allowed to fail".
Just like a normal region with the MMAP flag on (like BAR regions), we
also allow its mmap to fail, right?

> >         device config region: mandatory, holding data of device config
> >         device memory region: optional, holding data of device memory
> >         dirty bitmap region: optional, holding bitmap of system memory
> >                             dirty pages
> > 
> > (The reason why four seperate regions are defined is that the unit of mmap
> > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > control and three mmaped regions for data seems better than one big region
> > padded and sparse mmaped).
> 
> It's not obvious to me how this is better, a big region isn't padded,
> there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
> gap in a file really of any consequence?  Each region beyond the header
> is more than likely larger than PAGE_SIZE, therefore they can be nicely
> aligned together.  We still need fields to tell us how much data is
> available in each area, so another to tell us the start of each area is
> a minor detail.  And I think we still want to allow drivers to specify
> which parts of which areas support mmap, so I don't think we're getting
> away from sparse mmap support.

with separate regions and sub-region types defined, user space can
explicitly tell which region is which after vfio_get_dev_region_info().
Along with that, user space knows each region's offset and size. mmap is
allowed to fail and falls back to normal read/write on the region.

But with one big region and sparse mmapped subregions (1 data subregion
or 3 data subregions, whatever), user space can't tell which subregion is
which.
So, if using one big region, I think we need to explicitly define the
subregions' sequence (like index 0 is dedicated to the control subregion,
index 1 is for the device config data subregion ...). The vendor driver
cannot freely change the sequence.
Then keep the data offset the same as region->mmaps[i].offset, and the
data size the same as region->mmaps[i].size (i.e. let the actual data
start immediately at the first byte of its data subregion).
Also, mmaps for sparse mmapped subregions are not allowed to fail.


With one big region, we also need to consider the case where the vendor
driver does not want the data subregion to be mmapped.
So, what is the data layout for that case?
Does the data subregion immediately follow the control subregion, or not?
Of course, for this condition, we can specify the data field's start
offset and size through the control region. And we must not expect the
data start offsets in source and target to be equal.
(Because the big region's fd_offset may vary between source and target.
Consider the case when both source and target have one opregion and one
device state region, but the source has the opregion first and the target
has the device state region first.
If we think this case is illegal, we must be able to detect it in the
first place.)
Also, we must keep the start offset and size consistent with the above
mmap case.


> > kernel device state interface [1]
> > --------------------------------------
> > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> If we were to go with this multi-region solution, isn't it evident from
> the regions exposed that device memory and a dirty bitmap are
> provided?  Alternatively, I believe Kirti's proposal doesn't require

> this distinction between device memory and device config, a device not
> requiring runtime migration data would simply report no data until the
> device moved to the stopped state, making it consistent path for
> userspace.  Likewise, the dirty bitmap could report a zero page count
> in the bitmap rather than branching based on device support.
If the path in user space is consistent for device config and device
memory, there will be many unnecessary calls into the vendor driver just
to get the data size.

> Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
> VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
> consistency in the naming.
> 
> > #define VFIO_DEVICE_STATE_RUNNING 0 
> > #define VFIO_DEVICE_STATE_STOP 1
> > #define VFIO_DEVICE_STATE_LOGGING 2
> 
> It looks like these are being defined as bits, since patch 1 talks
> about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
> posted some comments about this.  I'm not sure anything prevents us
> from defining RUNNING a 1 and STOPPED as 0 so we don't have the
> polarity flip vs LOGGING though.
> 
> The state "STOP & LOGGING" also feels like a strange "device state", if
> the device is stopped, it's not logging any new state, so I think this
> is more that the device state is STOP, but the LOGGING feature is
> active.  Maybe we should consider these independent bits.  LOGGING is
> active as we stop a device so that we can fetch the last dirtied pages,
> but disabled as we load the state of the device into the target.
> 
> > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > 
> > struct vfio_device_state_ctl {
> > 	__u32 version;		  /* ro */
> > 	__u32 device_state;       /* VFIO device state, wo */
> > 	__u32 caps;		 /* ro */
> >         struct {
> > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;    /*rw*/
> > 	} device_config;
> 
> Patch 1 indicates that to get the config buffer we write GET_BUFFER to
> action and read from the config region.  The size is previously read
> and apparently constant.  To set the config buffer, the config region
> is written followed by writing SET_BUFFER to action.  Why is size
> listed as read-write?
this is the size of the config data.
The size of the config data is <= the size of the device config region.


> Doesn't this protocol also require that the mdev driver consume each
> full region's worth of host kernel memory for backing pages in
> anticipation of a rare event like migration?  This might be a strike
> against separate regions if the driver needs to provide backing pages
> for 3 separate regions vs 1.  To avoid this runtime overhead, would it
> be expected that the user only mmap the regions during migration and
> the mdev driver allocate backing pages on mmap?  Should the mmap be
> restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
> how the mdev driver would back these mmap'd pages.
>
yes, 3 separate regions consume a little more memory than 1 region,
but it's just a small overhead.
In Intel's kernel implementation, the device config region's size is 9M
and the dirty bitmap region's size is 16k.
If there is a device memory region, its size can be defined as, say, 100M.
So it's 109M vs 100M.

> > 	struct {
> > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;     /* rw */  
> >                 __u64 pos; /*the offset in total buffer of device memory*/
> 
> Patch 1 outlines the protocol here that getting device memory begins
> with writing the position field, followed by reading from the device
> memory region.  Setting device memory begins with writing the data to
> the device memory region, followed by writing the position field.  Why
> does the user need to have visibility of data position?  This is opaque
> data to the user, the device should manage how the chunks fit together.
> 
> How does the user know when they reach the end?
sorry, maybe I didn't explain clearly here.

device  ________________________________________
memory  |    |    |////|    |    |    |    |    |
data:   |____|____|////|____|____|____|____|____|
                  :pos :
                  :    :
device            :____:
memory            |    |
region:           |____|

the whole sequence is like this:

1. user space reads device_memory.size
2. the driver collects the device memory data (a full snapshot or dirty
data, depending on whether it is in LOGGING state or not), and returns
the total size of this data.
3. user space finishes reading device_memory.size (which may be >= the
device memory region's size)

4. user space starts a loop like

    while (pos < total_len) {
        uint64_t len = region_devmem->size;

        if (pos + len >= total_len) {
            len = total_len - pos;
        }
        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
            return -1;
        }
        pos += len;
    }

vfio_save_data_device_memory_chunk() reads each chunk from the device
memory region by writing GET_BUFFER to device_memory.action, and pos to
device_memory.pos.

So each time, user space will finish reading the device memory data in
one pass.

Specifying "pos" is just like an "lseek" before a "read" or "write".

> Bullets 8 and 9 in patch 1 also discuss setting and getting the device
> memory size, but these aren't well integrated into the protocol for
> getting and setting the memory buffer.  Is getting the device memory
> really started by reading the size, which triggers the vendor driver to
> snapshot the state in an internal buffer which the user then iterates
> through using GET_BUFFER?  Therefore re-reading the size field could
> corrupt the data stream?  Wouldn't it be easier to report bytes
> available and countdown as the user marks them read?  What does
> position mean when we switch from snapshot to dirty data?
when switching to device memory's dirty data, pos means the position
within the whole dirty data.

.save_live_pending ==> the driver collects the dirty data in device
memory and returns its total size.

.save_live_iterate ==> user space reads all the dirty data from the
device memory region, chunk by chunk.

So, in one iteration, all dirty data is saved.
Then in the next iteration, the dirty data is recalculated.


> 
> > 	} device_memory;
> > 	struct {
> > 		__u64 start_addr; /* wo */
> > 		__u64 page_nr;   /* wo */
> > 	} system_memory;
> > };
> 
> Why is one specified as an address and the other as pages?  Note that
Yes, start_addr ==> start_pfn is better.

> Kirti's implementation has an optimization to know how many pages are
> set within a range to avoid unnecessary buffer reads.
> 

Let's use start_pfn_all and page_nr_all to represent the start pfn and
page_nr passed in from qemu's .log_sync interface,

and start_pfn_i and page_nr_i for the values passed to the driver.


start_pfn_all
  |         start_pfn_i
  |         |
 \ /_______\_/_____________________________
  |    |    |////|    |    |    |    |    |
  |____|____|////|____|____|____|____|____|
            :    :
            :    :
            :____:
bitmap      |    |
region:     |____|
           

1. Each time QEMU queries the dirty bitmap from the driver, it passes in
start_pfn_i and page_nr_i (page_nr_i is the biggest page count the
bitmap region can hold).
2. The driver queries the memory range starting at start_pfn_i with size
page_nr_i.
3. The driver returns a bitmap (if there is no dirty data, the bitmap is
all 0).
4. QEMU saves the pages according to the bitmap.

If there's no dirty data found in step 2, step 4 can be skipped.
(I'll add this check before step 4 in the future, thanks.)
But if there's even 1 bit set in the bitmap, none of steps 1-4 can be
skipped.

Honestly, after reviewing Kirti's implementation, I don't think it's an
optimization. In the pseudo code below for Kirti's implementation, I
would think copied_pfns corresponds to page_nr_i in my case. So, is the
case of copied_pfns equaling 0 meant for the tail chunk? I don't think
that works...

write start_pfn to driver
write page_size to driver
write pfn_count to driver

do {
    read copied_pfns from driver.
    if (copied_pfns == 0) {
        break;
    }
    bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
    buf = get bitmap from driver
    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
                                           (start_pfn + count) * page_size,
                                           copied_pfns);

    count += copied_pfns;

} while (count < pfn_count);



> > 
> > Devcie States
> > ------------- 
> > After migration is initialzed, it will set device state via writing to
> > device_state field of control region.
> > 
> > Four states are defined for a VFIO device:
> >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > 
> > RUNNING: In this state, a VFIO device is in active state ready to receive
> >         commands from device driver.
> >         It is the default state that a VFIO device enters initially.
> > 
> > STOP:  In this state, a VFIO device is deactivated to interact with
> >        device driver.
> > 
> > LOGGING: a special state that it CANNOT exist independently. It must be
> >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> >        STOP & LOGGING).
> >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> >        driver can start dirty data logging for device memory and system
> >        memory.
> >        LOGGING only impacts device/system memory. They return whole
> >        snapshot outside LOGGING and dirty data since last get operation
> >        inside LOGGING.
> >        Device config should be always accessible and return whole config
> >        snapshot regardless of LOGGING state.
> >        
> > Note:
> > The reason why RUNNING is the default state is that device's active state
> > must not depend on device state interface.
> > It is possible that region vfio_device_state_ctl fails to get registered.
> > In that condition, a device needs be in active state by default. 
> > 
> > Get Version & Get Caps
> > ----------------------
> > On migration init phase, qemu will probe the existence of device state
> > regions of vendor driver, then get version of the device state interface
> > from the r/w control region.
> > 
> > Then it will probe VFIO device's data capability by reading caps field of
> > control region.
> >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> >         device memory in pre-copy and stop-and-copy phase. The data of
> >         device memory is held in device memory region.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query of dirty pages
> >         produced by VFIO device during pre-copy and stop-and-copy phase.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> 
> As above, these capabilities seem redundant to the existence of the
> device specific regions in this implementation.
>
seems so :)

> > If failing to find two mandatory regions and optional data regions
> > corresponding to data caps or version mismatching, it will setup a
> > migration blocker and disable live migration for VFIO device.
> > 
> > 
> > Flows to call device state interface for VFIO live migration
> > ------------------------------------------------------------
> > 
> > Live migration save path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_SAVE_SETUP
> >  |
> >  .save_setup callback -->
> >  get device memory size (whole snapshot size)
> >  get device memory buffer (whole snapshot data)
> >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> >  .save_live_pending callback --> get device memory size (dirty data)
> >  .save_live_iteration callback --> get device memory buffer (dirty data)
> >  .log_sync callback --> get system memory dirty bitmap
> >  |
> > (vcpu stops) --> set device state -->
> >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> >  |
> > .save_live_complete_precopy callback -->
> >  get device memory size (dirty data)
> >  get device memory buffer (dirty data)
> >  get device config size (whole snapshot size)
> >  get device config buffer (whole snapshot data)
> >  |
> > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > MIGRATION_STATUS_COMPLETED
> > 
> > MIGRATION_STATUS_CANCELLED or
> > MIGRATION_STATUS_FAILED
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > 
> > 
> > Live migration load path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> > .load state callback -->
> >  set device memory size, set device memory buffer, set device config size,
> >  set device config buffer
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_COMPLETED
> > 
> > 
> > 
> > In source VM side,
> > In precopy phase,
> > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > qemu will first get whole snapshot of device memory in .save_setup
> > callback, and then it will get total size of dirty data in device memory in
> > .save_live_pending callback by reading device_memory.size field of control
> > region.
> 
> This requires iterative reads of device memory buffer but the protocol
> is unclear (to me) how the user knows how to do this or interact with
> the position field. 
> 
> > Then in .save_live_iteration callback, it will get buffer of device memory's
> > dirty data chunk by chunk from device memory region by writing pos &
> > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > control region. (size of each chunk is the size of device memory data
> > region).
> 
> What if there's not enough dirty data to fill the region?  The data is
> always padded to fill the region?
>
I think the dirty data in the vendor driver is organized in a format
like:
(addr0, len0, data0, addr1, len1, data1, addr2, len2, data2, ... addrN,
lenN, dataN).
For a full snapshot, it's like (addr0, len0, data0).
So, to user space and the data region, it doesn't matter whether it's a
full snapshot or dirty data.


> > .save_live_pending and .save_live_iteration may be called several times in
> > precopy phase to get dirty data in device memory.
> > 
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > vendor driver's device state interface to get data from devcie memory.
> 
> Therefore through the entire precopy phase we have no data from source
> to target to begin a compatibility check :-\  I think both proposals
> currently still lack any sort of device compatibility or data
> versioning check between source and target.  Thanks,
I checked the compatibility, though it's not good enough :)

In migration_init, vfio_check_devstate_version() checks the version from
the kernel against VFIO_DEVICE_STATE_INTERFACE_VERSION on both the source
and target sides, and on the target side, vfio_load_state() checks the
source side's version.


int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
{
    ...
    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
        return -EINVAL;
    }
    ...
}

Thanks
Yan

> Alex
> 
> > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > region by writing system memory's start address, page count and action 
> > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > "system_memory.action" fields of control region.
> > If page count passed in .log_sync callback is larger than the bitmap size
> > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > vendor driver's get system memory dirty bitmap interface.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> > returns without call to vendor driver.
> > 
> > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > in save_live_complete_precopy callback,
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on,
> > get device memory size and get device memory buffer will be called again.
> > After that,
> > device config data is get from device config region by reading
> > devcie_config.size of control region and writing action (GET_BITMAP) to
> > device_config.action of control region.
> > Then after migration completes, in cleanup handler, LOGGING state will be
> > cleared (i.e. deivce state is set to STOP).
> > Clearing LOGGING state in cleanup handler is in consideration of the case
> > of "migration failed" and "migration cancelled". They can also leverage
> > the cleanup handler to unset LOGGING state.
> > 
> > 
> > References
> > ----------
> > 1. kernel side implementation of Device state interfaces:
> > https://patchwork.freedesktop.org/series/56876/
> > 
> > 
> > Yan Zhao (5):
> >   vfio/migration: define kernel interfaces
> >   vfio/migration: support device of device config capability
> >   vfio/migration: tracking of dirty page in system memory
> >   vfio/migration: turn on migration
> >   vfio/migration: support device memory capability
> > 
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/common.c              |  26 ++
> >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |  10 +-
> >  hw/vfio/pci.h                 |  26 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> >  7 files changed, 1174 insertions(+), 9 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-02-25  2:22     ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-02-25  2:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, intel-gvt-dev, Zhengxiao.zx, yi.l.liu, eskultet,
	ziye.yang, cohuck, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, eauger, felipe, jonathan.davies, changpeng.liu,
	Ken.Xue, kwankhede, kevin.tian, cjia, arei.gonglei, kvm

On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> Hi Yan,
> 
> Thanks for working on this!
> 
> On Tue, 19 Feb 2019 16:50:54 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > This patchset enables VFIO devices to have live migration capability.
> > Currently it does not support post-copy phase.
> > 
> > It follows Alex's comments on last version of VFIO live migration patches,
> > including device states, VFIO device state region layout, dirty bitmap's
> > query.
> > 
> > Device Data
> > -----------
> > Device data is divided into three types: device memory, device config,
> > and system memory dirty pages produced by device.
> > 
> > Device config: data like MMIOs, page tables...
> >         Every device is supposed to possess device config data.
> >     	Usually device config's size is small (no big than 10M), and it
> 
> I'm not sure how we can really impose a limit here, it is what it is
> for a device.  A smaller state is obviously desirable to reduce
> downtime, but some devices could have very large states.
> 
> >         needs to be loaded in certain strict order.
> >         Therefore, device config only needs to be saved/loaded in
> >         stop-and-copy phase.
> >         The data of device config is held in device config region.
> >         Size of device config data is smaller than or equal to that of
> >         device config region.
> 
> So the intention here is that this is the last data read from the
> device and it's done in one pass, so the region needs to be large
> enough to expose all config data at once.  On restore it's the last
> data written before switching the device to the run state.
> 
> > 
> > Device Memory: device's internal memory, standalone and outside system
> 
> s/system/VM/
> 
> >         memory. It is usually very big.
> 
> Or it doesn't exist.  Not sure we should be setting expectations since
> it will vary per device.
> 
> >         This kind of data needs to be saved / loaded in pre-copy and
> >         stop-and-copy phase.
> >         The data of device memory is held in device memory region.
> >         Size of devie memory is usually larger than that of device
> >         memory region. qemu needs to save/load it in chunks of size of
> >         device memory region.
> >         Not all device has device memory. Like IGD only uses system memory.
> 
> It seems a little gratuitous to me that this is a separate region or
> that this data is handled separately.  All of this data is opaque to
> QEMU, so why do we need to separate it?
hi Alex,
as the device state interfaces are provided by kernel, it is expected to
meet as general needs as possible. So, do you think there are such use
cases from user space that user space knows well of the device, and
it wants kernel to return desired data back to it.
E.g. It just wants to get whole device config data including all mmios,
page tables, pci config data...
or, It just wants to get current device memory snapshot, not including any
dirty data.
Or, It just needs the dirty pages in device memory or system memory.
With all this accurate query, quite a lot of useful features can be
developped in user space.

If all of this data is opaque to user app, seems the only use case is
for live migration.

>From another aspect, if the final solution is to let the data opaque to
user space, like what NV did, kernel side's implementation will be more
complicated, and actually a little challenge to vendor driver.
in that case, in pre-copy phase,
1. in not LOGGING state, vendor driver first returns full data including
full device memory snapshot
2. user space reads some data (you can't expect it to finish reading all
data)
3. then userspace set the device state to LOGGING to start dirty data
logging
4. vendor driver starts dirty data logging, and appends the dirty data to
the tail of remaining unread full data and increase the pending data size?
5. user space keeps reading data.
6. vendor driver keeps appending new dirty data to the tail of remaining
unread full data/dirty data and increase the pending data size?

in stop-and-copy phase
1. user space sets device state to exit LOGGING state,
2. vendor driver stops data logging. it has to append device config
   data at the tail of remaining dirty data unread by userspace.

during this flow, when vendor driver should get dirty data? just keeps
logging and appends to tail? how to ensure dirty data is refresh new before
LOGGING state exit? how does vendor driver know whether certain dirty data
is copied or not?

I've no idea how NVidia handle this problem, and they don't open their
kernel side code. 
just feel it's a bit hard for other vendor drivers to follow:)

> > System memory dirty pages: If a device produces dirty pages in system
> >         memory, it is able to get dirty bitmap for certain range of system
> >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> >         callback, dirty pages in system memory will be save/loaded by ram's
> >         live migration code.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> >         If system memory range is larger than that dirty bitmap region can
> >         hold, qemu will cut it into several chunks and get dirty bitmap in
> >         succession.
> > 
> > 
> > Device State Regions
> > --------------------
> > Vendor driver is required to expose two mandatory regions and another two
> > optional regions if it plans to support device state management.
> > 
> > So, there are up to four regions in total.
> > One control region: mandatory.
> >         Get access via read/write system call.
> >         Its layout is defined in struct vfio_device_state_ctl
> > Three data regions: mmaped into qemu.
> 
> Is mmap mandatory?  I would think this would be defined by the mdev
> device what access they want to support per region.  We don't want to
> impose a more complicated interface if the device doesn't require it.
I think it's "mmap is preferred, but allowed to fail".
Just like a normal region with the MMAP flag on (like BAR regions), we also
allow its mmap to fail, right?

> >         device config region: mandatory, holding data of device config
> >         device memory region: optional, holding data of device memory
> >         dirty bitmap region: optional, holding bitmap of system memory
> >                             dirty pages
> > 
> > (The reason why four seperate regions are defined is that the unit of mmap
> > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > control and three mmaped regions for data seems better than one big region
> > padded and sparse mmaped).
> 
> It's not obvious to me how this is better, a big region isn't padded,
> there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
> gap in a file really of any consequence?  Each region beyond the header
> is more than likely larger than PAGE_SIZE, therefore they can be nicely
> aligned together.  We still need fields to tell us how much data is
> available in each area, so another to tell us the start of each area is
> a minor detail.  And I think we still want to allow drivers to specify
> which parts of which areas support mmap, so I don't think we're getting
> away from sparse mmap support.

With separate regions and sub-region types defined, user space can
explicitly know which region is which after vfio_get_dev_region_info().
Along with that, user space knows each region's offset and size. mmap is
allowed to fail and falls back to normal read/write to the region.

But with one big region and sparse mmapped subregions (1 data subregion or
3 data subregions, whatever), user space can't tell which subregion is
which.
So, if using one big region, I think we need to explicitly define the
subregions' sequence (like index 0 is dedicated to the control subregion,
index 1 is for the device config data subregion ...). The vendor driver
cannot freely change the sequence.
Then keep the data offset the same as region->mmaps[i].offset, and the data
size the same as region->mmaps[i].size (i.e. let the actual data start
immediately from the first byte of its data subregion).
Also, mmaps for sparse mmapped subregions are not allowed to fail.


With one big region, we also need to consider the case where the vendor
driver does not want the data subregion to be mmapped.
So, what is the data layout for that case?
Does the data subregion immediately follow the control subregion, or not?
Of course, for this condition, we can specify the data field's start offset
and size through the control region. And we must not expect the data start
offsets in source and target to be equal
(because the big region's fd_offset
may vary between source and target; consider the case when both source and
target have one opregion and one device state region, but the source has
the opregion first and the target has the device state region first.
If we think this case is illegal, we must be able to detect it in the first
place).
Also, we must keep the start offset and size consistent with the above mmap
case.


> > kernel device state interface [1]
> > --------------------------------------
> > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> If we were to go with this multi-region solution, isn't it evident from
> the regions exposed that device memory and a dirty bitmap are
> provided?  Alternatively, I believe Kirti's proposal doesn't require

> this distinction between device memory and device config, a device not
> requiring runtime migration data would simply report no data until the
> device moved to the stopped state, making it consistent path for
> userspace.  Likewise, the dirty bitmap could report a zero page count
> in the bitmap rather than branching based on device support.
If the path in user space is consistent for device config and device
memory, there will be many unnecessary calls into the vendor driver just to
get the data size.
 

> Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
> VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
> consistency in the naming.
> 
> > #define VFIO_DEVICE_STATE_RUNNING 0 
> > #define VFIO_DEVICE_STATE_STOP 1
> > #define VFIO_DEVICE_STATE_LOGGING 2
> 
> It looks like these are being defined as bits, since patch 1 talks
> about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
> posted some comments about this.  I'm not sure anything prevents us
> from defining RUNNING a 1 and STOPPED as 0 so we don't have the
> polarity flip vs LOGGING though.
> 
> The state "STOP & LOGGING" also feels like a strange "device state", if
> the device is stopped, it's not logging any new state, so I think this
> is more that the device state is STOP, but the LOGGING feature is
> active.  Maybe we should consider these independent bits.  LOGGING is
> active as we stop a device so that we can fetch the last dirtied pages,
> but disabled as we load the state of the device into the target.
> 
> > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > 
> > struct vfio_device_state_ctl {
> > 	__u32 version;		  /* ro */
> > 	__u32 device_state;       /* VFIO device state, wo */
> > 	__u32 caps;		 /* ro */
> >         struct {
> > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;    /*rw*/
> > 	} device_config;
> 
> Patch 1 indicates that to get the config buffer we write GET_BUFFER to
> action and read from the config region.  The size is previously read
> and apparently constant.  To set the config buffer, the config region
> is written followed by writing SET_BUFFER to action.  Why is size
> listed as read-write?
This is the size of the config data;
size of config data <= size of config data region.
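
A sketch of the save side of this protocol, with the mmap'd config region
mocked as a plain buffer (the struct mirrors the device_config part of
vfio_device_state_ctl from the patchset; the helper name is mine):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1

/* device_config part of the control region, as in the proposal */
struct device_config_ctl {
    uint32_t action;  /* wo: GET_BUFFER or SET_BUFFER */
    uint64_t size;    /* rw: size of config data, <= region size */
};

/* Save device config: read the data size, trigger GET_BUFFER so the
 * driver fills the region, then copy from the (mocked) mmap'd config
 * region into the caller's buffer, in a single pass with no chunking. */
static uint64_t save_device_config(struct device_config_ctl *ctl,
                                   const uint8_t *config_region,
                                   uint64_t region_size,
                                   uint8_t *out)
{
    uint64_t size = ctl->size;       /* read size of config data */
    assert(size <= region_size);     /* invariant stated in the mail */
    ctl->action = VFIO_DEVICE_DATA_ACTION_GET_BUFFER;
    memcpy(out, config_region, size);
    return size;
}
```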


> Doesn't this protocol also require that the mdev driver consume each
> full region's worth of host kernel memory for backing pages in
> anticipation of a rare event like migration?  This might be a strike
> against separate regions if the driver needs to provide backing pages
> for 3 separate regions vs 1.  To avoid this runtime overhead, would it
> be expected that the user only mmap the regions during migration and
> the mdev driver allocate backing pages on mmap?  Should the mmap be
> restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
> how the mdev driver would back these mmap'd pages.
>
Yes, 3 separate regions consume a little more memory than 1 region,
but it's just a small overhead.
In Intel's kernel implementation,
the device config region's size is 9M, and the dirty bitmap region's size
is 16k. If there is a device memory region, its size can be defined as
100M? So it's 109M vs 100M?

> > 	struct {
> > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > 		__u64 size;     /* rw */  
> >                 __u64 pos; /*the offset in total buffer of device memory*/
> 
> Patch 1 outlines the protocol here that getting device memory begins
> with writing the position field, followed by reading from the device
> memory region.  Setting device memory begins with writing the data to
> the device memory region, followed by writing the position field.  Why
> does the user need to have visibility of data position?  This is opaque
> data to the user, the device should manage how the chunks fit together.
> 
> How does the user know when they reach the end?
Sorry, maybe I didn't explain it clearly here.

device  ________________________________________
memory  |    |    |////|    |    |    |    |    |
data:   |____|____|////|____|____|____|____|____|
                  :pos :
                  :    :
device            :____:
memory            |    |
region:           |____|

The whole sequence is like this:

1. user space reads device_memory.size
2. the driver collects the device memory data (a full snapshot or dirty
data, depending on whether it's in LOGGING state or not), and returns the
total size of this data.
3. user space finishes reading device_memory.size (which may be >= the
device memory region's size)

4. user space starts a loop like

    while (pos < total_len) {
        uint64_t len = region_devmem->size;

        if (pos + len >= total_len) {
            len = total_len - pos;
        }
        if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
            return -1;
        }
        pos += len;
    }

 vfio_save_data_device_memory_chunk reads each chunk from the device memory
 region by writing GET_BUFFER to device_memory.action, and pos to
 device_memory.pos.


So, each time, user space finishes reading the device memory data in one
pass.

Specifying "pos" is just like the "lseek" before a "write".
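
The per-chunk length computed by the loop above can be isolated into a
small helper for illustration (the real code also writes GET_BUFFER to
device_memory.action before each read; the function name is mine):

```c
#include <stdint.h>

/* Length of the next device-memory chunk to transfer, given the current
 * position, the total data size reported in device_memory.size, and the
 * size of the device memory region (the per-chunk maximum). */
static uint64_t next_chunk_len(uint64_t pos, uint64_t total_len,
                               uint64_t region_size)
{
    uint64_t len = region_size;

    if (pos + len >= total_len)
        len = total_len - pos;
    return len;
}
```

With a 109M total and a 100M region, this gives two chunks: 100M at pos 0,
then 9M at pos 100M.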

> Bullets 8 and 9 in patch 1 also discuss setting and getting the device
> memory size, but these aren't well integrated into the protocol for
> getting and setting the memory buffer.  Is getting the device memory
> really started by reading the size, which triggers the vendor driver to
> snapshot the state in an internal buffer which the user then iterates
> through using GET_BUFFER?  Therefore re-reading the size field could
> corrupt the data stream?  Wouldn't it be easier to report bytes
> available and countdown as the user marks them read?  What does
> position mean when we switch from snapshot to dirty data?
When switching to device memory's dirty data, pos means the position within
the whole dirty data.

.save_live_pending ==> the driver collects the dirty data in device memory
and returns its total size.

.save_live_iterate ==> user space reads all the dirty data from the device
memory region chunk by chunk

So, in one iteration, all dirty data is saved;
then in the next iteration, the dirty data is recalculated.


> 
> > 	} device_memory;
> > 	struct {
> > 		__u64 start_addr; /* wo */
> > 		__u64 page_nr;   /* wo */
> > 	} system_memory;
> > };
> 
> Why is one specified as an address and the other as pages?  Note that
Yes, start_addr ==> start_pfn would be better

> Kirti's implementation has an optimization to know how many pages are
> set within a range to avoid unnecessary buffer reads.
> 

Let's use start_pfn_all, page_nr_all to represent the start pfn and
page_nr passed in from qemu's .log_sync interface,

and start_pfn_i, page_nr_i for the values passed to the driver.


start_pfn_all
  |         start_pfn_i
  |         |
 \ /_______\_/_____________________________
  |    |    |////|    |    |    |    |    |
  |____|____|////|____|____|____|____|____|
            :    :
            :    :
            :____:
bitmap      |    |
region:     |____|
           

1. Each time QEMU queries the dirty bitmap from the driver, it passes in
start_pfn_i and page_nr_i (page_nr_i is the largest page count the
bitmap region can hold).
2. the driver queries the memory range starting at start_pfn_i with size
page_nr_i.
3. the driver returns a bitmap (if there is no dirty data, the bitmap is
all 0s).
4. QEMU saves the pages according to the bitmap

If there's no dirty data found in step 2, step 4 can be skipped
(I'll add this check before step 4 in the future, thanks),
but if there's even 1 bit set in the bitmap, none of steps 1-4 can be
skipped.
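
The chunked .log_sync flow in steps 1-4 can be sketched as follows; the
driver query itself is elided, only QEMU's range-cutting is shown (the
function name and signature are mine):

```c
#include <stdint.h>

/* Walk a pfn range in chunks of at most region_page_nr pages (the most
 * the dirty bitmap region can hold), issuing one driver query per chunk.
 * Returns the number of driver queries issued. */
static int log_sync_chunks(uint64_t start_pfn_all, uint64_t page_nr_all,
                           uint64_t region_page_nr)
{
    int queries = 0;
    uint64_t pfn = start_pfn_all;
    uint64_t end = start_pfn_all + page_nr_all;

    while (pfn < end) {
        uint64_t page_nr_i = end - pfn;

        if (page_nr_i > region_page_nr)
            page_nr_i = region_page_nr;
        /* here: write pfn / page_nr_i and GET_BITMAP to the control
         * region, then read page_nr_i bits from the bitmap region */
        queries++;
        pfn += page_nr_i;
    }
    return queries;
}
```

For example, a 16k bitmap region holds 131072 page bits, so syncing
131073 pages takes two queries.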

Honestly, after reviewing Kirti's implementation, I don't think it's an
optimization. As in below pseudo code for Kirti's code, I would think the
copied_pfns corresponds to the page_nr_i in my case. so, the case of
copied_pfns equaling to 0 is for the tail chunk? don't think it's working..

write start_pfn to driver
write page_size to driver
write pfn_count to driver

do {
    read copied_pfns from driver
    if (copied_pfns == 0) {
        break;
    }
    bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
    buf = get bitmap from driver
    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
                                           (start_pfn + count) * page_size,
                                           copied_pfns);
    count += copied_pfns;
} while (count < pfn_count);



> > 
> > Devcie States
> > ------------- 
> > After migration is initialzed, it will set device state via writing to
> > device_state field of control region.
> > 
> > Four states are defined for a VFIO device:
> >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > 
> > RUNNING: In this state, a VFIO device is in active state ready to receive
> >         commands from device driver.
> >         It is the default state that a VFIO device enters initially.
> > 
> > STOP:  In this state, a VFIO device is deactivated to interact with
> >        device driver.
> > 
> > LOGGING: a special state that it CANNOT exist independently. It must be
> >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> >        STOP & LOGGING).
> >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> >        driver can start dirty data logging for device memory and system
> >        memory.
> >        LOGGING only impacts device/system memory. They return whole
> >        snapshot outside LOGGING and dirty data since last get operation
> >        inside LOGGING.
> >        Device config should be always accessible and return whole config
> >        snapshot regardless of LOGGING state.
> >        
> > Note:
> > The reason why RUNNING is the default state is that device's active state
> > must not depend on device state interface.
> > It is possible that region vfio_device_state_ctl fails to get registered.
> > In that condition, a device needs be in active state by default. 
> > 
> > Get Version & Get Caps
> > ----------------------
> > On migration init phase, qemu will probe the existence of device state
> > regions of vendor driver, then get version of the device state interface
> > from the r/w control region.
> > 
> > Then it will probe VFIO device's data capability by reading caps field of
> > control region.
> >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> >         device memory in pre-copy and stop-and-copy phase. The data of
> >         device memory is held in device memory region.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query of dirty pages
> >         produced by VFIO device during pre-copy and stop-and-copy phase.
> >         The dirty bitmap of system memory is held in dirty bitmap region.
> 
> As above, these capabilities seem redundant to the existence of the
> device specific regions in this implementation.
>
seems so :)

> > If failing to find two mandatory regions and optional data regions
> > corresponding to data caps or version mismatching, it will setup a
> > migration blocker and disable live migration for VFIO device.
> > 
> > 
> > Flows to call device state interface for VFIO live migration
> > ------------------------------------------------------------
> > 
> > Live migration save path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_SAVE_SETUP
> >  |
> >  .save_setup callback -->
> >  get device memory size (whole snapshot size)
> >  get device memory buffer (whole snapshot data)
> >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> >  .save_live_pending callback --> get device memory size (dirty data)
> >  .save_live_iteration callback --> get device memory buffer (dirty data)
> >  .log_sync callback --> get system memory dirty bitmap
> >  |
> > (vcpu stops) --> set device state -->
> >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> >  |
> > .save_live_complete_precopy callback -->
> >  get device memory size (dirty data)
> >  get device memory buffer (dirty data)
> >  get device config size (whole snapshot size)
> >  get device config buffer (whole snapshot data)
> >  |
> > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > MIGRATION_STATUS_COMPLETED
> > 
> > MIGRATION_STATUS_CANCELLED or
> > MIGRATION_STATUS_FAILED
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > 
> > 
> > Live migration load path:
> > 
> > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > 
> > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> >  |
> > MIGRATION_STATUS_ACTIVE
> >  |
> > .load state callback -->
> >  set device memory size, set device memory buffer, set device config size,
> >  set device config buffer
> >  |
> > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> >  |
> > MIGRATION_STATUS_COMPLETED
> > 
> > 
> > 
> > In source VM side,
> > In precopy phase,
> > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > qemu will first get whole snapshot of device memory in .save_setup
> > callback, and then it will get total size of dirty data in device memory in
> > .save_live_pending callback by reading device_memory.size field of control
> > region.
> 
> This requires iterative reads of device memory buffer but the protocol
> is unclear (to me) how the user knows how to do this or interact with
> the position field. 
> 
> > Then in .save_live_iteration callback, it will get buffer of device memory's
> > dirty data chunk by chunk from device memory region by writing pos &
> > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > control region. (size of each chunk is the size of device memory data
> > region).
> 
> What if there's not enough dirty data to fill the region?  The data is
> always padded to fill the region?
>
I think dirty data in the vendor driver is organized in a format like:
(addr0, len0, data0, addr1, len1, data1, addr2, len2, data2, ... addrN,
lenN, dataN).
For a full snapshot, it's like (addr0, len0, data0).
So, to user space and the data region, it doesn't matter whether it's a
full snapshot or dirty data.
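
As an illustration only (the proposal leaves this format internal to the
vendor driver, and the 8-byte field widths are my assumption), walking such
an (addr, len, data) stream could look like:

```c
#include <stdint.h>
#include <string.h>

/* One record: 8-byte addr, 8-byte len, then len bytes of payload.
 * (Assumed framing; the actual layout is vendor-driver internal.) */
struct record_hdr {
    uint64_t addr;
    uint64_t len;
};

/* Count the records in a buffer of concatenated (addr, len, data)
 * records and accumulate the total payload bytes.
 * Returns -1 on a truncated record. */
static int walk_records(const uint8_t *buf, uint64_t buf_len,
                        uint64_t *payload_total)
{
    uint64_t off = 0;
    int n = 0;

    *payload_total = 0;
    while (off < buf_len) {
        struct record_hdr h;

        if (buf_len - off < sizeof(h))
            return -1;               /* truncated header */
        memcpy(&h, buf + off, sizeof(h));
        off += sizeof(h);
        if (buf_len - off < h.len)
            return -1;               /* truncated payload */
        *payload_total += h.len;
        off += h.len;                /* skip payload */
        n++;
    }
    return n;
}
```

A full snapshot would simply be one record covering the whole range, so
the same walker handles both cases.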


> > .save_live_pending and .save_live_iteration may be called several times in
> > precopy phase to get dirty data in device memory.
> > 
> > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > vendor driver's device state interface to get data from devcie memory.
> 
> Therefore through the entire precopy phase we have no data from source
> to target to begin a compatibility check :-\  I think both proposals
> currently still lack any sort of device compatibility or data
> versioning check between source and target.  Thanks,
I checked the compatibility, though it's not good enough :)

In migration_init, vfio_check_devstate_version() checks the version from
the kernel against VFIO_DEVICE_STATE_INTERFACE_VERSION on both source and
target, and on the target side, vfio_load_state() checks the source side's
version.


int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
{
    ...
    if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
        return -EINVAL;
    }
    ...
}

Thanks
Yan

> Alex
> 
> > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > region by writing system memory's start address, page count and action 
> > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > "system_memory.action" fields of control region.
> > If page count passed in .log_sync callback is larger than the bitmap size
> > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > vendor driver's get system memory dirty bitmap interface.
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> > returns without call to vendor driver.
> > 
> > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > in save_live_complete_precopy callback,
> > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on,
> > get device memory size and get device memory buffer will be called again.
> > After that,
> > device config data is get from device config region by reading
> > devcie_config.size of control region and writing action (GET_BITMAP) to
> > device_config.action of control region.
> > Then after migration completes, in cleanup handler, LOGGING state will be
> > cleared (i.e. deivce state is set to STOP).
> > Clearing LOGGING state in cleanup handler is in consideration of the case
> > of "migration failed" and "migration cancelled". They can also leverage
> > the cleanup handler to unset LOGGING state.
> > 
> > 
> > References
> > ----------
> > 1. kernel side implementation of Device state interfaces:
> > https://patchwork.freedesktop.org/series/56876/
> > 
> > 
> > Yan Zhao (5):
> >   vfio/migration: define kernel interfaces
> >   vfio/migration: support device of device config capability
> >   vfio/migration: tracking of dirty page in system memory
> >   vfio/migration: turn on migration
> >   vfio/migration: support device memory capability
> > 
> >  hw/vfio/Makefile.objs         |   2 +-
> >  hw/vfio/common.c              |  26 ++
> >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.c                 |  10 +-
> >  hw/vfio/pci.h                 |  26 +-
> >  include/hw/vfio/vfio-common.h |   1 +
> >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> >  7 files changed, 1174 insertions(+), 9 deletions(-)
> >  create mode 100644 hw/vfio/migration.c
> > 
> 


* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-25  2:22     ` [Qemu-devel] " Zhao Yan
@ 2019-03-06  0:22       ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-03-06  0:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Wang, Zhi A, Tian, Kevin, dgilbert,
	intel-gvt-dev,

Hi Alex,
We still have some open questions below. Could you kindly help review
them? :)

Thanks
Yan

On Mon, Feb 25, 2019 at 10:22:56AM +0800, Zhao Yan wrote:
> On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> > Hi Yan,
> > 
> > Thanks for working on this!
> > 
> > On Tue, 19 Feb 2019 16:50:54 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > > 
> > > Device Data
> > > -----------
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > >         Every device is supposed to possess device config data.
> > >     	Usually device config's size is small (no big than 10M), and it
> > 
> > I'm not sure how we can really impose a limit here, it is what it is
> > for a device.  A smaller state is obviously desirable to reduce
> > downtime, but some devices could have very large states.
> > 
> > >         needs to be loaded in certain strict order.
> > >         Therefore, device config only needs to be saved/loaded in
> > >         stop-and-copy phase.
> > >         The data of device config is held in device config region.
> > >         Size of device config data is smaller than or equal to that of
> > >         device config region.
> > 
> > So the intention here is that this is the last data read from the
> > device and it's done in one pass, so the region needs to be large
> > enough to expose all config data at once.  On restore it's the last
> > data written before switching the device to the run state.
> > 
> > > 
> > > Device Memory: device's internal memory, standalone and outside system
> > 
> > s/system/VM/
> > 
> > >         memory. It is usually very big.
> > 
> > Or it doesn't exist.  Not sure we should be setting expectations since
> > it will vary per device.
> > 
> > >         This kind of data needs to be saved / loaded in pre-copy and
> > >         stop-and-copy phase.
> > >         The data of device memory is held in device memory region.
> > >         Size of devie memory is usually larger than that of device
> > >         memory region. qemu needs to save/load it in chunks of size of
> > >         device memory region.
> > >         Not all device has device memory. Like IGD only uses system memory.
> > 
> > It seems a little gratuitous to me that this is a separate region or
> > that this data is handled separately.  All of this data is opaque to
> > QEMU, so why do we need to separate it?
> hi Alex,
> as the device state interfaces are provided by kernel, it is expected to
> meet as general needs as possible. So, do you think there are such use
> cases from user space that user space knows well of the device, and
> it wants kernel to return desired data back to it.
> E.g. It just wants to get whole device config data including all mmios,
> page tables, pci config data...
> or, It just wants to get current device memory snapshot, not including any
> dirty data.
> Or, It just needs the dirty pages in device memory or system memory.
> With all this accurate query, quite a lot of useful features can be
> developped in user space.
> 
> If all of this data is opaque to user app, seems the only use case is
> for live migration.
> 
> From another aspect, if the final solution is to let the data opaque to
> user space, like what NV did, kernel side's implementation will be more
> complicated, and actually a little challenge to vendor driver.
> in that case, in pre-copy phase,
> 1. in not LOGGING state, vendor driver first returns full data including
> full device memory snapshot
> 2. user space reads some data (you can't expect it to finish reading all
> data)
> 3. then userspace set the device state to LOGGING to start dirty data
> logging
> 4. vendor driver starts dirty data logging, and appends the dirty data to
> the tail of remaining unread full data and increase the pending data size?
> 5. user space keeps reading data.
> 6. vendor driver keeps appending new dirty data to the tail of remaining
> unread full data/dirty data and increase the pending data size?
> 
> in stop-and-copy phase
> 1. user space sets device state to exit LOGGING state,
> 2. vendor driver stops data logging. it has to append device config
>    data at the tail of remaining dirty data unread by userspace.
> 
> during this flow, when vendor driver should get dirty data? just keeps
> logging and appends to tail? how to ensure dirty data is refresh new before
> LOGGING state exit? how does vendor driver know whether certain dirty data
> is copied or not?
> 
> I've no idea how NVidia handle this problem, and they don't open their
> kernel side code. 
> just feel it's a bit hard for other vendor drivers to follow:)
> 
> > > System memory dirty pages: If a device produces dirty pages in system
> > >         memory, it is able to get dirty bitmap for certain range of system
> > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > >         callback, dirty pages in system memory will be save/loaded by ram's
> > >         live migration code.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > >         If system memory range is larger than that dirty bitmap region can
> > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > >         succession.
> > > 
> > > 
> > > Device State Regions
> > > --------------------
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > > 
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > >         Get access via read/write system call.
> > >         Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.
> > 
> > Is mmap mandatory?  I would think this would be defined by the mdev
> > device what access they want to support per region.  We don't want to
> > impose a more complicated interface if the device doesn't require it.
> I think it's "mmap is preferred, but allowed to fail".
> just like a normal region with MMAP flag on (like bar regions), we also
> allow its mmap to fail, right?
> 
> > >         device config region: mandatory, holding data of device config
> > >         device memory region: optional, holding data of device memory
> > >         dirty bitmap region: optional, holding bitmap of system memory
> > >                             dirty pages
> > > 
> > > (The reason why four seperate regions are defined is that the unit of mmap
> > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmaped regions for data seems better than one big region
> > > padded and sparse mmaped).
> > 
> > It's not obvious to me how this is better, a big region isn't padded,
> > there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
> > gap in a file really of any consequence?  Each region beyond the header
> > is more than likely larger than PAGE_SIZE, therefore they can be nicely
> > aligned together.  We still need fields to tell us how much data is
> > available in each area, so another to tell us the start of each area is
> > a minor detail.  And I think we still want to allow drivers to specify
> > which parts of which areas support mmap, so I don't think we're getting
> > away from sparse mmap support.
> 
> with seperate regions and sub-region type defined, user space can explictly
> know which region is which region after vfio_get_dev_region_info(). along
> with it, user space knows region offset and size. mmap is allowed to fail
> and falls back to normal read/write to the region.
> 
> But with one big region and sparse mmapped subregions (1 data subregion or
> 3 data subregions, whatever), userspace can't tell which subregion is which
> one.
> So, if using one big region, I think we need to explictly define
> subregions' sequence (like index 0 is dedicated to control subregion,
> index 1 is for device config data subregion ...). Vendor driver cannot
> freely change the sequence.
> Then keep data offset the same as region->mmaps[i].offset, and data size
> the same as region->mmaps[i].size. (i.e. let actual data starts immediatly
> from first byte of its data subregion)
> Also, mmaps for sparse mmaped subregions are not allowed to fail.
> 
> 
> With one big region, we also need to consider the case when the vendor driver
> does not want the data subregion to be mmapped.
> What is the data layout for that case?
> Does the data subregion immediately follow the control subregion, or not?
> Of course, for this condition, we can specify the data field's start offset
> and size through the control region. And we must not expect the data start
> offsets in source and target to be equal.
> (This is because the big region's fd_offset may vary between source and
> target. Consider the case when both source and target have one opregion and
> one device state region, but the source has the opregion first while the
> target has the device state region first.
> If we think this case is illegal, we must be able to detect it in the first
> place.)
> Also, we must keep the start offset and size consistent with the above mmap
> case.
> 
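If the one-big-region route were taken, userspace would identify the mmappable subregions by walking the capability chain returned by VFIO_DEVICE_GET_REGION_INFO. A minimal sketch of that walk follows; the structure names match linux/vfio.h, but they are simplified stand-ins here and the demo buffer offsets are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the linux/vfio.h capability-chain structures. */
struct vfio_info_cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;          /* offset of next cap from start of info buffer */
};

#define VFIO_REGION_INFO_CAP_SPARSE_MMAP 1

struct vfio_region_info_cap_sparse_mmap {
    struct vfio_info_cap_header header;
    uint32_t nr_areas;      /* number of (offset, size) mmappable areas */
    uint32_t reserved;
};

/* Walk the cap chain that follows a VFIO_DEVICE_GET_REGION_INFO reply. */
static const struct vfio_region_info_cap_sparse_mmap *
find_sparse_mmap(const void *info_buf, uint32_t cap_offset)
{
    while (cap_offset) {
        const struct vfio_info_cap_header *hdr =
            (const void *)((const char *)info_buf + cap_offset);
        if (hdr->id == VFIO_REGION_INFO_CAP_SPARSE_MMAP)
            return (const void *)hdr;
        cap_offset = hdr->next;
    }
    return 0;
}

/* Build a synthetic info buffer with two chained caps and walk it. */
static uint32_t demo_nr_areas(void)
{
    unsigned char buf[256];
    struct vfio_info_cap_header dummy = { 2, 1, 96 };  /* some other cap */
    struct vfio_region_info_cap_sparse_mmap sp = { { 1, 1, 0 }, 3, 0 };

    memset(buf, 0, sizeof(buf));
    memcpy(buf + 32, &dummy, sizeof(dummy));           /* chain starts at 32 */
    memcpy(buf + 96, &sp, sizeof(sp));

    const struct vfio_region_info_cap_sparse_mmap *found =
        find_sparse_mmap(buf, 32);
    return found ? found->nr_areas : 0;
}
```

This only illustrates the discovery step that the sparse-mmap scheme would require; it does not settle which subregion index carries which data, which is exactly the open question above.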
> 
> > > kernel device state interface [1]
> > > --------------------------------------
> > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > 
> > If we were to go with this multi-region solution, isn't it evident from
> > the regions exposed that device memory and a dirty bitmap are
> > provided?  Alternatively, I believe Kirti's proposal doesn't require
> 
> > this distinction between device memory and device config, a device not
> > requiring runtime migration data would simply report no data until the
> > device moved to the stopped state, making it consistent path for
> > userspace.  Likewise, the dirty bitmap could report a zero page count
> > in the bitmap rather than branching based on device support.
> If the path in userspace is consistent for device config and device
> memory, there will be many unnecessary calls into the vendor driver to
> get the data size.
>  
> 
> > Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
> > VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
> > consistency in the naming.
> > 
> > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > #define VFIO_DEVICE_STATE_STOP 1
> > > #define VFIO_DEVICE_STATE_LOGGING 2
> > 
> > It looks like these are being defined as bits, since patch 1 talks
> > about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
> > posted some comments about this.  I'm not sure anything prevents us
> > from defining RUNNING a 1 and STOPPED as 0 so we don't have the
> > polarity flip vs LOGGING though.
> > 
> > The state "STOP & LOGGING" also feels like a strange "device state", if
> > the device is stopped, it's not logging any new state, so I think this
> > is more that the device state is STOP, but the LOGGING feature is
> > active.  Maybe we should consider these independent bits.  LOGGING is
> > active as we stop a device so that we can fetch the last dirtied pages,
> > but disabled as we load the state of the device into the target.
> > 
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > 
> > > struct vfio_device_state_ctl {
> > > 	__u32 version;		  /* ro */
> > > 	__u32 device_state;       /* VFIO device state, wo */
> > > 	__u32 caps;		 /* ro */
> > >         struct {
> > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;    /*rw*/
> > > 	} device_config;
> > 
> > Patch 1 indicates that to get the config buffer we write GET_BUFFER to
> > action and read from the config region.  The size is previously read
> > and apparently constant.  To set the config buffer, the config region
> > is written followed by writing SET_BUFFER to action.  Why is size
> > listed as read-write?
> this is the size of config data.
> size of config data <= size of config data region.
> 
> 
> > Doesn't this protocol also require that the mdev driver consume each
> > full region's worth of host kernel memory for backing pages in
> > anticipation of a rare event like migration?  This might be a strike
> > against separate regions if the driver needs to provide backing pages
> > for 3 separate regions vs 1.  To avoid this runtime overhead, would it
> > be expected that the user only mmap the regions during migration and
> > the mdev driver allocate backing pages on mmap?  Should the mmap be
> > restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
> > how the mdev driver would back these mmap'd pages.
> >
> Yes, 3 separate regions consume a little more memory than 1 region,
> but it's just a little overhead.
> In Intel's kernel implementation,
> the device config region's size is 9M and the dirty bitmap region's size is
> 16k. If there is a device memory region, its size can be defined as 100M?
> So it's 109M vs 100M?
> 
> > > 	struct {
> > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;     /* rw */  
> > >                 __u64 pos; /*the offset in total buffer of device memory*/
> > 
> > Patch 1 outlines the protocol here that getting device memory begins
> > with writing the position field, followed by reading from the device
> > memory region.  Setting device memory begins with writing the data to
> > the device memory region, followed by writing the position field.  Why
> > does the user need to have visibility of data position?  This is opaque
> > data to the user, the device should manage how the chunks fit together.
> > 
> > How does the user know when they reach the end?
> sorry, maybe I didn't explain clearly here.
> 
> device  ________________________________________
> memory  |    |    |////|    |    |    |    |    |
> data:   |____|____|////|____|____|____|____|____|
>                   :pos :
>                   :    :
> device            :____:
> memory            |    |
> region:           |____|
> 
> the whole sequence is like this:
> 
> 1. user space reads device_memory.size
> 2. the driver gets device memory's data (a full snapshot or dirty data,
> depending on whether it's in LOGGING state or not), and returns the total
> size of this data.
> 3. user space finishes reading device_memory.size (which may be larger than
> the device memory region's size)
> 
> 4. user space starts a loop like
>   
>    while (pos < total_len) {
>         uint64_t len = region_devmem->size;
> 
>         if (pos + len >= total_len) {
>             len = total_len - pos;
>         }
>         if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
>             return -1;
>         }
>         pos += len;
>    }
> 
>  vfio_save_data_device_memory_chunk reads each chunk from device memory
>  region by writing GET_BUFFER  to device_memory.action, and pos to
>  device_memory.pos.
> 
> 
> So each time, userspace finishes getting the device memory data in one
> pass.
> 
> Specifying "pos" is just like an "lseek" before a "write".
> 
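The tail-chunk clamping done in the loop above can be isolated into a small helper; a sketch (the helper name is mine, not from the patchset):

```c
#include <assert.h>
#include <stdint.h>

/* Length of the next chunk to transfer starting at pos, given the total
 * data size reported by the driver and the device memory region's size.
 * Mirrors the clamping in the while loop quoted above. */
static uint64_t chunk_len(uint64_t pos, uint64_t total_len, uint64_t region_size)
{
    uint64_t remain;

    if (pos >= total_len)
        return 0;
    remain = total_len - pos;
    return remain < region_size ? remain : region_size;
}
```

With a 100-byte region and 250 bytes of total data, the loop issues chunks of 100, 100, and 50 bytes, then stops.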
> > Bullets 8 and 9 in patch 1 also discuss setting and getting the device
> > memory size, but these aren't well integrated into the protocol for
> > getting and setting the memory buffer.  Is getting the device memory
> > really started by reading the size, which triggers the vendor driver to
> > snapshot the state in an internal buffer which the user then iterates
> > through using GET_BUFFER?  Therefore re-reading the size field could
> > corrupt the data stream?  Wouldn't it be easier to report bytes
> > available and countdown as the user marks them read?  What does
> > position mean when we switch from snapshot to dirty data?
> When switching to device memory's dirty data, pos means the position within
> the whole dirty data.
> 
> .save_live_pending ==> driver gets dirty data in device memory and returns
> total size.
> 
> .save_live_iterate ==> userspace reads all dirty data from device memory
> region chunk by chunk
> 
> So, in one iteration, all dirty data is saved;
> then in the next iteration, the dirty data is recalculated.
> 
> 
> > 
> > > 	} device_memory;
> > > 	struct {
> > > 		__u64 start_addr; /* wo */
> > > 		__u64 page_nr;   /* wo */
> > > 	} system_memory;
> > > };
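Pieced together from the fragments quoted above, the control region layout would look roughly as follows. Note the quoted struct shows no action field under system_memory, although the cover letter's .log_sync description writes GET_BITMAP to "system_memory.action", so one is added here as an assumption; offsets and padding in this sketch are illustrative, not the actual ABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Consolidated sketch of the quoted struct; not the authoritative UAPI. */
struct vfio_device_state_ctl_sketch {
    uint32_t version;            /* ro */
    uint32_t device_state;       /* wo */
    uint32_t caps;               /* ro */
    struct {
        uint32_t action;         /* wo, GET_BUFFER or SET_BUFFER */
        uint64_t size;           /* rw */
    } device_config;
    struct {
        uint32_t action;         /* wo, GET_BUFFER or SET_BUFFER */
        uint64_t size;           /* rw */
        uint64_t pos;            /* wo, offset into total device memory data */
    } device_memory;
    struct {
        uint64_t start_addr;     /* wo */
        uint64_t page_nr;        /* wo */
        uint32_t action;         /* wo, GET_BITMAP -- assumed, see note */
    } system_memory;
};
```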
> > 
> > Why is one specified as an address and the other as pages?  Note that
> Yes, start_addr ==> start pfn is better
> 
> > Kirti's implementation has an optimization to know how many pages are
> > set within a range to avoid unnecessary buffer reads.
> > 
> 
> Let's use start_pfn_all and page_nr_all to represent the start pfn and
> page_nr passed in from QEMU's .log_sync interface,
> 
> and start_pfn_i and page_nr_i for the values passed to the driver.
> 
> 
> start_pfn_all
>   |         start_pfn_i
>   |         |
>  \ /_______\_/_____________________________
>   |    |    |////|    |    |    |    |    |
>   |____|____|////|____|____|____|____|____|
>             :    :
>             :    :
>             :____:
> bitmap      |    |
> region:     |____|
>            
> 
> 1. Each time QEMU queries the dirty bitmap from the driver, it passes in
> start_pfn_i and page_nr_i (page_nr_i is the largest page count the
> bitmap region can describe).
> 2. The driver queries the memory range starting at start_pfn_i with size
> page_nr_i.
> 3. The driver returns a bitmap (if there is no dirty data, the bitmap is
> all 0).
> 4. QEMU saves the pages according to the bitmap.
> 
> If there's no dirty data found in step 2, step 4 can be skipped.
> (I'll add this check before step 4 in the future, thanks.)
> But if there's even 1 bit set in the bitmap, none of steps 1-4 can be
> skipped.
> 
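The chunked querying in steps 1-4 above, for the case where the range handed to .log_sync exceeds what the bitmap region can describe, can be sketched as below; the function name and callback shape are mine, not from the patchset:

```c
#include <assert.h>
#include <stdint.h>

typedef void (*bitmap_query_fn)(uint64_t start_pfn_i, uint64_t page_nr_i,
                                void *opaque);

/* Split (start_pfn_all, page_nr_all) into per-query chunks no larger than
 * pages_per_region (the most pages the bitmap region can describe).
 * Invokes fn once per chunk and returns the number of queries issued. */
static unsigned split_bitmap_queries(uint64_t start_pfn_all,
                                     uint64_t page_nr_all,
                                     uint64_t pages_per_region,
                                     bitmap_query_fn fn, void *opaque)
{
    unsigned n = 0;
    uint64_t done = 0;

    while (done < page_nr_all) {
        uint64_t nr = page_nr_all - done;

        if (nr > pages_per_region)
            nr = pages_per_region;
        if (fn)
            fn(start_pfn_all + done, nr, opaque);
        done += nr;
        n++;
    }
    return n;
}
```

For example, a 1000-page range with a 300-page bitmap region yields four driver queries (300, 300, 300, 100 pages).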
> Honestly, after reviewing Kirti's implementation, I don't think it's an
> optimization. As in the pseudo code below for Kirti's code, I would think
> copied_pfns corresponds to page_nr_i in my case. So, is the case of
> copied_pfns equaling 0 meant for the tail chunk? I don't think it's working..
> 
> write start_pfn to driver
> write page_size to driver
> write pfn_count to driver
> 
> do {
>     read copied_pfns from driver.
>     if (copied_pfns == 0) {
>         break;
>     }
>     bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
>     buf = get bitmap from driver
>     cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>                                            (start_pfn + count) * page_size,
>                                            copied_pfns);
> 
>     count += copied_pfns;
> } while (count < pfn_count);
> 
> 
> 
> > > 
> > > Device States
> > > -------------
> > > After migration is initialized, QEMU will set the device state by writing to
> > > the device_state field of the control region.
> > > 
> > > Four states are defined for a VFIO device:
> > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > 
> > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > >         commands from device driver.
> > >         It is the default state that a VFIO device enters initially.
> > > 
> > > STOP:  In this state, a VFIO device is deactivated to interact with
> > >        device driver.
> > > 
> > > LOGGING: a special state that it CANNOT exist independently. It must be
> > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > >        STOP & LOGGING).
> > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > >        driver can start dirty data logging for device memory and system
> > >        memory.
> > >        LOGGING only impacts device/system memory. They return whole
> > >        snapshot outside LOGGING and dirty data since last get operation
> > >        inside LOGGING.
> > >        Device config should be always accessible and return whole config
> > >        snapshot regardless of LOGGING state.
> > >        
> > > Note:
> > > The reason why RUNNING is the default state is that a device's active state
> > > must not depend on the device state interface.
> > > It is possible that the vfio_device_state_ctl region fails to get registered.
> > > In that condition, a device needs to be in the active state by default.
> > > 
> > > Get Version & Get Caps
> > > ----------------------
> > > On migration init phase, qemu will probe the existence of device state
> > > regions of vendor driver, then get version of the device state interface
> > > from the r/w control region.
> > > 
> > > Then it will probe VFIO device's data capability by reading caps field of
> > > control region.
> > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > >         device memory in pre-copy and stop-and-copy phase. The data of
> > >         device memory is held in device memory region.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > 
> > As above, these capabilities seem redundant to the existence of the
> > device specific regions in this implementation.
> >
> seems so :)
> 
> > > If it fails to find the two mandatory regions or the optional data regions
> > > corresponding to the data caps, or if the versions mismatch, it will set up
> > > a migration blocker and disable live migration for the VFIO device.
> > > 
> > > 
> > > Flows to call device state interface for VFIO live migration
> > > ------------------------------------------------------------
> > > 
> > > Live migration save path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_SAVE_SETUP
> > >  |
> > >  .save_setup callback -->
> > >  get device memory size (whole snapshot size)
> > >  get device memory buffer (whole snapshot data)
> > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > >  .save_live_pending callback --> get device memory size (dirty data)
> > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > >  .log_sync callback --> get system memory dirty bitmap
> > >  |
> > > (vcpu stops) --> set device state -->
> > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > .save_live_complete_precopy callback -->
> > >  get device memory size (dirty data)
> > >  get device memory buffer (dirty data)
> > >  get device config size (whole snapshot size)
> > >  get device config buffer (whole snapshot data)
> > >  |
> > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > MIGRATION_STATUS_CANCELLED or
> > > MIGRATION_STATUS_FAILED
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > 
> > > 
> > > Live migration load path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > > .load state callback -->
> > >  set device memory size, set device memory buffer, set device config size,
> > >  set device config buffer
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > 
> > > 
> > > On the source VM side,
> > > in the precopy phase,
> > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > qemu will first get a whole snapshot of device memory in the .save_setup
> > > callback, and then it will get the total size of dirty data in device memory
> > > in the .save_live_pending callback by reading the device_memory.size field
> > > of the control region.
> > 
> > This requires iterative reads of device memory buffer but the protocol
> > is unclear (to me) how the user knows how to do this or interact with
> > the position field. 
> > 
> > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > dirty data chunk by chunk from device memory region by writing pos &
> > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > control region. (size of each chunk is the size of device memory data
> > > region).
> > 
> > What if there's not enough dirty data to fill the region?  The data is
> > always padded to fill the region?
> >
> I think dirty data in the vendor driver is organized in a format like:
> (addr0, len0, data0, addr1, len1, data1, addr2, len2, data2, ... addrN,
> lenN, dataN).
> For a full snapshot, it's like (addr0, len0, data0).
> So, to userspace and the data region, it doesn't matter whether it's a full
> snapshot or dirty data.
> 
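One possible encoding of those (addr, len, data) records, just to make the framing concrete (this is my sketch of one plausible layout, not the vendor driver's actual wire format):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Append one (addr, len, data) record at buf; returns bytes written.
 * Layout: 8-byte addr, 8-byte len, then len bytes of payload. */
static size_t put_record(uint8_t *buf, uint64_t addr, uint64_t len,
                         const void *data)
{
    memcpy(buf, &addr, sizeof(addr));
    memcpy(buf + 8, &len, sizeof(len));
    memcpy(buf + 16, data, len);
    return 16 + len;
}

/* Parse one record in place; returns bytes consumed. */
static size_t get_record(const uint8_t *buf, uint64_t *addr, uint64_t *len,
                         const void **data)
{
    memcpy(addr, buf, sizeof(*addr));
    memcpy(len, buf + 8, sizeof(*len));
    *data = buf + 16;
    return 16 + (size_t)*len;
}
```

Because each record is self-describing, the same parser works whether the stream carries a single full-snapshot record or many dirty-data records, which is the point made above.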
> 
> > > .save_live_pending and .save_live_iteration may be called several times in
> > > precopy phase to get dirty data in device memory.
> > > 
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > vendor driver's device state interface to get data from device memory.
> > 
> > Therefore through the entire precopy phase we have no data from source
> > to target to begin a compatibility check :-\  I think both proposals
> > currently still lack any sort of device compatibility or data
> > versioning check between source and target.  Thanks,
> I checked the compatibility, though it's not thorough enough :)
> 
> In migration_init, vfio_check_devstate_version() checks the version from the
> kernel against VFIO_DEVICE_STATE_INTERFACE_VERSION on both source and target,
> and on the target side, vfio_load_state() checks the source side's version.
> 
> 
> int vfio_load_state(QEMUFile *f, void *opaque, int version_id)                          
> {       
>     ...
>     if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {                                   
>         return -EINVAL;
>     } 
>     ...
> }
> 
> Thanks
> Yan
> 
> > Alex
> > 
> > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > region by writing system memory's start address, page count and action 
> > > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > > "system_memory.action" fields of control region.
> > > If page count passed in .log_sync callback is larger than the bitmap size
> > > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > > vendor driver's get system memory dirty bitmap interface.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, the .log_sync callback just
> > > returns without calling the vendor driver.
> > > 
> > > In the stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > In the save_live_complete_precopy callback,
> > > if VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > get device memory size and get device memory buffer will be called again.
> > > After that,
> > > device config data is read from the device config region by reading
> > > device_config.size of the control region and writing action (GET_BUFFER) to
> > > device_config.action of the control region.
> > > Then after migration completes, in the cleanup handler, the LOGGING state
> > > will be cleared (i.e. device state is set to STOP).
> > > Clearing the LOGGING state in the cleanup handler is in consideration of the
> > > cases of "migration failed" and "migration cancelled"; they can also leverage
> > > the cleanup handler to unset the LOGGING state.
> > > 
> > > 
> > > References
> > > ----------
> > > 1. kernel side implementation of Device state interfaces:
> > > https://patchwork.freedesktop.org/series/56876/
> > > 
> > > 
> > > Yan Zhao (5):
> > >   vfio/migration: define kernel interfaces
> > >   vfio/migration: support device of device config capability
> > >   vfio/migration: tracking of dirty page in system memory
> > >   vfio/migration: turn on migration
> > >   vfio/migration: support device memory capability
> > > 
> > >  hw/vfio/Makefile.objs         |   2 +-
> > >  hw/vfio/common.c              |  26 ++
> > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > >  hw/vfio/pci.c                 |  10 +-
> > >  hw/vfio/pci.h                 |  26 +-
> > >  include/hw/vfio/vfio-common.h |   1 +
> > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > >  create mode 100644 hw/vfio/migration.c
> > > 
> > 
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev


* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-03-06  0:22       ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-03-06  0:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, Tian, Kevin, dgilbert,
	intel-gvt-dev, Liu, Changpeng, cohuck, Wang, Zhi A,
	jonathan.davies

hi Alex,
we still have some open questions below. Could you kindly help review them? :)

Thanks
Yan

On Mon, Feb 25, 2019 at 10:22:56AM +0800, Zhao Yan wrote:
> On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> > Hi Yan,
> > 
> > Thanks for working on this!
> > 
> > On Tue, 19 Feb 2019 16:50:54 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > > 
> > > Device Data
> > > -----------
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > >         Every device is supposed to possess device config data.
> > >     	Usually device config's size is small (no bigger than 10M), and it
> > 
> > I'm not sure how we can really impose a limit here, it is what it is
> > for a device.  A smaller state is obviously desirable to reduce
> > downtime, but some devices could have very large states.
> > 
> > >         needs to be loaded in certain strict order.
> > >         Therefore, device config only needs to be saved/loaded in
> > >         stop-and-copy phase.
> > >         The data of device config is held in device config region.
> > >         Size of device config data is smaller than or equal to that of
> > >         device config region.
> > 
> > So the intention here is that this is the last data read from the
> > device and it's done in one pass, so the region needs to be large
> > enough to expose all config data at once.  On restore it's the last
> > data written before switching the device to the run state.
> > 
> > > 
> > > Device Memory: device's internal memory, standalone and outside system
> > 
> > s/system/VM/
> > 
> > >         memory. It is usually very big.
> > 
> > Or it doesn't exist.  Not sure we should be setting expectations since
> > it will vary per device.
> > 
> > >         This kind of data needs to be saved / loaded in pre-copy and
> > >         stop-and-copy phase.
> > >         The data of device memory is held in device memory region.
> > >         Size of device memory is usually larger than that of device
> > >         memory region. qemu needs to save/load it in chunks of size of
> > >         device memory region.
> > >         Not all device has device memory. Like IGD only uses system memory.
> > 
> > It seems a little gratuitous to me that this is a separate region or
> > that this data is handled separately.  All of this data is opaque to
> > QEMU, so why do we need to separate it?
> hi Alex,
> As the device state interfaces are provided by the kernel, they are expected
> to meet needs as general as possible. So, do you think there are use
> cases where user space knows the device well and wants the kernel to return
> specific data back to it?
> E.g. it just wants to get the whole device config data, including all MMIOs,
> page tables, PCI config data...
> Or, it just wants to get the current device memory snapshot, not including
> any dirty data.
> Or, it just needs the dirty pages in device memory or system memory.
> With all these precise queries, quite a lot of useful features can be
> developed in user space.
> 
> If all of this data is opaque to the user app, it seems the only use case is
> live migration.
> 
> From another aspect, if the final solution is to keep the data opaque to
> user space, like what NVIDIA did, the kernel side's implementation will be
> more complicated, and actually a bit of a challenge for the vendor driver.
> in that case, in pre-copy phase,
> 1. in not LOGGING state, vendor driver first returns full data including
> full device memory snapshot
> 2. user space reads some data (you can't expect it to finish reading all
> data)
> 3. then userspace set the device state to LOGGING to start dirty data
> logging
> 4. vendor driver starts dirty data logging, and appends the dirty data to
> the tail of remaining unread full data and increase the pending data size?
> 5. user space keeps reading data.
> 6. vendor driver keeps appending new dirty data to the tail of remaining
> unread full data/dirty data and increase the pending data size?
> 
> in stop-and-copy phase
> 1. user space sets device state to exit LOGGING state,
> 2. vendor driver stops data logging. it has to append device config
>    data at the tail of remaining dirty data unread by userspace.
> 
> During this flow, when should the vendor driver get dirty data? Does it just
> keep logging and appending to the tail? How can it ensure the dirty data is
> fresh before exiting the LOGGING state? How does the vendor driver know
> whether certain dirty data has been copied or not?
> 
> I've no idea how NVIDIA handles this problem, and they don't open their
> kernel-side code.
> I just feel it's a bit hard for other vendor drivers to follow :)
> 
> > > System memory dirty pages: If a device produces dirty pages in system
> > >         memory, it is able to get dirty bitmap for certain range of system
> > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > >         callback, dirty pages in system memory will be save/loaded by ram's
> > >         live migration code.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > >         If system memory range is larger than that dirty bitmap region can
> > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > >         succession.
> > > 
> > > 
> > > Device State Regions
> > > --------------------
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > > 
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > >         Get access via read/write system call.
> > >         Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.
> > 
> > Is mmap mandatory?  I would think this would be defined by the mdev
> > device what access they want to support per region.  We don't want to
> > impose a more complicated interface if the device doesn't require it.
> I think it's "mmap is preferred, but allowed to fail",
> just like a normal region with the MMAP flag on (like BAR regions); we also
> allow its mmap to fail, right?
> 
> > >         device config region: mandatory, holding data of device config
> > >         device memory region: optional, holding data of device memory
> > >         dirty bitmap region: optional, holding bitmap of system memory
> > >                             dirty pages
> > > 
> > > (The reason why four separate regions are defined is that the unit of the
> > > mmap system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmapped regions for data seems better than one big region
> > > padded and sparse mmapped).
> > 
> > It's not obvious to me how this is better, a big region isn't padded,
> > there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
> > gap in a file really of any consequence?  Each region beyond the header
> > is more than likely larger than PAGE_SIZE, therefore they can be nicely
> > aligned together.  We still need fields to tell us how much data is
> > available in each area, so another to tell us the start of each area is
> > a minor detail.  And I think we still want to allow drivers to specify
> > which parts of which areas support mmap, so I don't think we're getting
> > away from sparse mmap support.
> 
> With separate regions and sub-region types defined, user space can explicitly
> tell which region is which after vfio_get_dev_region_info(); along with that,
> user space knows each region's offset and size. mmap is allowed to fail and
> falls back to normal read/write to the region.
> 
> But with one big region and sparse mmapped subregions (1 data subregion or
> 3 data subregions, whatever), userspace can't tell which subregion is which
> one.
> So, if using one big region, I think we need to explicitly define the
> subregions' sequence (like index 0 is dedicated to the control subregion,
> index 1 is for the device config data subregion ...). The vendor driver cannot
> freely change the sequence.
> Then keep the data offset the same as region->mmaps[i].offset, and the data
> size the same as region->mmaps[i].size (i.e. let the actual data start
> immediately from the first byte of its data subregion).
> Also, mmaps for sparse mmapped subregions are not allowed to fail.
> 
> 
> With one big region, we also need to consider the case when the vendor driver
> does not want the data subregion to be mmapped.
> What is the data layout for that case?
> Does the data subregion immediately follow the control subregion, or not?
> Of course, for this condition, we can specify the data field's start offset
> and size through the control region. And we must not expect the data start
> offsets in source and target to be equal.
> (This is because the big region's fd_offset may vary between source and
> target. Consider the case when both source and target have one opregion and
> one device state region, but the source has the opregion first while the
> target has the device state region first.
> If we think this case is illegal, we must be able to detect it in the first
> place.)
> Also, we must keep the start offset and size consistent with the above mmap
> case.
> 
> 
> > > kernel device state interface [1]
> > > --------------------------------------
> > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > 
> > If we were to go with this multi-region solution, isn't it evident from
> > the regions exposed that device memory and a dirty bitmap are
> > provided?  Alternatively, I believe Kirti's proposal doesn't require
> 
> > this distinction between device memory and device config, a device not
> > requiring runtime migration data would simply report no data until the
> > device moved to the stopped state, making it consistent path for
> > userspace.  Likewise, the dirty bitmap could report a zero page count
> > in the bitmap rather than branching based on device support.
> If the path in userspace is consistent for device config and device
> memory, there will be many unnecessary calls into the vendor driver to
> get the data size.
>  
> 
> > Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
> > VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
> > consistency in the naming.
> > 
> > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > #define VFIO_DEVICE_STATE_STOP 1
> > > #define VFIO_DEVICE_STATE_LOGGING 2
> > 
> > It looks like these are being defined as bits, since patch 1 talks
> > about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
> > posted some comments about this.  I'm not sure anything prevents us
> > from defining RUNNING a 1 and STOPPED as 0 so we don't have the
> > polarity flip vs LOGGING though.
> > 
> > The state "STOP & LOGGING" also feels like a strange "device state", if
> > the device is stopped, it's not logging any new state, so I think this
> > is more that the device state is STOP, but the LOGGING feature is
> > active.  Maybe we should consider these independent bits.  LOGGING is
> > active as we stop a device so that we can fetch the last dirtied pages,
> > but disabled as we load the state of the device into the target.
> > 
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > 
> > > struct vfio_device_state_ctl {
> > > 	__u32 version;		  /* ro */
> > > 	__u32 device_state;       /* VFIO device state, wo */
> > > 	__u32 caps;		 /* ro */
> > >         struct {
> > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;    /*rw*/
> > > 	} device_config;
> > 
> > Patch 1 indicates that to get the config buffer we write GET_BUFFER to
> > action and read from the config region.  The size is previously read
> > and apparently constant.  To set the config buffer, the config region
> > is written followed by writing SET_BUFFER to action.  Why is size
> > listed as read-write?
> this is the size of config data.
> size of config data <= size of config data region.
> 
> 
> > Doesn't this protocol also require that the mdev driver consume each
> > full region's worth of host kernel memory for backing pages in
> > anticipation of a rare event like migration?  This might be a strike
> > against separate regions if the driver needs to provide backing pages
> > for 3 separate regions vs 1.  To avoid this runtime overhead, would it
> > be expected that the user only mmap the regions during migration and
> > the mdev driver allocate backing pages on mmap?  Should the mmap be
> > restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
> > how the mdev driver would back these mmap'd pages.
> >
> Yes, 3 separate regions consume a little more memory than 1 region,
> but it's just a little overhead.
> As in Intel's kernel implementation,
> the device config region's size is 9M, and the dirty bitmap region's
> size is 16k. If there is a device memory region, its size can be
> defined as 100M. So it's 109M vs 100M?
> 
> > > 	struct {
> > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;     /* rw */  
> > >                 __u64 pos; /*the offset in total buffer of device memory*/
> > 
> > Patch 1 outlines the protocol here that getting device memory begins
> > with writing the position field, followed by reading from the device
> > memory region.  Setting device memory begins with writing the data to
> > the device memory region, followed by writing the position field.  Why
> > does the user need to have visibility of data position?  This is opaque
> > data to the user, the device should manage how the chunks fit together.
> > 
> > How does the user know when they reach the end?
> sorry, maybe I didn't explain clearly here.
> 
> device  ________________________________________
> memory  |    |    |////|    |    |    |    |    |
> data:   |____|____|////|____|____|____|____|____|
>                   :pos :
>                   :    :
> device            :____:
> memory            |    |
> region:           |____|
> 
> the whole sequence is like this:
> 
> 1. user space reads device_memory.size
> 2. driver gets device memory's data (full snapshot or dirty data,
> depending on whether it's in LOGGING state or not), and returns the
> total size of this data.
> 3. user space finishes reading device_memory.size (which may be larger
> than the device memory region's size)
> 
> 4. user space starts a loop like
> 
>     while (pos < total_len) {
>         /* read at most one device-memory-region's worth per chunk */
>         uint64_t len = region_devmem->size;
> 
>         if (pos + len >= total_len) {
>             len = total_len - pos;
>         }
>         if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
>             return -1;
>         }
>         pos += len;
>     }
> 
>  vfio_save_data_device_memory_chunk reads each chunk from the device
>  memory region by writing pos to device_memory.pos and GET_BUFFER to
>  device_memory.action.
> 
> 
> So each time, userspace finishes getting the device memory data in one
> pass.
> 
> Specifying "pos" is just like the "lseek" before a "read" or "write".
> 
> > Bullets 8 and 9 in patch 1 also discuss setting and getting the device
> > memory size, but these aren't well integrated into the protocol for
> > getting and setting the memory buffer.  Is getting the device memory
> > really started by reading the size, which triggers the vendor driver to
> > snapshot the state in an internal buffer which the user then iterates
> > through using GET_BUFFER?  Therefore re-reading the size field could
> > corrupt the data stream?  Wouldn't it be easier to report bytes
> > available and countdown as the user marks them read?  What does
> > position mean when we switch from snapshot to dirty data?
> When switching to device memory's dirty data, pos means the position
> within the whole dirty data.
> 
> .save_live_pending ==> driver gets dirty data in device memory and returns
> total size.
> 
> .save_live_iterate ==> userspace reads all dirty data from device memory
> region chunk by chunk
> 
> So, in one iteration, all dirty data is saved.
> Then in the next iteration, dirty data is recalculated.
> 
> 
> > 
> > > 	} device_memory;
> > > 	struct {
> > > 		__u64 start_addr; /* wo */
> > > 		__u64 page_nr;   /* wo */
> > > 	} system_memory;
> > > };
> > 
> > Why is one specified as an address and the other as pages?  Note that
> Yes, start_addr ==> start_pfn is better.
> 
> > Kirti's implementation has an optimization to know how many pages are
> > set within a range to avoid unnecessary buffer reads.
> > 
> 
> Let's use start_pfn_all, page_nr_all to represent the start pfn and
> page_nr passed in from the qemu .log_sync interface,
> 
> and start_pfn_i, page_nr_i for the values passed to the driver.
> 
> 
> start_pfn_all
>   |         start_pfn_i
>   |         |
>  \ /_______\_/_____________________________
>   |    |    |////|    |    |    |    |    |
>   |____|____|////|____|____|____|____|____|
>             :    :
>             :    :
>             :____:
> bitmap      |    |
> region:     |____|
>            
> 
> 1. Each time QEMU queries the dirty bitmap from the driver, it passes
> in start_pfn_i and page_nr_i (page_nr_i is the maximum number of pages
> the bitmap region can describe).
> 2. driver queries the memory range starting at start_pfn_i with size
> page_nr_i.
> 3. driver returns a bitmap (if there is no dirty data, the bitmap is
> all 0).
> 4. QEMU saves the pages according to the bitmap.
> 
> If there's no dirty data found in step 2, step 4 can be skipped.
> (I'll add this check before step 4 in the future, thanks.)
> But if there's even 1 bit set in the bitmap, no step from 1-4 can be
> skipped.
> 
> Honestly, after reviewing Kirti's implementation, I don't think it's an
> optimization. As in the pseudo code below for Kirti's code, I would
> think copied_pfns corresponds to page_nr_i in my case. So, is the case
> of copied_pfns equaling 0 for the tail chunk? I don't think it's
> working...
> 
> write start_pfn to driver
> write page_size  to driver
> write pfn_count to driver
> 
> do {
>     read copied_pfns from driver.
>     if (copied_pfns == 0) {
>        break;
>     }
>    bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
>    buf = get bitmap from driver
>    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>                                            (start_pfn + count) * page_size,
>                                                 copied_pfns);
> 
>      count +=  copied_pfns;
> 
> } while (count < pfn_count);
> 
> 
> 
> > > 
> > > Device States
> > > -------------
> > > After migration is initialized, it will set device state via writing to
> > > device_state field of control region.
> > > 
> > > Four states are defined for a VFIO device:
> > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > 
> > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > >         commands from device driver.
> > >         It is the default state that a VFIO device enters initially.
> > > 
> > > STOP:  In this state, a VFIO device is deactivated to interact with
> > >        device driver.
> > > 
> > > LOGGING: a special state that it CANNOT exist independently. It must be
> > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > >        STOP & LOGGING).
> > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > >        driver can start dirty data logging for device memory and system
> > >        memory.
> > >        LOGGING only impacts device/system memory. They return whole
> > >        snapshot outside LOGGING and dirty data since last get operation
> > >        inside LOGGING.
> > >        Device config should be always accessible and return whole config
> > >        snapshot regardless of LOGGING state.
> > >        
> > > Note:
> > > The reason why RUNNING is the default state is that device's active state
> > > must not depend on device state interface.
> > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > In that condition, a device needs to be in the active state by default.
> > > 
> > > Get Version & Get Caps
> > > ----------------------
> > > On migration init phase, qemu will probe the existence of device state
> > > regions of vendor driver, then get version of the device state interface
> > > from the r/w control region.
> > > 
> > > Then it will probe VFIO device's data capability by reading caps field of
> > > control region.
> > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > >         device memory in pre-copy and stop-and-copy phase. The data of
> > >         device memory is held in device memory region.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > 
> > As above, these capabilities seem redundant to the existence of the
> > device specific regions in this implementation.
> >
> seems so :)
> 
> > > If failing to find two mandatory regions and optional data regions
> > > corresponding to data caps or version mismatching, it will setup a
> > > migration blocker and disable live migration for VFIO device.
> > > 
> > > 
> > > Flows to call device state interface for VFIO live migration
> > > ------------------------------------------------------------
> > > 
> > > Live migration save path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_SAVE_SETUP
> > >  |
> > >  .save_setup callback -->
> > >  get device memory size (whole snapshot size)
> > >  get device memory buffer (whole snapshot data)
> > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > >  .save_live_pending callback --> get device memory size (dirty data)
> > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > >  .log_sync callback --> get system memory dirty bitmap
> > >  |
> > > (vcpu stops) --> set device state -->
> > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > .save_live_complete_precopy callback -->
> > >  get device memory size (dirty data)
> > >  get device memory buffer (dirty data)
> > >  get device config size (whole snapshot size)
> > >  get device config buffer (whole snapshot data)
> > >  |
> > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > MIGRATION_STATUS_CANCELLED or
> > > MIGRATION_STATUS_FAILED
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > 
> > > 
> > > Live migration load path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > > .load state callback -->
> > >  set device memory size, set device memory buffer, set device config size,
> > >  set device config buffer
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > 
> > > 
> > > In source VM side,
> > > In precopy phase,
> > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > qemu will first get whole snapshot of device memory in .save_setup
> > > callback, and then it will get total size of dirty data in device memory in
> > > .save_live_pending callback by reading device_memory.size field of control
> > > region.
> > 
> > This requires iterative reads of device memory buffer but the protocol
> > is unclear (to me) how the user knows how to do this or interact with
> > the position field. 
> > 
> > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > dirty data chunk by chunk from device memory region by writing pos &
> > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > control region. (size of each chunk is the size of device memory data
> > > region).
> > 
> > What if there's not enough dirty data to fill the region?  The data is
> > always padded to fill the region?
> >
> I think dirty data in the vendor driver is organized in a format like:
> (addr0, len0, data0, addr1, len1, data1, addr2, len2, data2, ... addrN,
> lenN, dataN).
> For a full snapshot, it's like (addr0, len0, data0).
> So, to userspace and the data region, it doesn't matter whether it's a
> full snapshot or dirty data.
> 
> 
> > > .save_live_pending and .save_live_iteration may be called several times in
> > > precopy phase to get dirty data in device memory.
> > > 
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > vendor driver's device state interface to get data from device memory.
> > 
> > Therefore through the entire precopy phase we have no data from source
> > to target to begin a compatibility check :-\  I think both proposals
> > currently still lack any sort of device compatibility or data
> > versioning check between source and target.  Thanks,
> I checked the compatibility, though it's not good enough :)
> 
> In migration_init, vfio_check_devstate_version() checks the version from
> the kernel against VFIO_DEVICE_STATE_INTERFACE_VERSION on both source
> and target, and on the target side, vfio_load_state() checks the source
> side's version.
> 
> 
> int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> {
>     ...
>     if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
>         return -EINVAL;
>     }
>     ...
> }
> 
> Thanks
> Yan
> 
> > Alex
> > 
> > > In precopy phase, if a device has VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY on,
> > > .log_sync callback will get system memory dirty bitmap from dirty bitmap
> > > region by writing system memory's start address, page count and action 
> > > (GET_BITMAP) to "system_memory.start_addr", "system_memory.page_nr", and
> > > "system_memory.action" fields of control region.
> > > If page count passed in .log_sync callback is larger than the bitmap size
> > > the dirty bitmap region supports, Qemu will cut it into chunks and call
> > > vendor driver's get system memory dirty bitmap interface.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is off, .log_sync callback just
> > > returns without call to vendor driver.
> > > 
> > > In stop-and-copy phase, device state will be set to STOP & LOGGING first.
> > > in save_live_complete_precopy callback,
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on,
> > > get device memory size and get device memory buffer will be called again.
> > > After that,
> > > device config data is read from the device config region by reading
> > > device_config.size of the control region and writing action (GET_BUFFER) to
> > > device_config.action of the control region.
> > > Then after migration completes, in the cleanup handler, LOGGING state will be
> > > cleared (i.e. device state is set to STOP).
> > > Clearing LOGGING state in cleanup handler is in consideration of the case
> > > of "migration failed" and "migration cancelled". They can also leverage
> > > the cleanup handler to unset LOGGING state.
> > > 
> > > 
> > > References
> > > ----------
> > > 1. kernel side implementation of Device state interfaces:
> > > https://patchwork.freedesktop.org/series/56876/
> > > 
> > > 
> > > Yan Zhao (5):
> > >   vfio/migration: define kernel interfaces
> > >   vfio/migration: support device of device config capability
> > >   vfio/migration: tracking of dirty page in system memory
> > >   vfio/migration: turn on migration
> > >   vfio/migration: support device memory capability
> > > 
> > >  hw/vfio/Makefile.objs         |   2 +-
> > >  hw/vfio/common.c              |  26 ++
> > >  hw/vfio/migration.c           | 858 ++++++++++++++++++++++++++++++++++++++++++
> > >  hw/vfio/pci.c                 |  10 +-
> > >  hw/vfio/pci.h                 |  26 +-
> > >  include/hw/vfio/vfio-common.h |   1 +
> > >  linux-headers/linux/vfio.h    | 260 +++++++++++++
> > >  7 files changed, 1174 insertions(+), 9 deletions(-)
> > >  create mode 100644 hw/vfio/migration.c
> > > 
> > 
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-25  2:22     ` [Qemu-devel] " Zhao Yan
@ 2019-03-07 17:44       ` Alex Williamson
  -1 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-03-07 17:44 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, kevin.tian, dgilbert,
	intel-gvt-dev, changpeng.liu, cohuck, zhi.a.wang,
	jonathan.davies

Hi Yan,

Sorry for the delay, I've been on PTO...

On Sun, 24 Feb 2019 21:22:56 -0500
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> > Hi Yan,
> > 
> > Thanks for working on this!
> > 
> > On Tue, 19 Feb 2019 16:50:54 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > > 
> > > Device Data
> > > -----------
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > >         Every device is supposed to possess device config data.
> > >         Usually device config's size is small (no bigger than 10M), and it
> > 
> > I'm not sure how we can really impose a limit here, it is what it is
> > for a device.  A smaller state is obviously desirable to reduce
> > downtime, but some devices could have very large states.
> >   
> > >         needs to be loaded in certain strict order.
> > >         Therefore, device config only needs to be saved/loaded in
> > >         stop-and-copy phase.
> > >         The data of device config is held in device config region.
> > >         Size of device config data is smaller than or equal to that of
> > >         device config region.  
> > 
> > So the intention here is that this is the last data read from the
> > device and it's done in one pass, so the region needs to be large
> > enough to expose all config data at once.  On restore it's the last
> > data written before switching the device to the run state.
> >   
> > > 
> > > Device Memory: device's internal memory, standalone and outside system  
> > 
> > s/system/VM/
> >   
> > >         memory. It is usually very big.  
> > 
> > Or it doesn't exist.  Not sure we should be setting expectations since
> > it will vary per device.
> >   
> > >         This kind of data needs to be saved / loaded in pre-copy and
> > >         stop-and-copy phase.
> > >         The data of device memory is held in device memory region.
> > >         Size of device memory is usually larger than that of device
> > >         memory region. qemu needs to save/load it in chunks of size of
> > >         device memory region.
> > >         Not all device has device memory. Like IGD only uses system memory.  
> > 
> > It seems a little gratuitous to me that this is a separate region or
> > that this data is handled separately.  All of this data is opaque to
> > QEMU, so why do we need to separate it?  
> hi Alex,
> As the device state interfaces are provided by the kernel, they are
> expected to meet needs as general as possible. So, do you think there
> are use cases where user space knows the device well and wants the
> kernel to return specific data back to it?
> E.g. it just wants to get the whole device config data including all
> MMIOs, page tables, PCI config data...
> Or, it just wants to get the current device memory snapshot, not
> including any dirty data.
> Or, it just needs the dirty pages in device memory or system memory.
> With all these precise queries, quite a lot of useful features can be
> developed in user space.
> 
> If all of this data is opaque to the user app, it seems the only use
> case is live migration.

I can certainly appreciate a more versatile interface, but I think
we're also trying to create the most simple interface we can, with the
primary target being live migration.  As soon as we start defining this
type of device memory and that type of device memory, we're going to
have another device come along that needs yet another because they have
a slightly different requirement.  Even without that, we're going to
have vendor drivers implement it differently, so what works for one
device for a more targeted approach may not work for all devices.  Can
you enumerate some specific examples of the use cases you imagine your
design to enable?

> From another aspect, if the final solution is to keep the data opaque to
> user space, like what NVIDIA did, the kernel side's implementation will
> be more complicated, and actually a bit of a challenge for the vendor
> driver. In that case, in the pre-copy phase:
> 1. when not in LOGGING state, the vendor driver first returns full data
> including a full device memory snapshot

When we're not LOGGING, does the vendor driver need to return
anything?  It seems that LOGGING could be considered an enable switch
for the interface.

> 2. user space reads some data (you can't expect it to finish reading all
> the data)
> 3. then userspace sets the device state to LOGGING to start dirty data
> logging
> 4. vendor driver starts dirty data logging, appends the dirty data to
> the tail of the remaining unread full data, and increases the pending
> data size?

It seems a lot of overhead to expect the vendor driver to consider
state read by the user prior to LOGGING being enabled.  Does it log
those changes forever?  It seems like we should consider LOGGING
enabled to be a "session".

> 5. user space keeps reading data.
> 6. vendor driver keeps appending new dirty data to the tail of the
> remaining unread full/dirty data and increases the pending data size?

Until the device is stopped it can always generate new pending data and
the size of that pending data needs to be considered volatile by the
user, right?  What's different here?  This all seems to factor into
when the user decides whether the migration is converging and whether
to transition to the stopped phase to force that convergence.

> In the stop-and-copy phase:
> 1. user space sets the device state to exit the LOGGING state,
> 2. vendor driver stops data logging. It has to append device config
>    data at the tail of the remaining dirty data unread by userspace.
> 
> During this flow, when should the vendor driver get dirty data? Does it
> just keep logging and appending to the tail? How does it ensure dirty
> data is fresh before the LOGGING state exits? How does the vendor
> driver know whether certain dirty data has been copied or not?

At stop-and-copy, I'd assume LOGGING remains enabled, only adding STOP,
such that the device does not generate new data, but perhaps I've
forgotten the details on vacation.  As above, I'd think we'd want to
bound any sort of dirty state tracking to a session bounded by the
LOGGING state.  The protocol defined with userspace needs to account
for determining what the user has and has not read, for instance to
support mmap'd data, a trapped interface needs to be used to setup the
data and acknowledge a read of that data.
 
> I've no idea how NVIDIA handles this problem, and they don't open their
> kernel side code.
> I just feel it's a bit hard for other vendor drivers to follow :)

Their interface proposal is available on the list, I don't have access
to their proprietary driver either, but I expect the best ideas from
each proposal to be combined into a unified solution.

> > > System memory dirty pages: If a device produces dirty pages in system
> > >         memory, it is able to get dirty bitmap for certain range of system
> > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > >         callback, dirty pages in system memory will be save/loaded by ram's
> > >         live migration code.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > >         If system memory range is larger than that dirty bitmap region can
> > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > >         succession.
> > > 
> > > 
> > > Device State Regions
> > > --------------------
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > > 
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > >         Get access via read/write system call.
> > >         Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.  
> > 
> > Is mmap mandatory?  I would think this would be defined by the mdev
> > device what access they want to support per region.  We don't want to
> > impose a more complicated interface if the device doesn't require it.  
> I think it's "mmap is preferred, but allowed to fail".
> Just like a normal region with the MMAP flag on (like BAR regions), we
> also allow its mmap to fail, right?

Currently mmap support for any region is optional both from the vendor
driver and the user.  The vendor driver may or may not support mmap of
a region (or subset of region with sparse mmap) and the user may or may
not make use of mmap if it is available.  The question here was whether
this interface requires the vendor driver to support mmap of these
device specific regions.

> > >         device config region: mandatory, holding data of device config
> > >         device memory region: optional, holding data of device memory
> > >         dirty bitmap region: optional, holding bitmap of system memory
> > >                             dirty pages
> > > 
> > > (The reason why four separate regions are defined is that the unit of mmap
> > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmaped regions for data seems better than one big region
> > > padded and sparse mmaped).  
> > 
> > It's not obvious to me how this is better, a big region isn't padded,
> > there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
> > gap in a file really of any consequence?  Each region beyond the header
> > is more than likely larger than PAGE_SIZE, therefore they can be nicely
> > aligned together.  We still need fields to tell us how much data is
> > available in each area, so another to tell us the start of each area is
> > a minor detail.  And I think we still want to allow drivers to specify
> > which parts of which areas support mmap, so I don't think we're getting
> > away from sparse mmap support.  
> 
> With separate regions and sub-region types defined, user space can
> explicitly tell which region is which after vfio_get_dev_region_info().
> Along with that, user space knows each region's offset and size. mmap is
> allowed to fail and falls back to normal read/write to the region.
> 
> But with one big region and sparse mmapped subregions (1 data subregion
> or 3 data subregions, whatever), userspace can't tell which subregion is
> which.

Of course they can, this is part of defining the header structure.  One
region could define a header including config_offset, config_size,
memory_offset, memory_size, dirty_offset, dirty_size.  Notice how Kirti
even uses the same area to support all of these (which leaves some
issues with vendor driver flexibility, but at least shows this is
possible).

> So, if using one big region, I think we need to explicitly define the
> subregions' sequence (like index 0 is dedicated to the control
> subregion, index 1 is for the device config data subregion, ...). The
> vendor driver cannot freely change the sequence.
> Then keep the data offset the same as region->mmaps[i].offset, and the
> data size the same as region->mmaps[i].size (i.e. let the actual data
> start immediately from the first byte of its data subregion).
> Also, mmaps for sparse mmapped subregions are not allowed to fail.

This doesn't make any sense to me, the vendor driver can define the
start and size of each area within the region with simple header
fields.  We don't need fixed sequence fields.  Likewise the sparse mmap
capability for the region can define which of those areas within the
region support mmap.  The mmap can be optional for both vendor driver
and user, just as it is elsewhere.  The header fields defining the
sub-areas can be read-only to the user, the sparse mmap only needs to
match what the vendor driver defines and supports.
 
> With one big region, we also need to consider the case when the vendor
> driver does not want the data subregion to be mmapped.
> So, what is the data layout in that case?

Vendor driver defines data_offset and data_size, sparse mmap capability
does not list that area as mmap capable.

> Does the data subregion immediately follow the control subregion, or not?

The header needs to begin at offset zero, the layout of the rest is
defined by the vendor driver within this header.  I believe this is
(mostly) implemented in Kirti's version.

> Of course, for this condition, we can specify the data field's start
> offset and size through the control region. And we must not expect the
> data start offsets in source and target to be equal.
> (because the big region's fd_offset
> may vary in source and target. consider the case when both source and
> target have one opregion and one device state region, but source has
> opregion in the first and target has device state region in the first.
> If we think this case is illegal, we must be able to detect it in the first
> place).
> Also, we must keep the start offset and size consistent with the above mmap
> case.

AFAICT, these are all non-issues.  Please look at Kirti's proposal.
The (one) migration region can define a header at offset zero that
allows the vendor driver to define where within that region the data,
config, and dirty bitmap areas are and the sparse mmap capability
defines which of those are mmap capable.  Clearly this migration region
offset within the vfio device file descriptor is independent between
source and target, as are the offsets of the sub-areas within the
migration region.  These all need to be defined as part of the
migration protocol where the source and target implement the same
protocol but have no requirement to be absolutely identical.
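For illustration, a header along those lines (field names are hypothetical, not taken from Kirti's actual patches) might look like:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: a migration region header at offset 0 whose
 * read-only fields tell userspace where each sub-area lives within
 * this same region.  Offsets are relative to the region start, so
 * they are independent of where the region sits in the device fd
 * on source vs. target. */
struct vfio_migration_region_header {
    uint32_t version;        /* ro: migration interface version */
    uint32_t device_state;   /* rw: RUNNING/STOP/LOGGING bits */
    uint64_t config_offset;  /* ro: device config area */
    uint64_t config_size;
    uint64_t memory_offset;  /* ro: device memory area, 0 if absent */
    uint64_t memory_size;
    uint64_t dirty_offset;   /* ro: dirty bitmap area, 0 if absent */
    uint64_t dirty_size;
};
```

Since every offset is relative to the region itself, source and target can place the region anywhere in their respective device file descriptors without affecting the protocol.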

> > > kernel device state interface [1]
> > > --------------------------------------
> > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2  
> > 
> > If we were to go with this multi-region solution, isn't it evident from
> > the regions exposed that device memory and a dirty bitmap are
> > provided?  Alternatively, I believe Kirti's proposal doesn't require  
> 
> > this distinction between device memory and device config, a device not
> > requiring runtime migration data would simply report no data until the
> > device moved to the stopped state, making it consistent path for
> > userspace.  Likewise, the dirty bitmap could report a zero page count
> > in the bitmap rather than branching based on device support.  
> If the path in userspace is consistent for device config and device
> memory, there will be many unnecessary call of getting data size into
> vendor driver.

Consistency seems like a good thing: it makes the code simpler, since we
don't behave differently in one case versus another.  If the vendor
reports no data, skip.  It also provides versatility.  Are the "many
unnecessary call[s]" quantifiable?

> > Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
> > VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
> > consistency in the naming.
> >   
> > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > #define VFIO_DEVICE_STATE_STOP 1
> > > #define VFIO_DEVICE_STATE_LOGGING 2  
> > 
> > It looks like these are being defined as bits, since patch 1 talks
> > about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
> > posted some comments about this.  I'm not sure anything prevents us
> > from defining RUNNING a 1 and STOPPED as 0 so we don't have the
> > polarity flip vs LOGGING though.
> > 
> > The state "STOP & LOGGING" also feels like a strange "device state", if
> > the device is stopped, it's not logging any new state, so I think this
> > is more that the device state is STOP, but the LOGGING feature is
> > active.  Maybe we should consider these independent bits.  LOGGING is
> > active as we stop a device so that we can fetch the last dirtied pages,
> > but disabled as we load the state of the device into the target.
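As a sketch of the independent-bits idea (the bit values are illustrative only, not a proposed ABI):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative: RUNNING as bit 0 avoids the polarity flip (0 simply
 * means stopped), and LOGGING is an orthogonal feature bit rather
 * than a third state that must be combined by convention. */
#define VFIO_DEVICE_STATE_RUNNING (1u << 0)
#define VFIO_DEVICE_STATE_LOGGING (1u << 1)

/* Reject writes that set any undefined bit. */
static inline int state_valid(uint32_t state)
{
    return (state & ~(VFIO_DEVICE_STATE_RUNNING |
                      VFIO_DEVICE_STATE_LOGGING)) == 0;
}
```

With this encoding "STOP & LOGGING" is just LOGGING set while RUNNING is clear, so no extra state value is needed.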
> >   
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > 
> > > struct vfio_device_state_ctl {
> > > 	__u32 version;		  /* ro */
> > > 	__u32 device_state;       /* VFIO device state, wo */
> > > 	__u32 caps;		 /* ro */
> > >         struct {
> > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;    /*rw*/
> > > 	} device_config;  
> > 
> > Patch 1 indicates that to get the config buffer we write GET_BUFFER to
> > action and read from the config region.  The size is previously read
> > and apparently constant.  To set the config buffer, the config region
> > is written followed by writing SET_BUFFER to action.  Why is size
> > listed as read-write?  
> this is the size of config data.
> size of config data <= size of config data region.

Where in the usage protocol does the user WRITE the config data size?

> > Doesn't this protocol also require that the mdev driver consume each
> > full region's worth of host kernel memory for backing pages in
> > anticipation of a rare event like migration?  This might be a strike
> > against separate regions if the driver needs to provide backing pages
> > for 3 separate regions vs 1.  To avoid this runtime overhead, would it
> > be expected that the user only mmap the regions during migration and
> > the mdev driver allocate backing pages on mmap?  Should the mmap be
> > restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
> > how the mdev driver would back these mmap'd pages.
> >  
> yes, 3 separate regions consume a little more memory than 1 region,
> but it's just a little overhead.
> As in Intel's kernel implementation,
> the device config region's size is 9M, and the dirty bitmap region's size is 16k.
> if there is a device memory region, its size can be defined as 100M?
> so it's 109M vs 100M ?

But what if it's 100M config with no device memory?  This proposal
requires 100M in-kernel backing due to the definition of the config
region when it could be implemented with significantly less by allowing
a small data area to be read multiple times until a bytes remaining
counter becomes zero.
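The bytes-remaining scheme described above can be sketched as follows; the fake_device here simulates a vendor driver exposing a small data window plus a pending_bytes counter (both names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: drain an arbitrarily large data stream through a small
 * fixed window by re-reading until a bytes-remaining counter hits
 * zero.  A real user would read the counter and window from the
 * migration region instead of this simulated device. */
#define WINDOW_SIZE 8

struct fake_device {
    const uint8_t *data;
    uint64_t pending_bytes;   /* counts down as the user reads */
};

/* Returns how many bytes were copied into buf (<= WINDOW_SIZE). */
static uint64_t read_window(struct fake_device *dev, uint8_t *buf)
{
    uint64_t n = dev->pending_bytes < WINDOW_SIZE ?
                 dev->pending_bytes : WINDOW_SIZE;
    memcpy(buf, dev->data, n);
    dev->data += n;
    dev->pending_bytes -= n;
    return n;
}

static uint64_t drain(struct fake_device *dev, uint8_t *out)
{
    uint8_t window[WINDOW_SIZE];
    uint64_t total = 0;
    while (dev->pending_bytes) {       /* loop until counter is zero */
        uint64_t n = read_window(dev, window);
        memcpy(out + total, window, n);
        total += n;
    }
    return total;
}
```

The kernel only ever needs WINDOW_SIZE bytes of backing pages, regardless of how large the config data is.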

> > > 	struct {
> > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;     /* rw */  
> > >                 __u64 pos; /*the offset in total buffer of device memory*/  
> > 
> > Patch 1 outlines the protocol here that getting device memory begins
> > with writing the position field, followed by reading from the device
> > memory region.  Setting device memory begins with writing the data to
> > the device memory region, followed by writing the position field.  Why
> > does the user need to have visibility of data position?  This is opaque
> > data to the user, the device should manage how the chunks fit together.
> > 
> > How does the user know when they reach the end?  
> sorry, maybe I didn't explain clearly here.
> 
> device  ________________________________________
> memory  |    |    |////|    |    |    |    |    |
> data:   |____|____|////|____|____|____|____|____|
>                   :pos :
>                   :    :
> device            :____:
> memory            |    |
> region:           |____|
> 
> the whole sequence is like this:
> 
> 1. user space reads device_memory.size
> 2. driver gets device memory's data (full snapshot or dirty data, depending
> on whether it's in LOGGING state or not), and returns the total size of
> this data. 
> 3. user space finishes reading device_memory.size (>= device memory
> region's size)
> 
> 4. user space starts a loop like
>   
>    while (pos < total_len) {
>         uint64_t len = region_devmem->size;
> 
>         if (pos + len >= total_len) {
>             len = total_len - pos;
>         }
>         if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
>             return -1;
>         }
> 	pos += len;
>     }
> 
>  vfio_save_data_device_memory_chunk reads each chunk from device memory
>  region by writing GET_BUFFER  to device_memory.action, and pos to
>  device_memory.pos.
> 
> 
> so, each time, userspace will finish getting the device memory data in one
> pass.
> 
> specifying "pos" is just like the "lseek" before "write".

This could also be implemented as a remaining bytes counter in the
interface where the vendor driver wouldn't rely on the user to manage
the position.  What internal consistency checking is going to protect
the host kernel when the user writes data to the wrong position?  If we
consider the data to be opaque, the vendor driver can embed that sort
of meta data into the data blob the user reads and reassemble it
correctly or generate a consistency failure itself.
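A vendor driver could guard against a bad user-written position with a bounds check along these lines (purely illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative: validate a user-supplied position and chunk length
 * before using them, so a wrong pos can never index outside the
 * data the driver has staged. */
static int pos_valid(uint64_t pos, uint64_t chunk_len, uint64_t total_len)
{
    if (pos >= total_len)
        return 0;
    if (chunk_len > total_len - pos)   /* overflow-safe upper bound */
        return 0;
    return 1;
}
```

Even so, this only catches out-of-range writes, not a chunk written to the wrong (but in-range) position, which is why embedding metadata in the opaque blob is the more robust option.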

> > Bullets 8 and 9 in patch 1 also discuss setting and getting the device
> > memory size, but these aren't well integrated into the protocol for
> > getting and setting the memory buffer.  Is getting the device memory
> > really started by reading the size, which triggers the vendor driver to
> > snapshot the state in an internal buffer which the user then iterates
> > through using GET_BUFFER?  Therefore re-reading the size field could
> > corrupt the data stream?  Wouldn't it be easier to report bytes
> > available and countdown as the user marks them read?  What does
> > position mean when we switch from snapshot to dirty data?  
> when switching to device memory's dirty data, pos means the position within
> the whole dirty data.
> 
> .save_live_pending ==> driver gets dirty data in device memory and returns
> total size.
> 
> .save_live_iterate ==> userspace reads all dirty data from device memory
> region chunk by chunk
> 
> So, in an iteration, all dirty data are saved.
> then in next iteration, dirty data is recalculated.
> 
> > > 	} device_memory;
> > > 	struct {
> > > 		__u64 start_addr; /* wo */
> > > 		__u64 page_nr;   /* wo */
> > > 	} system_memory;
> > > };  
> > 
> > Why is one specified as an address and the other as pages?  Note that  
> Yes, start_addr ==> start pfn is better
> 
> > Kirti's implementation has an optimization to know how many pages are
> > set within a range to avoid unnecessary buffer reads.
> >   
> 
> Let's use start_pfn_all, page_nr_all to represent the start pfn and
> page_nr passed in from the qemu .log_sync interface,
> 
> and use start_pfn_i, page_nr_i for the values passed to the driver.
> 
> 
> start_pfn_all
>   |         start_pfn_i
>   |         |
>  \ /_______\_/_____________________________
>   |    |    |////|    |    |    |    |    |
>   |____|____|////|____|____|____|____|____|
>             :    :
>             :    :
>             :____:
> bitmap      |    |
> region:     |____|
>            
> 
> 1. Each time QEMU queries the dirty bitmap from the driver, it passes in
> start_pfn_i and page_nr_i (page_nr_i is the largest number of pages the
> bitmap region can hold).
> 2. The driver queries the memory range starting at start_pfn_i with size
> page_nr_i.
> 3. The driver returns a bitmap (if there is no dirty data, the bitmap is
> all 0).
> 4. QEMU saves the pages according to the bitmap.
> 
> If no dirty data is found in step 2, step 4 can be skipped.
> (I'll add this check before step 4 in the future, thanks.)
> but if there's even 1 bit set in the bitmap, none of steps 1-4 can be
> skipped.
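The chunking in step 1 can be sketched as follows (the callback stands in for the per-chunk query of steps 2-4; the interface is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of step 1: walk [start_pfn_all, start_pfn_all + page_nr_all)
 * in chunks no larger than what the bitmap region can hold, invoking
 * the per-chunk query (steps 2-4) for each.  Returns the number of
 * chunk queries issued. */
static uint64_t query_chunks(uint64_t start_pfn_all, uint64_t page_nr_all,
                             uint64_t region_capacity_pages,
                             void (*query)(uint64_t pfn, uint64_t nr))
{
    uint64_t done = 0, calls = 0;
    while (done < page_nr_all) {
        uint64_t nr = page_nr_all - done;
        if (nr > region_capacity_pages)
            nr = region_capacity_pages;
        query(start_pfn_all + done, nr);
        done += nr;
        calls++;
    }
    return calls;
}

/* Demo callback: just count how many pages were covered. */
static uint64_t pages_seen;
static void count_pages(uint64_t pfn, uint64_t nr)
{
    (void)pfn;
    pages_seen += nr;
}
```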
> 
> Honestly, after reviewing Kirti's implementation, I don't think it's an
> optimization. In the pseudo code below for Kirti's code, I would think the
> copied_pfns corresponds to the page_nr_i in my case. so, is the case of
> copied_pfns equaling 0 meant for the tail chunk? I don't think it's working...
> 
> write start_pfn to driver
> write page_size  to driver
> write pfn_count to driver
> 
> do {
>     read copied_pfns from driver.
>     if (copied_pfns == 0) {
>        break;
>     }
>    bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
>    buf = get bitmap from driver
>    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>                                            (start_pfn + count) * page_size,
>                                                 copied_pfns);
> 
>      count +=  copied_pfns;
> 
> } while (count < pfn_count);

The intent of Kirti's copied_pfns is clearly to avoid unnecessarily
reading pages from the kernel when nothing has changed.  Perhaps the
implementation still requires work, but I don't see from above how
that's not considered an optimization.

> > > 
> > > Device States
> > > ------------- 
> > > After migration is initialized, it will set device state via writing to
> > > device_state field of control region.
> > > 
> > > Four states are defined for a VFIO device:
> > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > 
> > > RUNNING: In this state, a VFIO device is in active state ready to receive
> > >         commands from device driver.
> > >         It is the default state that a VFIO device enters initially.
> > > 
> > > STOP:  In this state, a VFIO device is deactivated to interact with
> > >        device driver.
> > > 
> > > LOGGING: a special state that it CANNOT exist independently. It must be
> > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > >        STOP & LOGGING).
> > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > >        driver can start dirty data logging for device memory and system
> > >        memory.
> > >        LOGGING only impacts device/system memory: outside LOGGING they
> > >        return a whole snapshot; inside LOGGING they return dirty data
> > >        since the last get operation.
> > >        Device config should be always accessible and return whole config
> > >        snapshot regardless of LOGGING state.
> > >        
> > > Note:
> > > The reason why RUNNING is the default state is that a device's active
> > > state must not depend on the device state interface.
> > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > In that condition, a device needs to be in an active state by default. 
> > > 
> > > Get Version & Get Caps
> > > ----------------------
> > > On migration init phase, qemu will probe the existence of device state
> > > regions of vendor driver, then get version of the device state interface
> > > from the r/w control region.
> > > 
> > > Then it will probe VFIO device's data capability by reading caps field of
> > > control region.
> > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > >         device memory in pre-copy and stop-and-copy phase. The data of
> > >         device memory is held in device memory region.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query of dirty pages
> > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.  
> > 
> > As above, these capabilities seem redundant to the existence of the
> > device specific regions in this implementation.
> >  
> seems so :)
> 
> > > If it fails to find the two mandatory regions or the optional data
> > > regions corresponding to the data caps, or if versions mismatch, it will
> > > set up a migration blocker and disable live migration for the VFIO device.
> > > 
> > > 
> > > Flows to call device state interface for VFIO live migration
> > > ------------------------------------------------------------
> > > 
> > > Live migration save path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_SAVE_SETUP
> > >  |
> > >  .save_setup callback -->
> > >  get device memory size (whole snapshot size)
> > >  get device memory buffer (whole snapshot data)
> > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > >  .save_live_pending callback --> get device memory size (dirty data)
> > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > >  .log_sync callback --> get system memory dirty bitmap
> > >  |
> > > (vcpu stops) --> set device state -->
> > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > .save_live_complete_precopy callback -->
> > >  get device memory size (dirty data)
> > >  get device memory buffer (dirty data)
> > >  get device config size (whole snapshot size)
> > >  get device config buffer (whole snapshot data)
> > >  |
> > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > MIGRATION_STATUS_CANCELLED or
> > > MIGRATION_STATUS_FAILED
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > 
> > > 
> > > Live migration load path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > > .load state callback -->
> > >  set device memory size, set device memory buffer, set device config size,
> > >  set device config buffer
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > 
> > > 
> > > In source VM side,
> > > In precopy phase,
> > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > qemu will first get whole snapshot of device memory in .save_setup
> > > callback, and then it will get total size of dirty data in device memory in
> > > .save_live_pending callback by reading device_memory.size field of control
> > > region.  
> > 
> > This requires iterative reads of device memory buffer but the protocol
> > is unclear (to me) how the user knows how to do this or interact with
> > the position field. 
> >   
> > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > dirty data chunk by chunk from device memory region by writing pos &
> > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > control region. (size of each chunk is the size of device memory data
> > > region).  
> > 
> > What if there's not enough dirty data to fill the region?  The data is
> > always padded to fill the region?
> >  
> I think dirty data in the vendor driver is organized in a format like:
> (addr0, len0, data0, addr1, len1, data1, addr2, len2, data2, ... addrN,
> lenN, dataN).
> for a full snapshot, it's like (addr0, len0, data0).
> so, to userspace and the data region, it doesn't matter whether it's a full
> snapshot or dirty data.
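That self-describing (addr, len, data) layout could be sketched as a small header preceding each blob (illustrative, not the actual in-kernel format):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative: each chunk of device-memory data carries its own
 * (addr, len) header, so userspace can treat the stream as opaque
 * whether it is a full snapshot (one record) or dirty data (many). */
struct devmem_record {
    uint64_t addr;   /* device-internal address of this blob */
    uint64_t len;    /* bytes of data that follow the header */
    /* uint8_t data[len] follows */
};

/* Total on-wire size of one record including its payload. */
static uint64_t record_total_size(uint64_t len)
{
    return sizeof(struct devmem_record) + len;
}
```

Because each record is self-describing, the target can reassemble or sanity-check the stream without any side-channel position field.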
> 
> 
> > > .save_live_pending and .save_live_iteration may be called several times in
> > > precopy phase to get dirty data in device memory.
> > > 
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > vendor driver's device state interface to get data from device memory.  
> > 
> > Therefore through the entire precopy phase we have no data from source
> > to target to begin a compatibility check :-\  I think both proposals
> > currently still lack any sort of device compatibility or data
> > versioning check between source and target.  Thanks,  
> I checked the compatibility, though not good enough:)
> 
> in migration_init, vfio_check_devstate_version() checked the version from
> the kernel against VFIO_DEVICE_STATE_INTERFACE_VERSION on both source and
> target, and on the target side, vfio_load_state() checked the source side's
> version.
> 
> 
> int vfio_load_state(QEMUFile *f, void *opaque, int version_id)                          
> {       
>     ...
>     if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {                                   
>         return -EINVAL;
>     } 
>     ...
> }

But this only checks that both source and target are using the same
migration interface, how do we know that they're compatible devices and
that the vendor data stream is compatible between source and target?
Whether both ends use the same migration interface is potentially not
relevant if the data stream is compatible.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-03-07 17:44       ` Alex Williamson
  0 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-03-07 17:44 UTC (permalink / raw)
  To: Zhao Yan
  Cc: qemu-devel, intel-gvt-dev, Zhengxiao.zx, yi.l.liu, eskultet,
	ziye.yang, cohuck, shuangtai.tst, dgilbert, zhi.a.wang, mlevitsk,
	pasic, aik, eauger, felipe, jonathan.davies, changpeng.liu,
	Ken.Xue, kwankhede, kevin.tian, cjia, arei.gonglei, kvm

Hi Yan,

Sorry for the delay, I've been on PTO...

On Sun, 24 Feb 2019 21:22:56 -0500
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Thu, Feb 21, 2019 at 01:40:51PM -0700, Alex Williamson wrote:
> > Hi Yan,
> > 
> > Thanks for working on this!
> > 
> > On Tue, 19 Feb 2019 16:50:54 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > This patchset enables VFIO devices to have live migration capability.
> > > Currently it does not support post-copy phase.
> > > 
> > > It follows Alex's comments on last version of VFIO live migration patches,
> > > including device states, VFIO device state region layout, dirty bitmap's
> > > query.
> > > 
> > > Device Data
> > > -----------
> > > Device data is divided into three types: device memory, device config,
> > > and system memory dirty pages produced by device.
> > > 
> > > Device config: data like MMIOs, page tables...
> > >         Every device is supposed to possess device config data.
> > >     	Usually device config's size is small (no big than 10M), and it  
> > 
> > I'm not sure how we can really impose a limit here, it is what it is
> > for a device.  A smaller state is obviously desirable to reduce
> > downtime, but some devices could have very large states.
> >   
> > >         needs to be loaded in certain strict order.
> > >         Therefore, device config only needs to be saved/loaded in
> > >         stop-and-copy phase.
> > >         The data of device config is held in device config region.
> > >         Size of device config data is smaller than or equal to that of
> > >         device config region.  
> > 
> > So the intention here is that this is the last data read from the
> > device and it's done in one pass, so the region needs to be large
> > enough to expose all config data at once.  On restore it's the last
> > data written before switching the device to the run state.
> >   
> > > 
> > > Device Memory: device's internal memory, standalone and outside system  
> > 
> > s/system/VM/
> >   
> > >         memory. It is usually very big.  
> > 
> > Or it doesn't exist.  Not sure we should be setting expectations since
> > it will vary per device.
> >   
> > >         This kind of data needs to be saved / loaded in pre-copy and
> > >         stop-and-copy phase.
> > >         The data of device memory is held in device memory region.
> > >         Size of devie memory is usually larger than that of device
> > >         memory region. qemu needs to save/load it in chunks of size of
> > >         device memory region.
> > >         Not all device has device memory. Like IGD only uses system memory.  
> > 
> > It seems a little gratuitous to me that this is a separate region or
> > that this data is handled separately.  All of this data is opaque to
> > QEMU, so why do we need to separate it?  
> hi Alex,
> as the device state interfaces are provided by kernel, it is expected to
> meet as general needs as possible. So, do you think there are such use
> cases from user space that user space knows well of the device, and
> it wants kernel to return desired data back to it.
> E.g. It just wants to get whole device config data including all mmios,
> page tables, pci config data...
> or, It just wants to get current device memory snapshot, not including any
> dirty data.
> Or, It just needs the dirty pages in device memory or system memory.
> With all this accurate query, quite a lot of useful features can be
> developped in user space.
> 
> If all of this data is opaque to user app, seems the only use case is
> for live migration.

I can certainly appreciate a more versatile interface, but I think
we're also trying to create the most simple interface we can, with the
primary target being live migration.  As soon as we start defining this
type of device memory and that type of device memory, we're going to
have another device come along that needs yet another because they have
a slightly different requirement.  Even without that, we're going to
have vendor drivers implement it differently, so what works for one
device for a more targeted approach may not work for all devices.  Can
you enumerate some specific examples of the use cases you imagine your
design to enable?

> From another aspect, if the final solution is to let the data opaque to
> user space, like what NV did, kernel side's implementation will be more
> complicated, and actually a little challenge to vendor driver.
> in that case, in pre-copy phase,
> 1. in not LOGGING state, vendor driver first returns full data including
> full device memory snapshot

When we're not LOGGING, does the vendor driver need to return
anything?  It seems that LOGGING could be considered an enable switch
for the interface.

> 2. user space reads some data (you can't expect it to finish reading all
> data)
> 3. then userspace set the device state to LOGGING to start dirty data
> logging
> 4. vendor driver starts dirty data logging, and appends the dirty data to
> the tail of remaining unread full data and increase the pending data size?

It seems a lot of overhead to expect the vendor driver to consider
state read by the user prior to LOGGING being enabled.  Does it log
those changes forever?  It seems like we should consider LOGGING
enabled to be a "session".

> 5. user space keeps reading data.
> 6. vendor driver keeps appending new dirty data to the tail of remaining
> unread full data/dirty data and increase the pending data size?

Until the device is stopped it can always generate new pending date and
the size of that pending data needs to be considered volatile by the
user, right?  What's different here?  This all seems to factor into
when the user decides whether the migration is converting and whether
to transition to stopped phase to force that convergence.

> in stop-and-copy phase
> 1. user space sets device state to exit LOGGING state,
> 2. vendor driver stops data logging. it has to append device config
>    data at the tail of remaining dirty data unread by userspace.
> 
> during this flow, when vendor driver should get dirty data? just keeps
> logging and appends to tail? how to ensure dirty data is refresh new before
> LOGGING state exit? how does vendor driver know whether certain dirty data
> is copied or not?

At stop-and-copy, I'd assume LOGGING remains enabled, only adding STOP,
such that the device does not generate new data, but perhaps I've
forgotten the details on vacation.  As above, I'd think we'd want to
bound any sort of dirty state tracking to a session bounded by the
LOGGING state.  The protocol defined with userspace needs to account
for determining what the user has and has not read, for instance to
support mmap'd data, a trapped interface needs to be used to setup the
data and acknowledge a read of that data.
 
> I've no idea how NVidia handle this problem, and they don't open their
> kernel side code. 
> just feel it's a bit hard for other vendor drivers to follow:)

Their interface proposal is available on the list, I don't have access
to their proprietary driver either, but I expect the best ideas from
each proposal to be combined into a unified solution.

> > > System memory dirty pages: If a device produces dirty pages in system
> > >         memory, it is able to get dirty bitmap for certain range of system
> > >         memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> > >         phase in .log_sync callback. By setting dirty bitmap in .log_sync
> > >         callback, dirty pages in system memory will be save/loaded by ram's
> > >         live migration code.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.
> > >         If system memory range is larger than that dirty bitmap region can
> > >         hold, qemu will cut it into several chunks and get dirty bitmap in
> > >         succession.
> > > 
> > > 
> > > Device State Regions
> > > --------------------
> > > Vendor driver is required to expose two mandatory regions and another two
> > > optional regions if it plans to support device state management.
> > > 
> > > So, there are up to four regions in total.
> > > One control region: mandatory.
> > >         Get access via read/write system call.
> > >         Its layout is defined in struct vfio_device_state_ctl
> > > Three data regions: mmaped into qemu.  
> > 
> > Is mmap mandatory?  I would think this would be defined by the mdev
> > device what access they want to support per region.  We don't want to
> > impose a more complicated interface if the device doesn't require it.  
> I think it's "mmap is preferred, but allowed to fail".
> just like a normal region with MMAP flag on (like bar regions), we also
> allow its mmap to fail, right?

Currently mmap support for any region is optional both from the vendor
driver and the user.  The vendor driver may or may not support mmap of
a region (or subset of region with sparse mmap) and the user may or may
not make use of mmap if it is available.  The question here was whether
this interface requires the vendor driver to support mmap of these
device specific regions.

> > >         device config region: mandatory, holding data of device config
> > >         device memory region: optional, holding data of device memory
> > >         dirty bitmap region: optional, holding bitmap of system memory
> > >                             dirty pages
> > > 
> > > (The reason why four seperate regions are defined is that the unit of mmap
> > > system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> > > control and three mmaped regions for data seems better than one big region
> > > padded and sparse mmaped).  
> > 
> > It's not obvious to me how this is better, a big region isn't padded,
> > there's simply a gap in the file descriptor.  Is having a sub-PAGE_SIZE
> > gap in a file really of any consequence?  Each region beyond the header
> > is more than likely larger than PAGE_SIZE, therefore they can be nicely
> > aligned together.  We still need fields to tell us how much data is
> > available in each area, so another to tell us the start of each area is
> > a minor detail.  And I think we still want to allow drivers to specify
> > which parts of which areas support mmap, so I don't think we're getting
> > away from sparse mmap support.  
> 
> with seperate regions and sub-region type defined, user space can explictly
> know which region is which region after vfio_get_dev_region_info(). along
> with it, user space knows region offset and size. mmap is allowed to fail
> and falls back to normal read/write to the region.
> 
> But with one big region and sparse mmapped subregions (1 data subregion or
> 3 data subregions, whatever), userspace can't tell which subregion is which
> one.

Of course they can, this is part of defining the header structure.  One
region could define a header including config_offset, config_size,
memory_offset, memory_size, dirty_offset, dirty_size.  Notice how Kirti
even uses the same area to support all of these (which leaves some
issues with vendor driver flexibility, but at least shows this is
possible).

> So, if using one big region, I think we need to explictly define
> subregions' sequence (like index 0 is dedicated to control subregion,
> index 1 is for device config data subregion ...). Vendor driver cannot
> freely change the sequence.
> Then keep data offset the same as region->mmaps[i].offset, and data size
> the same as region->mmaps[i].size. (i.e. let actual data starts immediatly
> from first byte of its data subregion)
> Also, mmaps for sparse mmaped subregions are not allowed to fail.

This doesn't make any sense to me, the vendor driver can define the
start and size of each area within the region with simple header
fields.  We don't need fixed sequence fields.  Likewise the sparse mmap
capability for the region can define which of those areas within the
region support mmap.  The mmap can be optional for both vendor driver
and user, just as it is elsewhere.  The header fields defining the
sub-areas can be read-only to the user, the sparse mmap only needs to
match what the vendor driver defines and supports.
 
> With one big region, we also need to consider the case when vendor driver
> does not want the data subregion to be mmaped.
> so, what is the data layout for that case?

Vendor driver defines data_offset and data_size, sparse mmap capability
does not list that area as mmap capable.

> data subregion immedately follows control subregion, or not?

The header needs to begin at offset zero, the layout of the rest is
defined by the vendor driver within this header.  I believe this is
(mostly) implemented in Kirti's version.

> Of course, for this condition, we can specify the data field's start offset
> and size through control region. And we must not expect the data start
> offset in source and target are equal.
> (because the big region's fd_offset
> may vary in source and target. consider the case when both source and
> target have one opregion and one device state region, but source has
> opregion in the first and target has device state region in the first.
> If we think this case is illegal, we must be able to detect it in the first
> place).
> Also, we must keep the start offset and size consistent with the above mmap
> case.

AFAICT, these are all non-issues.  Please look at Kirti's proposal.
The (one) migration region can define a header at offset zero that
allows the vendor driver to define where within that region the data,
config, and dirty bitmap areas are and the sparse mmap capability
defines which of those are mmap capable.  Clearly this migration region
offset within the vfio device file descriptor is independent between
source and target, as are the offsets of the sub-areas within the
migration region.  These all need to be defined as part of the
migration protocol where the source and target implement the same
protocol but have no requirement to be absolutely identical.

> > > kernel device state interface [1]
> > > --------------------------------------
> > > #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> > > #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > > #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2  
> > 
> > If we were to go with this multi-region solution, isn't it evident from
> > the regions exposed that device memory and a dirty bitmap are
> > provided?  Alternatively, I believe Kirti's proposal doesn't require  
> > this distinction between device memory and device config, a device not
> > requiring runtime migration data would simply report no data until the
> > device moved to the stopped state, making it a consistent path for
> > userspace.  Likewise, the dirty bitmap could report a zero page count
> > in the bitmap rather than branching based on device support.  
> If the path in userspace is consistent for device config and device
> memory, there will be many unnecessary calls into the vendor driver to
> get the data size.

Consistency seems like a good thing, it makes code more simple, we
don't behave differently in one case versus another.  If the vendor
reports no data, skip.  It also provides versatility.  Are the "many
unnecessary call[s]" quantifiable?

> > Note that it seems that cap VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY implies
> > VFIO_REGION_SUBTYPE_DEVICE_STATE_DATA_DIRTYBITMAP, but there's no
> > consistency in the naming.
> >   
> > > #define VFIO_DEVICE_STATE_RUNNING 0 
> > > #define VFIO_DEVICE_STATE_STOP 1
> > > #define VFIO_DEVICE_STATE_LOGGING 2  
> > 
> > It looks like these are being defined as bits, since patch 1 talks
> > about RUNNING & LOGGING and STOP & LOGGING.  I think Connie already
> > posted some comments about this.  I'm not sure anything prevents us
> > from defining RUNNING as 1 and STOPPED as 0 so we don't have the
> > polarity flip vs LOGGING though.
> > 
> > The state "STOP & LOGGING" also feels like a strange "device state", if
> > the device is stopped, it's not logging any new state, so I think this
> > is more that the device state is STOP, but the LOGGING feature is
> > active.  Maybe we should consider these independent bits.  LOGGING is
> > active as we stop a device so that we can fetch the last dirtied pages,
> > but disabled as we load the state of the device into the target.
> >   
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BUFFER 1
> > > #define VFIO_DEVICE_DATA_ACTION_SET_BUFFER 2
> > > #define VFIO_DEVICE_DATA_ACTION_GET_BITMAP 3
> > > 
> > > struct vfio_device_state_ctl {
> > > 	__u32 version;		  /* ro */
> > > 	__u32 device_state;       /* VFIO device state, wo */
> > > 	__u32 caps;		 /* ro */
> > >         struct {
> > > 		__u32 action;  /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;    /*rw*/
> > > 	} device_config;  
> > 
> > Patch 1 indicates that to get the config buffer we write GET_BUFFER to
> > action and read from the config region.  The size is previously read
> > and apparently constant.  To set the config buffer, the config region
> > is written followed by writing SET_BUFFER to action.  Why is size
> > listed as read-write?  
> this is the size of config data.
> size of config data <= size of config data region.

Where in the usage protocol does the user WRITE the config data size?

> > Doesn't this protocol also require that the mdev driver consume each
> > full region's worth of host kernel memory for backing pages in
> > anticipation of a rare event like migration?  This might be a strike
> > against separate regions if the driver needs to provide backing pages
> > for 3 separate regions vs 1.  To avoid this runtime overhead, would it
> > be expected that the user only mmap the regions during migration and
> > the mdev driver allocate backing pages on mmap?  Should the mmap be
> > restricted to the LOGGING feature being enabled?  Maybe I'm mistaken in
> > how the mdev driver would back these mmap'd pages.
> >  
> yes, 3 separate regions consume a little more memory than 1 region.
> but it's just a little overhead.
> As in intel's kernel implementation,
> device config region's size is 9M, dirty bitmap region's size is 16k.
> if there is device memory region, its size can be defined as 100M?
> so it's 109M vs 100M ?

But what if it's 100M config with no device memory?  This proposal
requires 100M in-kernel backing due to the definition of the config
region when it could be implemented with significantly less by allowing
a small data area to be read multiple times until a bytes remaining
counter becomes zero.

> > > 	struct {
> > > 		__u32 action;    /* wo, GET_BUFFER or SET_BUFFER */ 
> > > 		__u64 size;     /* rw */  
> > >                 __u64 pos; /*the offset in total buffer of device memory*/  
> > 
> > Patch 1 outlines the protocol here that getting device memory begins
> > with writing the position field, followed by reading from the device
> > memory region.  Setting device memory begins with writing the data to
> > the device memory region, followed by writing the position field.  Why
> > does the user need to have visibility of data position?  This is opaque
> > data to the user, the device should manage how the chunks fit together.
> > 
> > How does the user know when they reach the end?  
> sorry, maybe I didn't explain clearly here.
> 
> device  ________________________________________
> memory  |    |    |////|    |    |    |    |    |
> data:   |____|____|////|____|____|____|____|____|
>                   :pos :
>                   :    :
> device            :____:
> memory            |    |
> region:           |____|
> 
> the whole sequence is like this:
> 
> 1. user space reads device_memory.size
> 2. driver gets device memory's data (full snapshot or dirty data, depending
> on whether it's in the LOGGING state or not), and returns the total size of
> this data. 
> 3. user space finishes reading device_memory.size (which may be >= the
> device memory region's size)
> 
> 4. user space starts a loop like
>   
>    while (pos < total_len) {
>         uint64_t len = region_devmem->size;
> 
>         if (pos + len >= total_len) {
>             len = total_len - pos;
>         }
>         if (vfio_save_data_device_memory_chunk(vdev, f, pos, len)) {
>             return -1;
>         }
> 	pos += len;
>     }
> 
>  vfio_save_data_device_memory_chunk reads each chunk from device memory
>  region by writing GET_BUFFER  to device_memory.action, and pos to
>  device_memory.pos.
> 
> 
> so, each time, userspace gets the whole of the device memory data in one
> pass.
> 
> specifying "pos" is just like the "lseek" before "write".

This could also be implemented as a remaining bytes counter in the
interface where the vendor driver wouldn't rely on the user to manage
the position.  What internal consistency checking is going to protect
the host kernel when the user writes data to the wrong position?  If we
consider the data to be opaque, the vendor driver can embed that sort
of meta data into the data blob the user reads and reassemble it
correctly or generate a consistency failure itself.

> > Bullets 8 and 9 in patch 1 also discuss setting and getting the device
> > memory size, but these aren't well integrated into the protocol for
> > getting and setting the memory buffer.  Is getting the device memory
> > really started by reading the size, which triggers the vendor driver to
> > snapshot the state in an internal buffer which the user then iterates
> > through using GET_BUFFER?  Therefore re-reading the size field could
> > corrupt the data stream?  Wouldn't it be easier to report bytes
> > available and countdown as the user marks them read?  What does
> > position mean when we switch from snapshot to dirty data?  
> when switching to device memory's dirty data, pos means the pos in whole
> dirty data.
> 
> .save_live_pending ==> driver gets dirty data in device memory and returns
> total size.
> 
> .save_live_iterate ==> userspace reads all dirty data from device memory
> region chunk by chunk
> 
> So, in an iteration, all dirty data are saved.
> then in next iteration, dirty data is recalculated.
> 
> > > 	} device_memory;
> > > 	struct {
> > > 		__u64 start_addr; /* wo */
> > > 		__u64 page_nr;   /* wo */
> > > 	} system_memory;
> > > };  
> > 
> > Why is one specified as an address and the other as pages?  Note that  
> Yes, start_addr ==> start pfn is better
> 
> > Kirti's implementation has an optimization to know how many pages are
> > set within a range to avoid unnecessary buffer reads.
> >   
> 
> Let's use start_pfn_all, page_nr_all to represent the start pfn and
> page_nr passed in from qemu .log_sync interface.
> 
> and use start_pfn_i, page_nr_i to the value passed to driver.
> 
> 
> start_pfn_all
>   |         start_pfn_i
>   |         |
>  \ /_______\_/_____________________________
>   |    |    |////|    |    |    |    |    |
>   |____|____|////|____|____|____|____|____|
>             :    :
>             :    :
>             :____:
> bitmap      |    |
> region:     |____|
>            
> 
> 1. Each time QEMU queries dirty bitmap from driver, it passes in
> start_pfn_i, and page_nr_i. (page_nr_i is the maximum number of pages
> whose bits the bitmap region can hold).
> 2. driver queries memory range from start_pfn_i with size page_nr_i.
> 3. driver return a bitmap (if no dirty data, the bitmap is all 0).
> 4. QEMU saves the pages according to the bitmap
> 
> If there's no dirty data found in step 2, step 4 can be skipped.
> (I'll add this check before step 4 in future, thanks)
> but if there's even 1 bit set in the bitmap, no step from 1-4 can be
> skipped.
> 
> Honestly, after reviewing Kirti's implementation, I don't think it's an
> optimization. As in below pseudo code for Kirti's code, I would think the
> copied_pfns corresponds to the page_nr_i in my case. so, the case of
> copied_pfns equaling to 0 is for the tail chunk? don't think it's working..
> 
> write start_pfn to driver
> write page_size  to driver
> write pfn_count to driver
> 
> do {
>     read copied_pfns from driver.
>     if (copied_pfns == 0) {
>        break;
>     }
>    bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned long);
>    buf = get bitmap from driver
>    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>                                            (start_pfn + count) * page_size,
>                                                 copied_pfns);
> 
>      count +=  copied_pfns;
> 
> } while (count < pfn_count);

The intent of Kirti's copied_pfns is clearly to avoid unnecessarily
reading pages from the kernel when nothing has changed.  Perhaps the
implementation still requires work, but I don't see from above how
that's not considered an optimization.

> > > 
> > > Device States
> > > ------------- 
> > > After migration is initialized, it will set device state via writing to
> > > device_state field of control region.
> > > 
> > > Four states are defined for a VFIO device:
> > >         RUNNING, RUNNING & LOGGING, STOP & LOGGING, STOP 
> > > 
> > > RUNNING: In this state, a VFIO device is active and ready to receive
> > >         commands from the device driver.
> > >         It is the default state that a VFIO device enters initially.
> > > 
> > > STOP:  In this state, a VFIO device is deactivated and no longer
> > >        interacts with the device driver.
> > > 
> > > LOGGING: a special state that CANNOT exist independently. It must be
> > >        set alongside with state RUNNING or STOP (i.e. RUNNING & LOGGING,
> > >        STOP & LOGGING).
> > >        Qemu will set LOGGING state on in .save_setup callbacks, then vendor
> > >        driver can start dirty data logging for device memory and system
> > >        memory.
> > >        LOGGING only impacts device/system memory. Outside LOGGING they
> > >        return a whole snapshot; inside LOGGING they return the dirty
> > >        data since the last get operation.
> > >        Device config should be always accessible and return whole config
> > >        snapshot regardless of LOGGING state.
> > >        
> > > Note:
> > > The reason why RUNNING is the default state is that device's active state
> > > must not depend on device state interface.
> > > It is possible that region vfio_device_state_ctl fails to get registered.
> > > In that condition, a device needs to be in the active state by default.
> > > 
> > > Get Version & Get Caps
> > > ----------------------
> > > On migration init phase, qemu will probe the existence of device state
> > > regions of vendor driver, then get version of the device state interface
> > > from the r/w control region.
> > > 
> > > Then it will probe VFIO device's data capability by reading caps field of
> > > control region.
> > >         #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> > >         #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is on, it will save/load data of
> > >         device memory in pre-copy and stop-and-copy phase. The data of
> > >         device memory is held in device memory region.
> > > If VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY is on, it will query dirty pages
> > >         produced by VFIO device during pre-copy and stop-and-copy phase.
> > >         The dirty bitmap of system memory is held in dirty bitmap region.  
> > 
> > As above, these capabilities seem redundant to the existence of the
> > device specific regions in this implementation.
> >  
> seems so :)
> 
> > > If it fails to find the two mandatory regions or the optional data
> > > regions corresponding to the data caps, or if the versions mismatch, it
> > > will set up a migration blocker and disable live migration for the
> > > VFIO device.
> > > 
> > > 
> > > Flows to call device state interface for VFIO live migration
> > > ------------------------------------------------------------
> > > 
> > > Live migration save path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_SAVE_SETUP
> > >  |
> > >  .save_setup callback -->
> > >  get device memory size (whole snapshot size)
> > >  get device memory buffer (whole snapshot data)
> > >  set device state --> VFIO_DEVICE_STATE_RUNNING & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > >  .save_live_pending callback --> get device memory size (dirty data)
> > >  .save_live_iteration callback --> get device memory buffer (dirty data)
> > >  .log_sync callback --> get system memory dirty bitmap
> > >  |
> > > (vcpu stops) --> set device state -->
> > >  VFIO_DEVICE_STATE_STOP & VFIO_DEVICE_STATE_LOGGING
> > >  |
> > > .save_live_complete_precopy callback -->
> > >  get device memory size (dirty data)
> > >  get device memory buffer (dirty data)
> > >  get device config size (whole snapshot size)
> > >  get device config buffer (whole snapshot data)
> > >  |
> > > .save_cleanup callback -->  set device state --> VFIO_DEVICE_STATE_STOP
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > MIGRATION_STATUS_CANCELLED or
> > > MIGRATION_STATUS_FAILED
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > > 
> > > 
> > > Live migration load path:
> > > 
> > > (QEMU LIVE MIGRATION STATE --> DEVICE STATE INTERFACE --> DEVICE STATE)
> > > 
> > > MIGRATION_STATUS_NONE --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > (vcpu stops) --> set device state --> VFIO_DEVICE_STATE_STOP
> > >  |
> > > MIGRATION_STATUS_ACTIVE
> > >  |
> > > .load state callback -->
> > >  set device memory size, set device memory buffer, set device config size,
> > >  set device config buffer
> > >  |
> > > (vcpu starts) --> set device state --> VFIO_DEVICE_STATE_RUNNING
> > >  |
> > > MIGRATION_STATUS_COMPLETED
> > > 
> > > 
> > > 
> > > In source VM side,
> > > In precopy phase,
> > > if a device has VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY on,
> > > qemu will first get whole snapshot of device memory in .save_setup
> > > callback, and then it will get total size of dirty data in device memory in
> > > .save_live_pending callback by reading device_memory.size field of control
> > > region.  
> > 
> > This requires iterative reads of device memory buffer but the protocol
> > is unclear (to me) how the user knows how to do this or interact with
> > the position field. 
> >   
> > > Then in .save_live_iteration callback, it will get buffer of device memory's
> > > dirty data chunk by chunk from device memory region by writing pos &
> > > action (GET_BUFFER) to device_memory.pos & device_memory.action fields of
> > > control region. (size of each chunk is the size of device memory data
> > > region).  
> > 
> > What if there's not enough dirty data to fill the region?  The data is
> > always padded to fill the region?
> >  
> I think dirty data in vendor driver is organized in a format like:
> (addr0, len0, data0, addr1, len1, data1, addr2, len2, data2,....addrN,
> lenN, dataN).
> for full snapshot, it's like (addr0, len0, data0).
> so, to userspace and data region, it doesn't matter whether it's full
> snapshot or dirty data.
> 
> 
> > > .save_live_pending and .save_live_iteration may be called several times in
> > > precopy phase to get dirty data in device memory.
> > > 
> > > If VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY is off, callbacks in precopy phase
> > > like .save_setup, .save_live_pending, .save_live_iteration will not call
> > > vendor driver's device state interface to get data from device memory.
> > 
> > Therefore through the entire precopy phase we have no data from source
> > to target to begin a compatibility check :-\  I think both proposals
> > currently still lack any sort of device compatibility or data
> > versioning check between source and target.  Thanks,  
> I checked the compatibility, though not good enough:)
> 
> in migration_init, vfio_check_devstate_version() checked version from
> kernel with VFIO_DEVICE_STATE_INTERFACE_VERSION in both source and target,
> and in target side, vfio_load_state() checked source side version.
> 
> 
> int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> {
>     ...
>     if (version_id != VFIO_DEVICE_STATE_INTERFACE_VERSION) {
>         return -EINVAL;
>     }
>     ...
> }

But this only checks that both source and target are using the same
migration interface, how do we know that they're compatible devices and
that the vendor data stream is compatible between source and target?
Whether both ends use the same migration interface is potentially not
relevant if the data stream is compatible.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-07 17:44       ` [Qemu-devel] " Alex Williamson
@ 2019-03-07 23:20         ` Tian, Kevin
  -1 siblings, 0 replies; 133+ messages in thread
From: Tian, Kevin @ 2019-03-07 23:20 UTC (permalink / raw)
  To: Alex Williamson, Zhao, Yan Y
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, dgilbert, intel-gvt-dev,
	Liu, Changpeng

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, March 8, 2019 1:44 AM
> > >
> > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > >         stop-and-copy phase.
> > > >         The data of device memory is held in device memory region.
> > > >         Size of device memory is usually larger than that of device
> > > >         memory region. qemu needs to save/load it in chunks of size of
> > > >         device memory region.
> > > >         Not all devices have device memory. IGD, for example, only uses system
> memory.
> > >
> > > It seems a little gratuitous to me that this is a separate region or
> > > that this data is handled separately.  All of this data is opaque to
> > > QEMU, so why do we need to separate it?
> > hi Alex,
> > as the device state interfaces are provided by the kernel, they are
> > expected to meet needs as general as possible. So, do you think there are
> > use cases where user space knows the device well, and wants the kernel
> > to return the desired data back to it?
> > E.g. It just wants to get whole device config data including all mmios,
> > page tables, pci config data...
> > or, It just wants to get current device memory snapshot, not including any
> > dirty data.
> > Or, It just needs the dirty pages in device memory or system memory.
> > With all this accurate query, quite a lot of useful features can be
> > developed in user space.
> >
> > If all of this data is opaque to user app, seems the only use case is
> > for live migration.
> 
> I can certainly appreciate a more versatile interface, but I think
> we're also trying to create the most simple interface we can, with the
> primary target being live migration.  As soon as we start defining this
> type of device memory and that type of device memory, we're going to
> have another device come along that needs yet another because they have
> a slightly different requirement.  Even without that, we're going to
> have vendor drivers implement it differently, so what works for one
> device for a more targeted approach may not work for all devices.  Can
> you enumerate some specific examples of the use cases you imagine your
> design to enable?
> 

Do we want to consider a use case where user space would like to
selectively introspect a portion of the device state (including implicit
state that is not available through PCI regions), and may ask for
capability of direct mapping of selected portion for scanning (e.g.
device memory) instead of always turning on dirty logging on all
device state?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-07 23:20         ` [Qemu-devel] " Tian, Kevin
@ 2019-03-08 16:11           ` Alex Williamson
  -1 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-03-08 16:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, Zhao, Yan Y, dgilbert,
	intel-gvt-dev,

On Thu, 7 Mar 2019 23:20:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, March 8, 2019 1:44 AM  
> > > >  
> > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > >         stop-and-copy phase.
> > > > >         The data of device memory is held in device memory region.
> > > > >         Size of device memory is usually larger than that of device
> > > > >         memory region. qemu needs to save/load it in chunks of size of
> > > > >         device memory region.
> > > > >         Not all devices have device memory. IGD, for example, only uses system
> > memory.  
> > > >
> > > > It seems a little gratuitous to me that this is a separate region or
> > > > that this data is handled separately.  All of this data is opaque to
> > > > QEMU, so why do we need to separate it?  
> > > hi Alex,
> > > as the device state interfaces are provided by the kernel, they are
> > > expected to meet needs as general as possible. So, do you think there are
> > > use cases where user space knows the device well, and wants the kernel
> > > to return the desired data back to it?
> > > E.g. It just wants to get whole device config data including all mmios,
> > > page tables, pci config data...
> > > or, It just wants to get current device memory snapshot, not including any
> > > dirty data.
> > > Or, It just needs the dirty pages in device memory or system memory.
> > > With all this accurate query, quite a lot of useful features can be
> > > developed in user space.
> > >
> > > If all of this data is opaque to user app, seems the only use case is
> > > for live migration.  
> > 
> > I can certainly appreciate a more versatile interface, but I think
> > we're also trying to create the most simple interface we can, with the
> > primary target being live migration.  As soon as we start defining this
> > type of device memory and that type of device memory, we're going to
> > have another device come along that needs yet another because they have
> > a slightly different requirement.  Even without that, we're going to
> > have vendor drivers implement it differently, so what works for one
> > device for a more targeted approach may not work for all devices.  Can
> > you enumerate some specific examples of the use cases you imagine your
> > design to enable?
> >   
> 
> Do we want to consider a use case where user space would like to
> selectively introspect a portion of the device state (including implicit
> state that is not available through PCI regions), and may ask for
> capability of direct mapping of selected portion for scanning (e.g.
> device memory) instead of always turning on dirty logging on all
> device state?

I don't see that a migration interface necessarily lends itself to this
use case.  A migration data stream has no requirement to be user
consumable as anything other than opaque data, there's also no
requirement that it expose state in a form that directly represents the
internal state of the device.  In fact I'm not sure we want to encourage
introspection via this data stream.  If a user knows how to interpret
the data, what prevents them from modifying the data in-flight?  I've
raised the question previously regarding how the vendor driver can
validate the integrity of the migration data stream.  Using the
migration interface to introspect the device certainly suggests an
interface ripe for exploiting any potential weakness in the vendor
driver reassembling that migration stream.  If the user has an mmap to
the actual live working state of the vendor driver, protection in the
hardware seems like the only way you could protect against a malicious
user.  Please be defensive in what is directly exposed to the user and
what safeguards are in place within the vendor driver for validating
incoming data.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-08 16:11           ` [Qemu-devel] " Alex Williamson
@ 2019-03-08 16:21             ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-03-08 16:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	qemu-devel, kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye,
	mlevitsk, pasic, arei.gonglei, felipe, Ken.Xue, Tian, Kevin,
	Zhao, Yan Y, intel-gvt-dev,

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Thu, 7 Mar 2019 23:20:36 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, March 8, 2019 1:44 AM  
> > > > >  
> > > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > > >         stop-and-copy phase.
> > > > > >         The data of device memory is held in device memory region.
> > > > > >         Size of devie memory is usually larger than that of device
> > > > > >         memory region. qemu needs to save/load it in chunks of size of
> > > > > >         device memory region.
> > > > > >         Not all device has device memory. Like IGD only uses system  
> > > memory.  
> > > > >
> > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > that this data is handled separately.  All of this data is opaque to
> > > > > QEMU, so why do we need to separate it?  
> > > > hi Alex,
> > > > as the device state interfaces are provided by kernel, it is expected to
> > > > meet as general needs as possible. So, do you think there are such use
> > > > cases from user space that user space knows well of the device, and
> > > > it wants kernel to return desired data back to it.
> > > > E.g. It just wants to get whole device config data including all mmios,
> > > > page tables, pci config data...
> > > > or, It just wants to get current device memory snapshot, not including any
> > > > dirty data.
> > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > With all this accurate query, quite a lot of useful features can be
> > > > developped in user space.
> > > >
> > > > If all of this data is opaque to user app, seems the only use case is
> > > > for live migration.  
> > > 
> > > I can certainly appreciate a more versatile interface, but I think
> > > we're also trying to create the most simple interface we can, with the
> > > primary target being live migration.  As soon as we start defining this
> > > type of device memory and that type of device memory, we're going to
> > > have another device come along that needs yet another because they have
> > > a slightly different requirement.  Even without that, we're going to
> > > have vendor drivers implement it differently, so what works for one
> > > device for a more targeted approach may not work for all devices.  Can
> > > you enumerate some specific examples of the use cases you imagine your
> > > design to enable?
> > >   
> > 
> > Do we want to consider an use case where user space would like to
> > selectively introspect a portion of the device state (including implicit 
> > state which are not available through PCI regions), and may ask for
> > capability of direct mapping of selected portion for scanning (e.g.
> > device memory) instead of always turning on dirty logging on all
> > device state?
> 
> I don't see that a migration interface necessarily lends itself to this
> use case.  A migration data stream has no requirement to be user
> consumable as anything other than opaque data, there's also no
> requirement that it expose state in a form that directly represents the
> internal state of the device.  In fact I'm not sure we want to encourage
> introspection via this data stream.  If a user knows how to interpret
> the data, what prevents them from modifying the data in-flight?  I've
> raised the question previously regarding how the vendor driver can
> validate the integrity of the migration data stream.  Using the
> migration interface to introspect the device certainly suggests an
> interface ripe for exploiting any potential weakness in the vendor
> driver reassembling that migration stream.  If the user has an mmap to
> the actual live working state of the vendor driver, protection in the
> hardware seems like the only way you could protect against a malicious
> user.  Please be defensive in what is directly exposed to the user and
> what safeguards are in place within the vendor driver for validating
> incoming data.  Thanks,

Hmm; that sounds like a security-by-obscurity answer!

The scripts/analyze-migration.py script will actually dump the
migration stream data in an almost readable format.
So if you properly define the VMState definitions it should be almost
readable; it's occasionally been useful.

I agree that you should be very very careful to validate the incoming
migration stream against:
  a) Corruption
  b) Wrong driver versions
  c) Malicious intent
    c.1) Especially by the guest
    c.2) Or by someone trying to feed you a duff stream
  d) Someone trying to load the VFIO stream into completely the wrong
device.
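(The checklist above can be made concrete with a toy validator. This is a purely illustrative sketch in the style of scripts/analyze-migration.py; the header layout and all field names here are invented for this example, not any actual vendor driver or VFIO format:)

```python
# Illustrative only: a toy validator for an opaque device-state blob.
# The framing (magic, version, device id, CRC of payload) is invented;
# a real vendor driver would define and check its own in-kernel format.
import struct
import zlib

MAGIC = 0x56464D47            # arbitrary magic chosen for this sketch
SUPPORTED_VERSION = 1
HDR = struct.Struct("<IIQI")  # magic, version, device_id, crc32(payload)

def make_blob(payload, device_id):
    """Source side: produce a well-formed blob."""
    return HDR.pack(MAGIC, SUPPORTED_VERSION, device_id,
                    zlib.crc32(payload)) + payload

def validate_blob(blob, expected_device_id):
    """Target side: return the payload if every check passes."""
    if len(blob) < HDR.size:
        raise ValueError("truncated header")              # (a) corruption
    magic, version, dev_id, crc = HDR.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic")                     # (c.2) duff stream
    if version != SUPPORTED_VERSION:
        raise ValueError("wrong driver version")          # (b)
    if dev_id != expected_device_id:
        raise ValueError("stream is for the wrong device")  # (d)
    payload = blob[HDR.size:]
    if zlib.crc32(payload) != crc:
        raise ValueError("payload corrupted or tampered")   # (a)/(c)
    return payload
```

A CRC only catches accidental corruption, of course; defending against (c), deliberate tampering, needs the driver to treat every decoded field as untrusted input.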

Whether the migration interface is the right thing to use for that
inspection hmm; well it might be - if you're trying to debug
your device and need a dump of it's state, then why not?
(I guess you end up with something not dissimilar to what things
like intek_reg_snapshot in intel-gpu-tools does).

Dave

> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-08 16:21             ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2019-03-08 22:02               ` Alex Williamson
  -1 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-03-08 22:02 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	qemu-devel, kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye,
	mlevitsk, pasic, arei.gonglei, felipe, Ken.Xue, Tian, Kevin,
	Zhao, Yan Y, intel-gvt-dev,

On Fri, 8 Mar 2019 16:21:46 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Thu, 7 Mar 2019 23:20:36 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, March 8, 2019 1:44 AM    
> > > > > >    
> > > > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > > > >         stop-and-copy phase.
> > > > > > >         The data of device memory is held in device memory region.
> > > > > > >         Size of devie memory is usually larger than that of device
> > > > > > >         memory region. qemu needs to save/load it in chunks of size of
> > > > > > >         device memory region.
> > > > > > >         Not all device has device memory. Like IGD only uses system    
> > > > memory.    
> > > > > >
> > > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > > that this data is handled separately.  All of this data is opaque to
> > > > > > QEMU, so why do we need to separate it?    
> > > > > hi Alex,
> > > > > as the device state interfaces are provided by kernel, it is expected to
> > > > > meet as general needs as possible. So, do you think there are such use
> > > > > cases from user space that user space knows well of the device, and
> > > > > it wants kernel to return desired data back to it.
> > > > > E.g. It just wants to get whole device config data including all mmios,
> > > > > page tables, pci config data...
> > > > > or, It just wants to get current device memory snapshot, not including any
> > > > > dirty data.
> > > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > > With all this accurate query, quite a lot of useful features can be
> > > > > developped in user space.
> > > > >
> > > > > If all of this data is opaque to user app, seems the only use case is
> > > > > for live migration.    
> > > > 
> > > > I can certainly appreciate a more versatile interface, but I think
> > > > we're also trying to create the most simple interface we can, with the
> > > > primary target being live migration.  As soon as we start defining this
> > > > type of device memory and that type of device memory, we're going to
> > > > have another device come along that needs yet another because they have
> > > > a slightly different requirement.  Even without that, we're going to
> > > > have vendor drivers implement it differently, so what works for one
> > > > device for a more targeted approach may not work for all devices.  Can
> > > > you enumerate some specific examples of the use cases you imagine your
> > > > design to enable?
> > > >     
> > > 
> > > Do we want to consider an use case where user space would like to
> > > selectively introspect a portion of the device state (including implicit 
> > > state which are not available through PCI regions), and may ask for
> > > capability of direct mapping of selected portion for scanning (e.g.
> > > device memory) instead of always turning on dirty logging on all
> > > device state?  
> > 
> > I don't see that a migration interface necessarily lends itself to this
> > use case.  A migration data stream has no requirement to be user
> > consumable as anything other than opaque data, there's also no
> > requirement that it expose state in a form that directly represents the
> > internal state of the device.  In fact I'm not sure we want to encourage
> > introspection via this data stream.  If a user knows how to interpret
> > the data, what prevents them from modifying the data in-flight?  I've
> > raised the question previously regarding how the vendor driver can
> > validate the integrity of the migration data stream.  Using the
> > migration interface to introspect the device certainly suggests an
> > interface ripe for exploiting any potential weakness in the vendor
> > driver reassembling that migration stream.  If the user has an mmap to
> > the actual live working state of the vendor driver, protection in the
> > hardware seems like the only way you could protect against a malicious
> > user.  Please be defensive in what is directly exposed to the user and
> > what safeguards are in place within the vendor driver for validating
> > incoming data.  Thanks,  
> 
> Hmm; that sounds like a security-by-obscurity answer!

Yup, that's fair.  I won't deny that in-kernel vendor driver state
passing through userspace from source to target systems scares me quite
a bit, but defining device introspection as a use case for the
migration interface imposes requirements on the vendor drivers that
don't otherwise exist.  Mdev vendor specific utilities could always be
written to interpret the migration stream to deduce the internal state,
but I think that imposing segregated device memory vs device config
regions with the expectation that internal state can be directly
tracked is beyond the scope of a migration interface.
 
> The scripts/analyze-migration.py script will actually dump the
> migration stream data in an almost readable format.
> So if you properly define the VMState definitions it should be almost
> readable; it's occasionally been useful.

That's true for emulated devices, but I expect an mdev device migration
stream is simply one blob of opaque data followed by another.  We can
impose the protocol that userspace uses to read and write this data
stream from the device, but not the data it contains.
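(To make that distinction concrete: userspace can follow a defined framing protocol while the blobs themselves stay opaque. A minimal sketch, assuming an invented length-prefixed framing that is not the actual VFIO region protocol:)

```python
# Illustrative only: userspace walks a stream of opaque, length-prefixed
# blobs without interpreting their contents.  The 8-byte little-endian
# length prefix is invented framing for this sketch.
import io
import struct

LEN = struct.Struct("<Q")

def write_blobs(stream, blobs):
    """Source side: emit each opaque blob with a length prefix."""
    for blob in blobs:
        stream.write(LEN.pack(len(blob)))
        stream.write(blob)

def read_blobs(stream):
    """Target side: yield each blob; the protocol is known, the data is not."""
    while True:
        hdr = stream.read(LEN.size)
        if not hdr:
            return
        if len(hdr) < LEN.size:
            raise ValueError("truncated length prefix")
        (n,) = LEN.unpack(hdr)
        blob = stream.read(n)
        if len(blob) < n:
            raise ValueError("truncated blob")
        yield blob
```

Userspace can copy, buffer, and forward such a stream faithfully, yet only the vendor driver on each end can say what a blob means.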
 
> I agree that you should be very very careful to validate the incoming
> migration stream against:
>   a) Corruption
>   b) Wrong driver versions
>   c) Malicious intent
>     c.1) Especially by the guest
>     c.2) Or by someone trying to feed you a duff stream
>   d) Someone trying to load the VFIO stream into completely the wrong
> device.

Yes, and with open source mdev vendor drivers we can at least
theoretically audit the reload, but of course we also have proprietary
drivers.  I wonder if we should install the kill switch in advance to
allow users to opt-out of enabling migration at the mdev layer.

> Whether the migration interface is the right thing to use for that
> inspection hmm; well it might be - if you're trying to debug
> your device and need a dump of its state, then why not?
> (I guess you end up with something not dissimilar to what things
> like intel_reg_snapshot in intel-gpu-tools do).

Sure, as above there's nothing preventing mdev specific utilities from
decoding the migration stream, but I begin to have an issue if this
introspective use case imposes requirements on how device state is
represented through the migration interface that don't otherwise
exist.  If we want to define a standard for the actual data from the
device, we'll be at this for years :-\  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-08 22:02               ` [Qemu-devel] " Alex Williamson
@ 2019-03-11  2:33                 ` Tian, Kevin
  -1 siblings, 0 replies; 133+ messages in thread
From: Tian, Kevin @ 2019-03-11  2:33 UTC (permalink / raw)
  To: Alex Williamson, Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	qemu-devel, kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye,
	mlevitsk, pasic, arei.gonglei, felipe, Wang, Zhi A, Zhao, Yan Y,
	intel-gvt-dev, Liu, Changpeng

> From: Alex Williamson
> Sent: Saturday, March 9, 2019 6:03 AM
> 
> On Fri, 8 Mar 2019 16:21:46 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Thu, 7 Mar 2019 23:20:36 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Friday, March 8, 2019 1:44 AM
> > > > > > >
> > > > > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > > > > >         stop-and-copy phase.
> > > > > > > >         The data of device memory is held in device memory region.
> > > > > > > >         Size of devie memory is usually larger than that of device
> > > > > > > >         memory region. qemu needs to save/load it in chunks of size
> of
> > > > > > > >         device memory region.
> > > > > > > >         Not all device has device memory. Like IGD only uses system
> > > > > memory.
> > > > > > >
> > > > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > > > that this data is handled separately.  All of this data is opaque to
> > > > > > > QEMU, so why do we need to separate it?
> > > > > > hi Alex,
> > > > > > as the device state interfaces are provided by kernel, it is expected to
> > > > > > meet as general needs as possible. So, do you think there are such
> use
> > > > > > cases from user space that user space knows well of the device, and
> > > > > > it wants kernel to return desired data back to it.
> > > > > > E.g. It just wants to get whole device config data including all mmios,
> > > > > > page tables, pci config data...
> > > > > > or, It just wants to get current device memory snapshot, not
> including any
> > > > > > dirty data.
> > > > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > > > With all this accurate query, quite a lot of useful features can be
> > > > > > developped in user space.
> > > > > >
> > > > > > If all of this data is opaque to user app, seems the only use case is
> > > > > > for live migration.
> > > > >
> > > > > I can certainly appreciate a more versatile interface, but I think
> > > > > we're also trying to create the most simple interface we can, with the
> > > > > primary target being live migration.  As soon as we start defining this
> > > > > type of device memory and that type of device memory, we're going to
> > > > > have another device come along that needs yet another because they
> have
> > > > > a slightly different requirement.  Even without that, we're going to
> > > > > have vendor drivers implement it differently, so what works for one
> > > > > device for a more targeted approach may not work for all devices.  Can
> > > > > you enumerate some specific examples of the use cases you imagine
> your
> > > > > design to enable?
> > > > >
> > > >
> > > > Do we want to consider a use case where user space would like to
> > > > selectively introspect a portion of the device state (including implicit
> > > > state which is not available through PCI regions), and may ask for
> > > > capability of direct mapping of selected portion for scanning (e.g.
> > > > device memory) instead of always turning on dirty logging on all
> > > > device state?
> > >
> > > I don't see that a migration interface necessarily lends itself to this
> > > use case.  A migration data stream has no requirement to be user
> > > consumable as anything other than opaque data, there's also no
> > > requirement that it expose state in a form that directly represents the
> > > internal state of the device.  In fact I'm not sure we want to encourage
> > > introspection via this data stream.  If a user knows how to interpret
> > > the data, what prevents them from modifying the data in-flight?  I've
> > > raised the question previously regarding how the vendor driver can
> > > validate the integrity of the migration data stream.  Using the
> > > migration interface to introspect the device certainly suggests an
> > > interface ripe for exploiting any potential weakness in the vendor
> > > driver reassembling that migration stream.  If the user has an mmap to
> > > the actual live working state of the vendor driver, protection in the
> > > hardware seems like the only way you could protect against a malicious
> > > user.  Please be defensive in what is directly exposed to the user and
> > > what safeguards are in place within the vendor driver for validating
> > > incoming data.  Thanks,
> >
> > Hmm; that sounds like a security-by-obscurity answer!
> 
> Yup, that's fair.  I won't deny that in-kernel vendor driver state
> passing through userspace from source to target systems scares me quite
> a bit, but defining device introspection as a use case for the
> migration interface imposes requirements on the vendor drivers that
> don't otherwise exist.  Mdev vendor specific utilities could always be
> written to interpret the migration stream to deduce the internal state,
> but I think that imposing segregated device memory vs device config
> regions with the expectation that internal state can be directly
> tracked is beyond the scope of a migration interface.

I'm fine with defining such an interface aimed only at migration-like
usages (e.g. also including fast checkpointing), but I don't buy
the point that the opaque way is more secure than the segregated
style, since the layout can be dumped out anyway by looking at
the source code of the mdev driver.

It would also be better not to include the word 'migration' in the related
interface structure definitions. It's just an opaque, dirty-logged way to
get/set device state; instead of calling it a "migration interface" can we
call it a "dirty-logged state interface"?
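(For illustration only, one way such an opaque, dirty-logged get/set
interface could frame its data is as length-prefixed records that userspace
shuttles without interpreting. Every name and field here is hypothetical,
not part of any proposed UAPI:)

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical framing: an opaque record the vendor driver emits into the
 * state region; userspace copies it verbatim and never interprets 'data'. */
struct dev_state_record {
    uint32_t type;      /* opaque to userspace, meaningful to vendor driver */
    uint32_t len;       /* bytes of 'data' that follow this header */
    uint8_t  data[];    /* vendor-private payload */
};

/* Sanity-check a record header against the bytes remaining in the region:
 * the declared payload must fit, guarding against a truncated or hostile
 * stream.  Returns 1 if the record is well-formed, 0 otherwise. */
int record_fits(const struct dev_state_record *rec, size_t remaining)
{
    if (remaining < sizeof(*rec))
        return 0;
    return rec->len <= remaining - sizeof(*rec);
}
```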

> 
> > The scripts/analyze-migration.py scripts will actually dump the
> > migration stream data in an almost readable format.
> > So if you properly define the VMState definitions it should be almost
> > readable; it's occasionally been useful.
> 
> That's true for emulated devices, but I expect an mdev device migration
> stream is simply one blob of opaque data followed by another.  We can
> impose the protocol that userspace uses to read and write this data
> stream from the device, but not the data it contains.
> 
> > I agree that you should be very very careful to validate the incoming
> > migration stream against:
> >   a) Corruption
> >   b) Wrong driver versions
> >   c) Malicious intent
> >     c.1) Especially by the guest
> >     c.2) Or by someone trying to feed you a duff stream
> >   d) Someone trying to load the VFIO stream into completely the wrong
> > device.
> 
> Yes, and with open source mdev vendor drivers we can at least
> theoretically audit the reload, but of course we also have proprietary
> drivers.  I wonder if we should install the kill switch in advance to
> allow users to opt-out of enabling migration at the mdev layer.
> 
> > Whether the migration interface is the right thing to use for that
> > inspection hmm; well it might be - if you're trying to debug
> > your device and need a dump of its state, then why not?
> > (I guess you end up with something not dissimilar to what things
> > like intel_reg_snapshot in intel-gpu-tools does).
> 
> Sure, as above there's nothing preventing mdev specific utilities from
> decoding the migration stream, but I begin to have an issue if this
> introspective use case imposes requirements on how device state is
> represented through the migration interface that don't otherwise
> exist.  If we want to define a standard for the actual data from the
> device, we'll be at this for years :-\  Thanks,
> 

Introspection is one potential usage behind the mmapped style in
Yan's proposal, but it's not a strong enough justification, since
introspection can also be done the opaque way (just not optimally,
as it always needs to track all the state). We may introduce a new
interface in the future when this becomes a real problem.

But I still don't get your exact concern about the security part. On
versioning, yes, we still haven't worked out a sane way to represent
vendor-specific compatibility requirements. But allowing user
space to modify data through this interface is really no different
from allowing the guest to modify data through the trapped MMIO
interface. The mdev driver should guarantee that operations through
both interfaces can modify only the state associated with the given
mdev instance, without breaking the isolation boundary. The former
then becomes just a batch of operations to be verified in the same
way as if they were done individually through the latter interface.
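(The "batch of verified operations" idea can be sketched like this; the
register-window size and record format are invented for illustration. The
restore path runs each incoming record through exactly the same gate the
trapped-MMIO path applies to a single guest write:)

```c
#include <stddef.h>
#include <stdint.h>

#define MDEV_REG_SPACE 0x1000u  /* hypothetical size of the mdev's register window */

/* One state-modifying operation, whether it arrives as a single trapped
 * MMIO write or as one entry in a restored state batch. */
struct mdev_write {
    uint64_t offset;
    uint32_t value;
};

/* The single validation gate: only 4-byte-aligned writes inside this
 * instance's register window may touch device state. */
static int mdev_write_allowed(const struct mdev_write *op)
{
    return op->offset % 4 == 0 &&
           op->offset <= MDEV_REG_SPACE - 4;
}

/* A restore is then just the same gate applied per record: reject the
 * whole batch on the first operation the MMIO path would also reject. */
int mdev_apply_batch(const struct mdev_write *ops, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!mdev_write_allowed(&ops[i]))
            return 0;   /* not reachable via trapped MMIO either */
    return 1;
}
```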

Thanks
Kevin

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-11  2:33                 ` [Qemu-devel] " Tian, Kevin
@ 2019-03-11 20:19                   ` Alex Williamson
  -1 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-03-11 20:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: cjia, kvm, aik, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	qemu-devel, kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye,
	mlevitsk, pasic, arei.gonglei, felipe, Wang, Zhi A, Zhao, Yan Y,
	Dr. David Alan Gilbert, intel-gvt-dev

On Mon, 11 Mar 2019 02:33:11 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Saturday, March 9, 2019 6:03 AM
> > 
> > On Fri, 8 Mar 2019 16:21:46 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > On Thu, 7 Mar 2019 23:20:36 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Friday, March 8, 2019 1:44 AM  
> > > > > > > >  
> > > > > > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > > > > > >         stop-and-copy phase.
> > > > > > > > >         The data of device memory is held in device memory region.
> > > > > > > > >         Size of device memory is usually larger than that of device
> > > > > > > > >         memory region. qemu needs to save/load it in chunks of size  
> > of  
> > > > > > > > >         device memory region.
> > > > > > > > >         Not all devices have device memory. Like IGD only uses system  
> > > > > > memory.  
> > > > > > > >
> > > > > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > > > > that this data is handled separately.  All of this data is opaque to
> > > > > > > > QEMU, so why do we need to separate it?  
> > > > > > > hi Alex,
> > > > > > > as the device state interfaces are provided by kernel, it is expected to
> > > > > > > meet as general needs as possible. So, do you think there are such  
> > use  
> > > > > > > cases from user space that user space knows well of the device, and
> > > > > > > it wants kernel to return desired data back to it.
> > > > > > > E.g. It just wants to get whole device config data including all mmios,
> > > > > > > page tables, pci config data...
> > > > > > > or, It just wants to get current device memory snapshot, not  
> > including any  
> > > > > > > dirty data.
> > > > > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > > > > With all this accurate query, quite a lot of useful features can be
> > > > > > > developed in user space.
> > > > > > >
> > > > > > > If all of this data is opaque to user app, seems the only use case is
> > > > > > > for live migration.  
> > > > > >
> > > > > > I can certainly appreciate a more versatile interface, but I think
> > > > > > we're also trying to create the most simple interface we can, with the
> > > > > > primary target being live migration.  As soon as we start defining this
> > > > > > type of device memory and that type of device memory, we're going to
> > > > > > have another device come along that needs yet another because they  
> > have  
> > > > > > a slightly different requirement.  Even without that, we're going to
> > > > > > have vendor drivers implement it differently, so what works for one
> > > > > > device for a more targeted approach may not work for all devices.  Can
> > > > > > you enumerate some specific examples of the use cases you imagine  
> > your  
> > > > > > design to enable?
> > > > > >  
> > > > >
> > > > > Do we want to consider a use case where user space would like to
> > > > > selectively introspect a portion of the device state (including implicit
> > > > > state which is not available through PCI regions), and may ask for
> > > > > capability of direct mapping of selected portion for scanning (e.g.
> > > > > device memory) instead of always turning on dirty logging on all
> > > > > device state?  
> > > >
> > > > I don't see that a migration interface necessarily lends itself to this
> > > > use case.  A migration data stream has no requirement to be user
> > > > consumable as anything other than opaque data, there's also no
> > > > requirement that it expose state in a form that directly represents the
> > > > internal state of the device.  In fact I'm not sure we want to encourage
> > > > introspection via this data stream.  If a user knows how to interpret
> > > > the data, what prevents them from modifying the data in-flight?  I've
> > > > raised the question previously regarding how the vendor driver can
> > > > validate the integrity of the migration data stream.  Using the
> > > > migration interface to introspect the device certainly suggests an
> > > > interface ripe for exploiting any potential weakness in the vendor
> > > > driver reassembling that migration stream.  If the user has an mmap to
> > > > the actual live working state of the vendor driver, protection in the
> > > > hardware seems like the only way you could protect against a malicious
> > > > user.  Please be defensive in what is directly exposed to the user and
> > > > what safeguards are in place within the vendor driver for validating
> > > > incoming data.  Thanks,  
> > >
> > > Hmm; that sounds like a security-by-obscurity answer!  
> > 
> > Yup, that's fair.  I won't deny that in-kernel vendor driver state
> > passing through userspace from source to target systems scares me quite
> > a bit, but defining device introspection as a use case for the
> > migration interface imposes requirements on the vendor drivers that
> > don't otherwise exist.  Mdev vendor specific utilities could always be
> > written to interpret the migration stream to deduce the internal state,
> > but I think that imposing segregated device memory vs device config
> > regions with the expectation that internal state can be directly
> > tracked is beyond the scope of a migration interface.  
> 
> I'm fine with defining such an interface aimed only at migration-like
> usages (e.g. also including fast checkpointing), but I don't buy
> the point that the opaque way is more secure than the segregated
> style, since the layout can be dumped out anyway by looking at
> the source code of the mdev driver.

I think I've fully conceded any notion of security by obscurity towards
opaque data already, but segregating types of device data still seems
to unnecessarily impose a usage model on the vendor driver that I think
we should try to avoid.  The migration interface should define the
protocol through which userspace can save and restore the device state,
not impose how the vendor driver exposes or manages that state.  Also, I
got the impression (perhaps incorrectly) that you were trying to mmap
live data to userspace, which would allow not only saving the state,
but also unchecked state modification by userspace. I think we want
more of a producer/consumer model of the state where consuming state
also involves at least some degree of sanity or consistency checking.
Let's not forget too that we're obviously dealing with non-open-source
drivers in the mdev universe as well.
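(A minimal sketch of that producer/consumer shape, with the on-the-wire
record format invented purely for illustration: the consumer never maps
live state, it walks a byte stream and refuses anything inconsistent
before acting on it:)

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-the-wire framing for one opaque state chunk. */
struct chunk_hdr {
    uint32_t magic;     /* fixed tag, a cheap corruption check */
    uint32_t len;       /* payload bytes following the header */
};

#define CHUNK_MAGIC 0x56464947u  /* arbitrary tag for this sketch */

/* Consume a full stream, validating each chunk before accepting it.
 * Returns the number of chunks accepted, or -1 on the first
 * inconsistency (truncated header, bad magic, or a declared length
 * that overruns the buffer). */
long consume_stream(const uint8_t *buf, size_t size)
{
    long accepted = 0;
    size_t off = 0;

    while (off < size) {
        struct chunk_hdr hdr;

        if (size - off < sizeof(hdr))
            return -1;                    /* truncated header */
        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.magic != CHUNK_MAGIC)
            return -1;                    /* corrupted or hostile stream */
        if (hdr.len > size - off - sizeof(hdr))
            return -1;                    /* payload overruns the buffer */
        /* ... hand hdr.len payload bytes to the vendor-specific loader ... */
        off += sizeof(hdr) + hdr.len;
        accepted++;
    }
    return accepted;
}
```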
 
> It would also be better not to include the word 'migration' in the related
> interface structure definitions. It's just an opaque, dirty-logged way to
> get/set device state; instead of calling it a "migration interface" can we
> call it a "dirty-logged state interface"?

I think we're talking about the color of the interface now ;)

> > > The scripts/analyze-migration.py scripts will actually dump the
> > > migration stream data in an almost readable format.
> > > So if you properly define the VMState definitions it should be almost
> > > readable; it's occasionally been useful.  
> > 
> > That's true for emulated devices, but I expect an mdev device migration
> > stream is simply one blob of opaque data followed by another.  We can
> > impose the protocol that userspace uses to read and write this data
> > stream from the device, but not the data it contains.
> >   
> > > I agree that you should be very very careful to validate the incoming
> > > migration stream against:
> > >   a) Corruption
> > >   b) Wrong driver versions
> > >   c) Malicious intent
> > >     c.1) Especially by the guest
> > >     c.2) Or by someone trying to feed you a duff stream
> > >   d) Someone trying to load the VFIO stream into completely the wrong
> > > device.  
> > 
> > Yes, and with open source mdev vendor drivers we can at least
> > theoretically audit the reload, but of course we also have proprietary
> > drivers.  I wonder if we should install the kill switch in advance to
> > allow users to opt-out of enabling migration at the mdev layer.
> >   
> > > Whether the migration interface is the right thing to use for that
> > > inspection hmm; well it might be - if you're trying to debug
> > > your device and need a dump of its state, then why not?
> > > (I guess you end up with something not dissimilar to what things
> > > like intel_reg_snapshot in intel-gpu-tools does).  
> > 
> > Sure, as above there's nothing preventing mdev specific utilities from
> > decoding the migration stream, but I begin to have an issue if this
> > introspective use case imposes requirements on how device state is
> > represented through the migration interface that don't otherwise
> > exist.  If we want to define a standard for the actual data from the
> > device, we'll be at this for years :-\  Thanks,
> >   
> 
> Introspection is one potential usage behind the mmapped style in
> Yan's proposal, but it's not a strong enough justification, since
> introspection can also be done the opaque way (just not optimally,
> as it always needs to track all the state). We may introduce a new
> interface in the future when this becomes a real problem.
> 
> But I still don't get your exact concern about the security part. On
> versioning, yes, we still haven't worked out a sane way to represent
> vendor-specific compatibility requirements. But allowing user
> space to modify data through this interface is really no different
> from allowing the guest to modify data through the trapped MMIO
> interface. The mdev driver should guarantee that operations through
> both interfaces can modify only the state associated with the given
> mdev instance, without breaking the isolation boundary. The former
> then becomes just a batch of operations to be verified in the same
> way as if they were done individually through the latter interface. 

It seems like you're assuming a working model for the vendor driver and
the data entering and exiting through this interface.  The vendor
drivers can expose data any way that they want.  All we need to do is
imagine that the migration data stream includes an array index count
somewhere which the user could modify to trigger the in-kernel vendor
driver to allocate an absurd array size and DoS the target.  This is
probably the most simplistic attack; a malicious user who knows the
state machine of the vendor driver could possibly trick it into
providing host kernel data.  We're not necessarily only conveying state that the
user already has access to via this interface, the vendor driver may
include non-visible internal state as well.  Even the state that is
user accessible is being pushed into the vendor driver via an alternate
path from the user mediation we have on the existing paths.
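(The array-count attack described above reduces to a missing bound check.
A sketch of the defensive pattern a vendor driver's load path would need,
with the table type and limit invented for illustration:)

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-device state: a table whose size the migration
 * stream declares.  The cap reflects what the device can actually
 * have, not what the stream claims. */
#define MAX_CTX_ENTRIES 4096u

struct ctx_entry { uint64_t gpa; uint64_t flags; };

/* Load path: the entry count comes from an untrusted stream, so it is
 * validated against the device's real maximum before any allocation.
 * Absurd counts get NULL (no allocation at all) instead of letting
 * the stream size the allocation and DoS the target. */
struct ctx_entry *alloc_ctx_table(uint64_t count_from_stream)
{
    if (count_from_stream == 0 || count_from_stream > MAX_CTX_ENTRIES)
        return NULL;
    return calloc(count_from_stream, sizeof(struct ctx_entry));
}
```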

On the other hand, if your assertion that an incoming migration is
nothing more than a batch of operations through existing interfaces to
the device, then maybe this migration interface should be read-only to
generate an interpreted series of operations to the device.  I expect
we wouldn't get terribly far with such an approach though.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
@ 2019-03-11 20:19                   ` Alex Williamson
  0 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-03-11 20:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Dr. David Alan Gilbert, cjia, kvm, aik,
	Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, Zhao, Yan Y, intel-gvt-dev,
	Liu, Changpeng, cohuck, Wang, Zhi A, jonathan.davies

On Mon, 11 Mar 2019 02:33:11 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Saturday, March 9, 2019 6:03 AM
> > 
> > On Fri, 8 Mar 2019 16:21:46 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > On Thu, 7 Mar 2019 23:20:36 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Friday, March 8, 2019 1:44 AM  
> > > > > > > >  
> > > > > > > > >         This kind of data needs to be saved / loaded in pre-copy and
> > > > > > > > >         stop-and-copy phase.
> > > > > > > > >         The data of device memory is held in device memory region.
> > > > > > > > >         Size of devie memory is usually larger than that of device
> > > > > > > > >         memory region. qemu needs to save/load it in chunks of size  
> > of  
> > > > > > > > >         device memory region.
> > > > > > > > >         Not all device has device memory. Like IGD only uses system  
> > > > > > memory.  
> > > > > > > >
> > > > > > > > It seems a little gratuitous to me that this is a separate region or
> > > > > > > > that this data is handled separately.  All of this data is opaque to
> > > > > > > > QEMU, so why do we need to separate it?  
> > > > > > > hi Alex,
> > > > > > > as the device state interfaces are provided by kernel, it is expected to
> > > > > > > meet as general needs as possible. So, do you think there are such  
> > use  
> > > > > > > cases from user space that user space knows well of the device, and
> > > > > > > it wants kernel to return desired data back to it.
> > > > > > > E.g. It just wants to get whole device config data including all mmios,
> > > > > > > page tables, pci config data...
> > > > > > > or, It just wants to get current device memory snapshot, not  
> > including any  
> > > > > > > dirty data.
> > > > > > > Or, It just needs the dirty pages in device memory or system memory.
> > > > > > > With all this accurate query, quite a lot of useful features can be
> > > > > > > developped in user space.
> > > > > > >
> > > > > > > If all of this data is opaque to user app, seems the only use case is
> > > > > > > for live migration.  
> > > > > >
> > > > > > I can certainly appreciate a more versatile interface, but I think
> > > > > > we're also trying to create the most simple interface we can, with the
> > > > > > primary target being live migration.  As soon as we start defining this
> > > > > > type of device memory and that type of device memory, we're going to
> > > > > > have another device come along that needs yet another because they  
> > have  
> > > > > > a slightly different requirement.  Even without that, we're going to
> > > > > > have vendor drivers implement it differently, so what works for one
> > > > > > device for a more targeted approach may not work for all devices.  Can
> > > > > > you enumerate some specific examples of the use cases you imagine  
> > your  
> > > > > > design to enable?
> > > > > >  
> > > > >
> > > > > Do we want to consider an use case where user space would like to
> > > > > selectively introspect a portion of the device state (including implicit
> > > > > state which are not available through PCI regions), and may ask for
> > > > > capability of direct mapping of selected portion for scanning (e.g.
> > > > > device memory) instead of always turning on dirty logging on all
> > > > > device state?  
> > > >
> > > > I don't see that a migration interface necessarily lends itself to this
> > > > use case.  A migration data stream has no requirement to be user
> > > > consumable as anything other than opaque data, there's also no
> > > > requirement that it expose state in a form that directly represents the
> > > > internal state of the device.  In fact I'm not sure we want to encourage
> > > > introspection via this data stream.  If a user knows how to interpret
> > > > the data, what prevents them from modifying the data in-flight?  I've
> > > > raised the question previously regarding how the vendor driver can
> > > > validate the integrity of the migration data stream.  Using the
> > > > migration interface to introspect the device certainly suggests an
> > > > interface ripe for exploiting any potential weakness in the vendor
> > > > driver reassembling that migration stream.  If the user has an mmap to
> > > > the actual live working state of the vendor driver, protection in the
> > > > hardware seems like the only way you could protect against a malicious
> > > > user.  Please be defensive in what is directly exposed to the user and
> > > > what safeguards are in place within the vendor driver for validating
> > > > incoming data.  Thanks,  
> > >
> > > Hmm; that sounds like a security-by-obscurity answer!  
> > 
> > Yup, that's fair.  I won't deny that in-kernel vendor driver state
> > passing through userspace from source to target systems scares me quite
> > a bit, but defining device introspection as a use case for the
> > migration interface imposes requirements on the vendor drivers that
> > don't otherwise exist.  Mdev vendor specific utilities could always be
> > written to interpret the migration stream to deduce the internal state,
> > but I think that imposing segregated device memory vs device config
> > regions with the expectation that internal state can be directly
> > tracked is beyond the scope of a migration interface.  
> 
> I'm fine with defining such an interface aimed only at migration-like
> usages (e.g. also including fast check-pointing), but I didn't buy
> the point that such an opaque way is more secure than the segregated
> style, since the layout can anyway be dumped out by looking at the
> source code of the mdev driver.

I think I've fully conceded any notion of security by obscurity towards
opaque data already, but segregating types of device data still seems
to unnecessarily impose a usage model on the vendor driver that I think
we should try to avoid.  The migration interface should define the
protocol through which userspace can save and restore the device state,
not impose how the vendor driver exposes or manages that state.  Also, I
got the impression (perhaps incorrectly) that you were trying to mmap
live data to userspace, which would allow not only saving the state,
but also unchecked state modification by userspace. I think we want
more of a producer/consumer model of the state where consuming state
also involves at least some degree of sanity or consistency checking.
Let's not forget too that we're obviously dealing with non-open source
drivers in the mdev universe as well.
 
> Also better we don't include any 'migration' word in related interface
> structure definition. It's just an opaque/dirty-logged way of get/set
> device state, e.g. instead of calling it "migration interface" can we
> call it "dirty-logged state interface"?

I think we're talking about the color of the interface now ;)

> > > The scripts/analyze-migration.py scripts will actually dump the
> > > migration stream data in an almost readable format.
> > > So if you properly define the VMState definitions it should be almost
> > > readable; it's occasionally been useful.  
> > 
> > That's true for emulated devices, but I expect an mdev device migration
> > stream is simply one blob of opaque data followed by another.  We can
> > impose the protocol that userspace uses to read and write this data
> > stream from the device, but not the data it contains.
> >   
> > > I agree that you should be very very careful to validate the incoming
> > > migration stream against:
> > >   a) Corruption
> > >   b) Wrong driver versions
> > >   c) Malicious intent
> > >     c.1) Especially by the guest
> > >     c.2) Or by someone trying to feed you a duff stream
> > >   d) Someone trying load the VFIO stream into completely the wrong
> > > device.  
> > 
> > Yes, and with open source mdev vendor drivers we can at least
> > theoretically audit the reload, but of course we also have proprietary
> > drivers.  I wonder if we should install the kill switch in advance to
> > allow users to opt-out of enabling migration at the mdev layer.
> >   
> > > Whether the migration interface is the right thing to use for that
> > > inspection, hmm; well it might be - if you're trying to debug
> > > your device and need a dump of its state, then why not?
> > > (I guess you end up with something not dissimilar to what things
> > > like intel_reg_snapshot in intel-gpu-tools does).  
> > 
> > Sure, as above there's nothing preventing mdev specific utilities from
> > decoding the migration stream, but I begin to have an issue if this
> > introspective use case imposes requirements on how device state is
> > represented through the migration interface that don't otherwise
> > exist.  If we want to define a standard for the actual data from the
> > device, we'll be at this for years :-\  Thanks,
> >   
> 
> Introspection is one potential usage when thinking about the mmapped
> style in Yan's proposal, but it's not strong enough, since introspection
> can also be done the opaque way (just not optimally, meaning one always
> needs to track all the states). We may introduce a new interface in the
> future when it becomes a real problem.
> 
> But I still didn't get your exact concern about the security part. For
> versioning, yes, we still haven't worked out a sane way to represent
> vendor-specific compatibility requirements. But allowing user
> space to modify data through this interface has really no difference
> from allowing the guest to modify data through the trapped MMIO interface.
> The mdev driver should guarantee that operations through both interfaces
> can modify only the state associated with the said mdev instance,
> w/o breaking the isolation boundary. Then the former becomes just
> a batch of operations to be verified in the same way as if they were
> done individually through the latter interface.

It seems like you're assuming a working model for the vendor driver and
the data entering and exiting through this interface.  The vendor
drivers can expose data any way that they want.  All we need to do is
imagine that the migration data stream includes an array index count
somewhere which the user could modify to trigger the in-kernel vendor
driver to allocate an absurd array size and DoS the target.  This is
probably the most simplistic attack, possibly knowing the state machine
of the vendor driver a malicious user could trick it into providing
host kernel data.  We're not necessarily only conveying state that the
user already has access to via this interface, the vendor driver may
include non-visible internal state as well.  Even the state that is
user accessible is being pushed into the vendor driver via an alternate
path from the user mediation we have on the existing paths.

On the other hand, if your assertion that an incoming migration is
nothing more than a batch of operations through existing interfaces to
the device, then maybe this migration interface should be read-only to
generate an interpreted series of operations to the device.  I expect
we wouldn't get terribly far with such an approach though.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-11 20:19                   ` [Qemu-devel] " Alex Williamson
@ 2019-03-12  2:48                     ` Tian, Kevin
  -1 siblings, 0 replies; 133+ messages in thread
From: Tian, Kevin @ 2019-03-12  2:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	qemu-devel, kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye,
	mlevitsk, pasic, arei.gonglei, felipe, Wang, Zhi A, Zhao, Yan Y,
	Dr. David Alan Gilbert, intel-gvt-dev

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, March 12, 2019 4:19 AM
> On Mon, 11 Mar 2019 02:33:11 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
[...]
> 
> I think I've fully conceded any notion of security by obscurity towards
> opaque data already, but segregating types of device data still seems
> to unnecessarily impose a usage model on the vendor driver that I think
> we should try to avoid.  The migration interface should define the
> protocol through which userspace can save and restore the device state,
> not impose how the vendor driver exposes or manages that state.  Also, I
> got the impression (perhaps incorrectly) that you were trying to mmap
> live data to userspace, which would allow not only saving the state,
> but also unchecked state modification by userspace. I think we want
> more of a producer/consumer model of the state where consuming state
> also involves at least some degree of sanity or consistency checking.
> Let's not forget too that we're obviously dealing with non-open source
> driver in the mdev universe as well.

OK. I think for this part we are in agreement - as long as the goal of
this interface is clearly defined as such way. :-)

[...]
> > But I still didn't get your exact concern about security part. For
> > version yes we still haven't worked out a sane way to represent
> > vendor-specific compatibility requirement. But allowing user
> > space to modify data through this interface has really no difference
> > from allowing guest to modify data through trapped MMIO interface.
> > mdev driver should guarantee that operations through both interfaces
> > can modify only the state associated with the said mdev instance,
> > w/o breaking the isolation boundary. Then the former becomes just
> > a batch of operations to be verified in the same way as if they are
> > done individually through the latter interface.
> 
> It seems like you're assuming a working model for the vendor driver and
> the data entering and exiting through this interface.  The vendor
> drivers can expose data any way that they want.  All we need to do is
> imagine that the migration data stream includes an array index count
> somewhere which the user could modify to trigger the in-kernel vendor
> driver to allocate an absurd array size and DoS the target.  This is
> probably the most simplistic attack, possibly knowing the state machine
> of the vendor driver a malicious user could trick it into providing
> host kernel data.  We're not necessarily only conveying state that the
> user already has access to via this interface, the vendor driver may
> include non-visible internal state as well.  Even the state that is
> user accessible is being pushed into the vendor driver via an alternate
> path from the user mediation we have on the existing paths.

Then I don't know how this concern can be effectively addressed,
since you assume vendor drivers are not trusted here. And why do
you trust vendor drivers to mediate the existing path but not this
alternative one? Non-visible internal states just mean more stuff
to be carefully scrutinized, which does not essentially create a
conceptual difference in trust level.

Or can this concern be partially mitigated if we create some
test cases which poke random data through the new interface,
and mark vendor drivers which pass such tests as trusted? Then
there is also an open question of who should be in charge of such a
certification process...

Thanks
Kevin

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-07 17:44       ` [Qemu-devel] " Alex Williamson
@ 2019-03-12  2:57         ` Zhao Yan
  -1 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-03-12  2:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, Tian, Kevin, dgilbert,
	intel-gvt-dev,

hi Alex
thanks for your reply.

So, if we choose migration data to be userspace-opaque, do you think the
sequence below is the right behavior for the vendor driver to follow:

1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
vendor driver,  vendor driver should reject and return 0.

2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
driver,
   a. vendor driver should first query a whole snapshot of device memory
   (let's use this term to represent device's standalone memory for now),
   b. vendor driver returns a chunk of the data just queried to userspace,
   while recording the current position in the data.
   c. when vendor driver finds all data just queried has finished transmitting
   to userspace, it queries only dirty data in device memory from then on.
   d. vendor driver returns a chunk of the data just queried (this time dirty
   data) to userspace while recording the current position in the data.
   e. if all data is transmitted to userspace and GET_BUFFERs still come from
   userspace, vendor driver starts another round of dirty data query.

3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
driver,
   a. if vendor driver finds there's previously untransmitted data, it returns
   that until all of it is transmitted.
   b. vendor driver then queries dirty data again and transmits it.
   c. at last, vendor driver queries device config data (which has to be
   queried last and sent once) and transmits it.


For bullet 1, if the LOGGING state is first set and migration then aborts,
the vendor driver has to be able to detect that condition. So seemingly, the
vendor driver has to know more of qemu's migration state, like migration
being called and failing. Do you think that's acceptable?


Thanks
Yan

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-12  2:57         ` [Qemu-devel] " Zhao Yan
  (?)
@ 2019-03-13  1:13         ` Zhao Yan
  2019-03-13 19:14           ` Alex Williamson
  -1 siblings, 1 reply; 133+ messages in thread
From: Zhao Yan @ 2019-03-13  1:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Wang, Zhi A, Tian, Kevin, dgilbert,
	intel-gvt-dev,

hi Alex
Any comments to the sequence below?

Actually we have some concerns and suggestions about userspace-opaque migration
data.

1. if data is opaque to userspace, the kernel interface must be tightly bound
to migration.
   e.g. vendor driver has to know that state (running + not logging) should not
   return any data, and state (running + logging) should return the whole
   snapshot first and dirty data later. it also has to know qemu migration will
   not call GET_BUFFER in state (running + not logging); otherwise, it has
   to adjust its behavior.

2. vendor driver cannot ensure userspace gets all the data it intends to
save in the pre-copy phase.
  e.g. in the stop-and-copy phase, vendor driver has to first check and send
  data left over from the previous phase.
 

3. if the whole sequence is tightly bound to live migration, can we remove the
logging state? what about adding two states, migrate-in and migrate-out?
so there are four states: running, stopped, migrate-in, and migrate-out.
   migrate-out is for the source side when migration starts. together with
   states running and stopped, it can substitute for state logging.
   migrate-in is for the target side.


Thanks
Yan

On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> hi Alex
> thanks for your reply.
> 
> So, if we choose migration data to be userspace opaque, do you think below
> sequence is the right behavior for vendor driver to follow:
> 
> 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> vendor driver,  vendor driver should reject and return 0.
> 
> 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> driver,
> >    a. vendor driver should first query a whole snapshot of device memory
> >    (let's use this term to represent device's standalone memory for now),
> >    b. vendor driver returns a chunk of the data just queried to userspace,
> >    while recording the current position in the data.
> >    c. when vendor driver finds all data just queried has finished
> >    transmitting to userspace, it queries only dirty data in device memory.
> >    d. vendor driver returns a chunk of the data just queried (this time
> >    dirty data) to userspace while recording the current position in the data.
> >    e. if all data is transmitted to userspace and GET_BUFFERs still come
> >    from userspace, vendor driver starts another round of dirty data query.
> > 
> > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > driver,
> >    a. if vendor driver finds there's previously untransmitted data, it
> >    returns that until all of it is transmitted.
> >    b. vendor driver then queries dirty data again and transmits it.
> >    c. at last, vendor driver queries device config data (which has to be
> >    queried last and sent once) and transmits it.
> 
> 
> > for the first bullet, if the LOGGING state is first set and migration
> > aborts then, vendor driver has to be able to detect that condition. so
> > seemingly, vendor driver has to know more of qemu's migration state, like
> > migration being called and failing. Do you think that's acceptable?
> 
> 
> Thanks
> Yan
> 
> 

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-13  1:13         ` Zhao Yan
@ 2019-03-13 19:14           ` Alex Williamson
  2019-03-14  1:12             ` Zhao Yan
  0 siblings, 1 reply; 133+ messages in thread
From: Alex Williamson @ 2019-03-13 19:14 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Wang, Zhi A, Tian, Kevin, dgilbert,
	intel-gvt-dev

On Tue, 12 Mar 2019 21:13:01 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> hi Alex
> Any comments to the sequence below?
> 
> Actually we have some concerns and suggestions about userspace-opaque migration
> data.
> 
> 1. if data is opaque to userspace, kernel interface must be tightly bound to
> migration. 
>    e.g. vendor driver has to know state (running + not logging) should not
>    return any data, and state (running + logging) should return whole
>    snapshot first and dirty later. it also has to know qemu migration will
>    not call GET_BUFFER in state (running + not logging), otherwise, it has
>    to adjust its behavior.

This all just sounds like defining the protocol we expect with the
interface.  For instance if we define a session as beginning when
logging is enabled and ending when the device is stopped and the
interface reports no more data is available, then we can state that any
partial accumulation of data is incomplete relative to migration.  If
userspace wants to initiate a new migration stream, they can simply
toggle logging.  How the vendor driver provides the data during the
session is not defined, but beginning the session with a snapshot
followed by repeated iterations of dirtied data is certainly a valid
approach.

> 2. vendor driver cannot ensure userspace get all the data it intends to
> save in pre-copy phase.
>   e.g. in stop-and-copy phase, vendor driver has to first check and send
>   data in previous phase.

First, I don't think the device has control of when QEMU switches from
pre-copy to stop-and-copy, the protocol needs to support that
transition at any point.  However, it seems a simple data-available
counter provides an indication of when it might be optimal to make such
a transition.  If a vendor driver follows a scheme as above, the
available data counter would indicate a large value, the entire initial
snapshot of the device.  As the migration continues and pages are
dirtied, the device would reach a steady state amount of data
available, depending on the guest activity.  This could indicate to the
user to stop the device.  The migration stream would not be considered
completed until the available data counter reaches zero while the
device is in the stopped|logging state.

> 3. if all the sequence is tightly bound to live migration, can we remove the
> logging state? what about adding two states migrate-in and migrate-out?
> so there are four states: running, stopped, migrate-in, migrate-out.
>    migrate-out is for source side when migration starts. together with
>    state running and stopped, it can substitute state logging.
>    migrate-in is for target side.

In fact, Kirti's implementation specifies a data direction, but I think
we still need logging to indicate sessions.  I'd also assume that
logging implies some overhead for the vendor driver.

> On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > hi Alex
> > thanks for your reply.
> > 
> > So, if we choose migration data to be userspace opaque, do you think below
> > sequence is the right behavior for vendor driver to follow:
> > 
> > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > vendor driver,  vendor driver should reject and return 0.

What would this state mean otherwise?  If we're not logging then it
should not be expected that we can construct dirtied data from a
previous read of the state before logging was enabled (it would be
outside of the "session").  So at best this is an incomplete segment of
the initial snapshot of the device, but that presumes how the vendor
driver constructs the data.  I wouldn't necessarily mandate the vendor
driver reject it, but I think we should consider it undefined and
vendor specific relative to the migration interface.

> > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > driver,
> >    a. vendor driver should first query a whole snapshot of device memory
> >    (let's use this term to represent device's standalone memory for now),
> >    b. vendor driver returns a chunk of data just queried to userspace,
> >    while recording current pos in data.
> >    c. vendor driver finds all data just queried is finished transmitting to
> >    userspace, and queries only dirty data in device memory now.
> >    d. vendor driver returns a chunk of data just queried (this time dirty
> >    data) to userspace while recording current pos in data
> >    e. if all data is transmitted to userspace and still GET_BUFFERs come from
> >    userspace, vendor driver starts another round of dirty data query.

This is a valid vendor driver approach, but it's outside the scope of
the interface definition.  A vendor driver could also decide to not
provide any data until both stopped and logging are set and then
provide a fixed, final snapshot.  The interface supports either
approach by defining the protocol to interact with it.

> > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > driver,
> >    a. if vendor driver finds there's previously untransmitted data, returns
> >    them until all transmitted.
> >    b. vendor driver then queries dirty data again and transmits them.
> >    c. at last, vendor driver queries device config data (which has to be
> >    queried at last and sent once) and transmits them.

This seems broken, the vendor driver is presuming the user intentions.
If logging is unset, we return to bullet 1, reading data is undefined
and vendor specific.  It's outside of the session.

> > for the first bullet, if LOGGING state is first set and migration aborts
> > then,  vendor driver has to be able to detect that condition. so seemingly,
> > vendor driver has to know more of qemu's migration state, like migration
> > being called and failing. Do you think that's acceptable?

If migration aborts, logging is cleared and the device continues
operation.  If a new migration is started, the session is initiated by
enabling logging.  Sound reasonable?  Thanks,

Alex

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-12  2:48                     ` [Qemu-devel] " Tian, Kevin
  (?)
@ 2019-03-13 19:57                     ` Alex Williamson
  -1 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-03-13 19:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: cjia, kvm, aik, Zhengxiao.zx@Alibaba-inc.com, shuangtai.tst,
	qemu-devel, kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye,
	mlevitsk, pasic, arei.gonglei, felipe, Wang, Zhi A, Zhao, Yan Y,
	Dr. David Alan Gilbert, intel-gvt-dev

On Tue, 12 Mar 2019 02:48:39 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, March 12, 2019 4:19 AM
> > On Mon, 11 Mar 2019 02:33:11 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> [...]
> > 
> > I think I've fully conceded any notion of security by obscurity towards
> > opaque data already, but segregating types of device data still seems
> > to unnecessarily impose a usage model on the vendor driver that I think
> > we should try to avoid.  The migration interface should define the
> > protocol through which userspace can save and restore the device state,
> > not impose how the vendor driver exposes or manages that state.  Also, I
> > got the impression (perhaps incorrectly) that you were trying to mmap
> > live data to userspace, which would allow not only saving the state,
> > but also unchecked state modification by userspace. I think we want
> > more of a producer/consumer model of the state where consuming state
> > also involves at least some degree of sanity or consistency checking.
> > Let's not forget too that we're obviously dealing with non-open source
> > driver in the mdev universe as well.  
> 
> OK. I think for this part we are in agreement - as long as the goal of
> this interface is clearly defined as such way. :-)
> 
> [...]
> > > But I still didn't get your exact concern about security part. For
> > > version yes we still haven't worked out a sane way to represent
> > > vendor-specific compatibility requirement. But allowing user
> > > space to modify data through this interface has really no difference
> > > from allowing guest to modify data through trapped MMIO interface.
> > > mdev driver should guarantee that operations through both interfaces
> > > can modify only the state associated with the said mdev instance,
> > > w/o breaking the isolation boundary. Then the former becomes just
> > > a batch of operations to be verified in the same way as if they are
> > > done individually through the latter interface.  
> > 
> > It seems like you're assuming a working model for the vendor driver and
> > the data entering and exiting through this interface.  The vendor
> > drivers can expose data any way that they want.  All we need to do is
> > imagine that the migration data stream includes an array index count
> > somewhere which the user could modify to trigger the in-kernel vendor
> > driver to allocate an absurd array size and DoS the target.  This is
> > probably the most simplistic attack, possibly knowing the state machine
> > of the vendor driver a malicious user could trick it into providing
> > host kernel data.  We're not necessarily only conveying state that the
> > user already has access to via this interface, the vendor driver may
> > include non-visible internal state as well.  Even the state that is
> > user accessible is being pushed into the vendor driver via an alternate
> > path from the user mediation we have on the existing paths.  
> 
> Then I don't know how this concern can be effectively addressed 
> since you assume vendor drivers are not trusted here. and why do
> you trust vendor drivers on mediating existing path but not this
> alternative one? non-visible internal states just mean more stuff
> to be carefully scrutinized, which is not essentially causing a 
> conceptual difference of trust level.
> 
> Or can this concern be partially mitigated if we create some 
> test cases which poke random data through the new interface,
> and mark vendor drivers which pass such tests as trusted? Then
> there is also an open who should be in charge of such certification 
> process...

The vendor driver is necessarily trusted; it lives in the kernel and
works in the kernel address space.  Unfortunately that's also the risk
with passing data from userspace into the vendor driver, the vendor
driver needs to take every precaution in sanitizing and validating that
data.  I wish we had a common way to perform that checking, but it
seems that each vendor driver is going to need to define its own
protocol and battle its own bugs and exploits in the code
implementing that protocol.  For open source drivers we can continue to
rely on review and openness, for closed drivers... the user has already
accepted the risk to trust the driver themselves.  Perhaps all I can do
is raise the visibility that there are potential security issues here
and vendor drivers need to own that risk.

A fuzzing test would be great; we could at least validate whether a
vendor driver implements some sort of CRC test, but I don't think we
can create a certification process around that.  Thanks,

Alex
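The fuzzing idea above can be sketched in a hedged way. Everything below is illustrative: the callback name, the magic value, and the buffer layout are invented for this sketch and are not part of any real vendor driver or the proposed uAPI. The point is only that a restore path must validate before trusting anything in a user-supplied buffer:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STATE_MAGIC 0x56464953u  /* invented marker; real drivers define their own */

/* Toy stand-in for a vendor driver's restore path: validate the header
 * before trusting anything else in the buffer. */
static int load_state(const uint8_t *buf, size_t len)
{
    uint32_t magic;

    if (len < sizeof(magic))
        return -1;
    memcpy(&magic, buf, sizeof(magic));
    return magic == STATE_MAGIC ? 0 : -1;
}

/* Minimal fuzz loop: random buffers must never be accepted.
 * Returns 0 if every garbage buffer was rejected, -1 otherwise. */
static int fuzz_load_state(unsigned iterations, unsigned seed)
{
    uint8_t buf[64];

    srand(seed);
    while (iterations--) {
        for (size_t i = 0; i < sizeof(buf); i++)
            buf[i] = (uint8_t)(rand() & 0xff);
        if (load_state(buf, sizeof(buf)) == 0)
            return -1;  /* garbage accepted: validation is too weak */
    }
    return 0;
}
```

A real fuzzer would poke random data through the actual migration region interface rather than an in-process callback; this only shows the accept/reject contract such a test would exercise.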

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-13 19:14           ` Alex Williamson
@ 2019-03-14  1:12             ` Zhao Yan
  2019-03-14 22:44               ` Alex Williamson
  0 siblings, 1 reply; 133+ messages in thread
From: Zhao Yan @ 2019-03-14  1:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, Tian, Kevin, dgilbert,
	intel-gvt-dev,

On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> On Tue, 12 Mar 2019 21:13:01 -0400
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > hi Alex
> > Any comments to the sequence below?
> > 
> > Actually we have some concerns and suggestions to userspace-opaque migration
> > data.
> > 
> > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > migration. 
> >    e.g. vendor driver has to know state (running + not logging) should not
> >    return any data, and state (running + logging) should return whole
> >    snapshot first and dirty later. it also has to know qemu migration will
> >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> >    to adjust its behavior.
> 
> This all just sounds like defining the protocol we expect with the
> interface.  For instance if we define a session as beginning when
> logging is enabled and ending when the device is stopped and the
> interface reports no more data is available, then we can state that any
> partial accumulation of data is incomplete relative to migration.  If
> userspace wants to initiate a new migration stream, they can simply
> toggle logging.  How the vendor driver provides the data during the
> session is not defined, but beginning the session with a snapshot
> followed by repeated iterations of dirtied data is certainly a valid
> approach.
> 
> > 2. vendor driver cannot ensure userspace get all the data it intends to
> > save in pre-copy phase.
> >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> >   data in previous phase.
> 
> First, I don't think the device has control of when QEMU switches from
> pre-copy to stop-and-copy, the protocol needs to support that
> transition at any point.  However, it seems a simply data available
> counter provides an indication of when it might be optimal to make such
> a transition.  If a vendor driver follows a scheme as above, the
> available data counter would indicate a large value, the entire initial
> snapshot of the device.  As the migration continues and pages are
> dirtied, the device would reach a steady state amount of data
> available, depending on the guest activity.  This could indicate to the
> user to stop the device.  The migration stream would not be considered
> completed until the available data counter reaches zero while the
> device is in the stopped|logging state.
> 
> > 3. if all the sequence is tightly bound to live migration, can we remove the
> > logging state? what about adding two states migrate-in and migrate-out?
> > so there are four states: running, stopped, migrate-in, migrate-out.
> >    migrate-out is for source side when migration starts. together with
> >    state running and stopped, it can substitute state logging.
> >    migrate-in is for target side.
> 
> In fact, Kirti's implementation specifies a data direction, but I think
> we still need logging to indicate sessions.  I'd also assume that
> logging implies some overhead for the vendor driver.
>
ok. If you prefer logging, I'm ok with it. I just found migrate-in and
migrate-out more universal against hardware requirement changes.

> > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > > hi Alex
> > > thanks for your reply.
> > > 
> > > So, if we choose migration data to be userspace opaque, do you think below
> > > sequence is the right behavior for vendor driver to follow:
> > > 
> > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > vendor driver,  vendor driver should reject and return 0.
> 
> What would this state mean otherwise?  If we're not logging then it
> should not be expected that we can construct dirtied data from a
> previous read of the state before logging was enabled (it would be
> outside of the "session").  So at best this is an incomplete segment of
> the initial snapshot of the device, but that presumes how the vendor
> driver constructs the data.  I wouldn't necessarily mandate the vendor
> driver reject it, but I think we should consider it undefined and
> vendor specific relative to the migration interface.
> 
> > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > driver,
> > >    a. vendor driver should first query a whole snapshot of device memory
> > >    (let's use this term to represent device's standalone memory for now),
> > >    b. vendor driver returns a chunk of data just queried to userspace,
> > >    while recording current pos in data.
> > >    c. vendor driver finds all data just queried is finished transmitting to
> > >    userspace, and queries only dirty data in device memory now.
> > >    d. vendor driver returns a chunk of data just queried (this time is dirty
> > >    data) to userspace while recording current pos in data
> > >    e. if all data is transmitted to userspace and still GET_BUFFERs come from
> > >    userspace, vendor driver starts another round of dirty data query.
> 
> This is a valid vendor driver approach, but it's outside the scope of
> the interface definition.  A vendor driver could also decide to not
> provide any data until both stopped and logging are set and then
> provide a fixed, final snapshot.  The interface supports either
> approach by defining the protocol to interact with it.
> 
> > > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > > driver,
> > >    a. if vendor driver finds there's previously untransmitted data, returns
> > >    them until all transmitted.
> > >    b. vendor driver then queries dirty data again and transmits them.
> > >    c. at last, vendor driver queries device config data (which has to be
> > >    queried at last and sent once) and transmits them.
> 
> This seems broken, the vendor driver is presuming the user intentions.
> If logging is unset, we return to bullet 1, reading data is undefined
> and vendor specific.  It's outside of the session.
> 
> > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > vendor driver has to know more qemu's migration state, like migration
> > > called and failed. Do you think that's acceptable?
> 
> If migration aborts, logging is cleared and the device continues
> operation.  If a new migration is started, the session is initiated by
> enabling logging.  Sound reasonable?  Thanks,
>

For the flow, I still have a question.
There are 2 approaches below, which one do you prefer?

Approach A, in precopy stage, the sequence is

(1)
.save_live_pending --> return whole snapshot size
.save_live_iterate --> save whole snapshot

(2)
.save_live_pending --> get dirty data, return dirty data size
.save_live_iterate --> save all dirty data

(3)
.save_live_pending --> get dirty data again, return dirty data size
.save_live_iterate --> save all dirty data


Approach B, in precopy stage, the sequence is
(1)
.save_live_pending --> return whole snapshot size
.save_live_iterate --> save part of snapshot

(2)
.save_live_pending --> return rest of whole snapshot size +
                              current dirty data size
.save_live_iterate --> save part of snapshot 

(3) repeat (2) until whole snapshot saved.

(4) 
.save_live_pending --> get dirty data and return current dirty data size
.save_live_iterate --> save part of dirty data

(5)
.save_live_pending --> return rest of dirty data size +
			     delta size of dirty data
.save_live_iterate --> save part of dirty data

(6)
repeat (5) until precopy stops
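To make the Approach B bookkeeping concrete, here is a hedged sketch of the counter arithmetic. The struct and helpers are illustrative only, not the actual QEMU VFIO code or the real callback signatures:

```c
#include <stddef.h>

/* Illustrative bookkeeping for Approach B; not the real QEMU structures. */
typedef struct {
    size_t snapshot_size;  /* size of the initial device snapshot       */
    size_t snapshot_sent;  /* snapshot bytes already saved by userspace */
    size_t dirty_size;     /* dirty data accumulated since the snapshot */
} MigState;

/* .save_live_pending, Approach B: report the unsent remainder of the
 * snapshot plus whatever dirty data has accumulated so far. */
static size_t save_live_pending(const MigState *s)
{
    return (s->snapshot_size - s->snapshot_sent) + s->dirty_size;
}

/* .save_live_iterate, Approach B: save one chunk, snapshot data first. */
static size_t save_live_iterate(MigState *s, size_t chunk)
{
    size_t n = s->snapshot_size - s->snapshot_sent;

    if (n > 0) {                    /* steps (1)-(3): drain the snapshot */
        if (n > chunk)
            n = chunk;
        s->snapshot_sent += n;
    } else {                        /* steps (4)-(6): drain dirty data */
        n = s->dirty_size < chunk ? s->dirty_size : chunk;
        s->dirty_size -= n;
    }
    return n;                       /* stand-in for bytes written out */
}
```

Under this model, each .save_live_pending call reports rest-of-snapshot plus current dirty data, exactly as in steps (2) and (5) above.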


> Alex
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-14  1:12             ` Zhao Yan
@ 2019-03-14 22:44               ` Alex Williamson
  2019-03-14 23:05                 ` Zhao Yan
  0 siblings, 1 reply; 133+ messages in thread
From: Alex Williamson @ 2019-03-14 22:44 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, Tian, Kevin, dgilbert,
	intel-gvt-dev,

On Wed, 13 Mar 2019 21:12:22 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > On Tue, 12 Mar 2019 21:13:01 -0400
> > Zhao Yan <yan.y.zhao@intel.com> wrote:
> >   
> > > hi Alex
> > > Any comments to the sequence below?
> > > 
> > > Actaully we have some concerns and suggestions to userspace-opaque migration
> > > data.
> > > 
> > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > migration. 
> > >    e.g. vendor driver has to know state (running + not logging) should not
> > >    return any data, and state (running + logging) should return whole
> > >    snapshot first and dirty later. it also has to know qemu migration will
> > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > >    to adjust its behavior.  
> > 
> > This all just sounds like defining the protocol we expect with the
> > interface.  For instance if we define a session as beginning when
> > logging is enabled and ending when the device is stopped and the
> > interface reports no more data is available, then we can state that any
> > partial accumulation of data is incomplete relative to migration.  If
> > userspace wants to initiate a new migration stream, they can simply
> > toggle logging.  How the vendor driver provides the data during the
> > session is not defined, but beginning the session with a snapshot
> > followed by repeated iterations of dirtied data is certainly a valid
> > approach.
> >   
> > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > save in pre-copy phase.
> > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > >   data in previous phase.  
> > 
> > First, I don't think the device has control of when QEMU switches from
> > pre-copy to stop-and-copy, the protocol needs to support that
> > transition at any point.  However, it seems a simply data available
> > counter provides an indication of when it might be optimal to make such
> > a transition.  If a vendor driver follows a scheme as above, the
> > available data counter would indicate a large value, the entire initial
> > snapshot of the device.  As the migration continues and pages are
> > dirtied, the device would reach a steady state amount of data
> > available, depending on the guest activity.  This could indicate to the
> > user to stop the device.  The migration stream would not be considered
> > completed until the available data counter reaches zero while the
> > device is in the stopped|logging state.
> >   
> > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > logging state? what about adding two states migrate-in and migrate-out?
> > > so there are four states: running, stopped, migrate-in, migrate-out.
> > >    migrate-out is for source side when migration starts. together with
> > >    state running and stopped, it can substitute state logging.
> > >    migrate-in is for target side.  
> > 
> > In fact, Kirti's implementation specifies a data direction, but I think
> > we still need logging to indicate sessions.  I'd also assume that
> > logging implies some overhead for the vendor driver.
> >  
> ok. If you prefer logging, I'm ok with it. just found migrate-in and
> migrate-out are more universal againt hardware requirement changes.
> 
> > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:  
> > > > hi Alex
> > > > thanks for your reply.
> > > > 
> > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > sequence is the right behavior for vendor driver to follow:
> > > > 
> > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > vendor driver,  vendor driver should reject and return 0.  
> > 
> > What would this state mean otherwise?  If we're not logging then it
> > should not be expected that we can construct dirtied data from a
> > previous read of the state before logging was enabled (it would be
> > outside of the "session").  So at best this is an incomplete segment of
> > the initial snapshot of the device, but that presumes how the vendor
> > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > driver reject it, but I think we should consider it undefined and
> > vendor specific relative to the migration interface.
> >   
> > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > driver,
> > > >    a. vendor driver shoud first query a whole snapshot of device memory
> > > >    (let's use this term to represent device's standalone memory for now),
> > > >    b. vendor driver returns a chunk of data just queried to userspace,
> > > >    while recording current pos in data.
> > > >    c. vendor driver finds all data just queried is finished transmitting to
> > > >    userspace, and queries only dirty data in device memory now.
> > > >    d. vendor driver returns a chunk of data just quered (this time is dirty
> > > >    data )to userspace while recording current pos in data
> > > >    e. if all data is transmited to usespace and still GET_BUFFERs come from
> > > >    userspace, vendor driver starts another round of dirty data query.  
> > 
> > This is a valid vendor driver approach, but it's outside the scope of
> > the interface definition.  A vendor driver could also decide to not
> > provide any data until both stopped and logging are set and then
> > provide a fixed, final snapshot.  The interface supports either
> > approach by defining the protocol to interact with it.
> >   
> > > > 3. if LOGGING state is unset then, and userpace calls GET_BUFFER to vendor
> > > > driver,
> > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > >    them until all transmitted.
> > > >    b. vendor driver then queries dirty data again and transmits them.
> > > >    c. at last, vendor driver queris device config data (which has to be
> > > >    queried at last and sent once) and transmits them.  
> > 
> > This seems broken, the vendor driver is presuming the user intentions.
> > If logging is unset, we return to bullet 1, reading data is undefined
> > and vendor specific.  It's outside of the session.
> >   
> > > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > > vendor driver has to know more qemu's migration state, like migration
> > > > called and failed. Do you think that's acceptable?  
> > 
> > If migration aborts, logging is cleared and the device continues
> > operation.  If a new migration is started, the session is initiated by
> > enabling logging.  Sound reasonable?  Thanks,
> >  
> 
> For the flow, I still have a question.
> There are 2 approaches below, which one do you prefer?
> 
> Approach A, in precopy stage, the sequence is
> 
> (1)
> .save_live_pending --> return whole snapshot size
> .save_live_iterate --> save whole snapshot
> 
> (2)
> .save_live_pending --> get dirty data, return dirty data size
> .save_live_iterate --> save all dirty data
> 
> (3)
> .save_live_pending --> get dirty data again, return dirty data size
> .save_live_iterate --> save all dirty data
> 
> 
> Approach B, in precopy stage, the sequence is
> (1)
> .save_live_pending --> return whole snapshot size
> .save_live_iterate --> save part of snapshot
> 
> (2)
> .save_live_pending --> return rest part of whole snapshot size +
>                               current dirty data size
> .save_live_iterate --> save part of snapshot 
> 
> (3) repeat (2) until whole snapshot saved.
> 
> (4) 
> .save_live_pending --> get diryt data and return current dirty data size
> .save_live_iterate --> save part of dirty data
> 
> (5)
> .save_live_pending --> return reset part of dirty data size +
> 			     delta size of dirty data
> .save_live_iterate --> save part of dirty data
> 
> (6)
> repeat (5) until precopy stops

I don't really understand the question here.  If the vendor driver's
approach is to send a full snapshot followed by iterations of dirty
data, then when the user enables logging and reads the counter for
available data it should report the (size of the snapshot).  The next
time the user reads the counter, it should report
(size of the snapshot) - (what the user has already read) + (size of
the dirty data since the snapshot).  As the user continues to read past
the snapshot data, the available data counter transitions to reporting
only the size of the remaining dirty data, which is monotonically
increasing.  I guess this would be more similar to your approach B,
which seems to suggest that the interface needs to continue providing
data regardless of whether the user fully exhausted the available data
from the previous cycle.  Thanks,

Alex
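The session rule described above (logging delimits the session; the stream is done only in the stopped|logging state with nothing left to read) can be sketched as follows. The state bits are invented for illustration and are not the proposed register layout:

```c
#include <stdbool.h>
#include <stddef.h>

/* Invented device-state bits for illustration only. */
#define DEV_STATE_RUNNING  (1u << 0)
#define DEV_STATE_LOGGING  (1u << 1)

/* The migration stream is complete only when the device is in the
 * stopped|logging state and the available-data counter has reached
 * zero.  Any partial accumulation of data outside a logging session
 * is incomplete relative to migration. */
static bool migration_stream_complete(unsigned state, size_t data_available)
{
    return (state & DEV_STATE_LOGGING) &&
           !(state & DEV_STATE_RUNNING) &&
           data_available == 0;
}
```

Toggling logging off and back on would start a fresh session, which is what lets userspace abort and restart a migration without extra coordination.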

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-14 22:44               ` Alex Williamson
@ 2019-03-14 23:05                 ` Zhao Yan
  2019-03-15  2:24                   ` Alex Williamson
  0 siblings, 1 reply; 133+ messages in thread
From: Zhao Yan @ 2019-03-14 23:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, Tian, Kevin, dgilbert,
	intel-gvt-dev,

On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> On Wed, 13 Mar 2019 21:12:22 -0400
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > >   
> > > > hi Alex
> > > > Any comments to the sequence below?
> > > > 
> > > > Actaully we have some concerns and suggestions to userspace-opaque migration
> > > > data.
> > > > 
> > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > migration. 
> > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > >    return any data, and state (running + logging) should return whole
> > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > >    to adjust its behavior.  
> > > 
> > > This all just sounds like defining the protocol we expect with the
> > > interface.  For instance if we define a session as beginning when
> > > logging is enabled and ending when the device is stopped and the
> > > interface reports no more data is available, then we can state that any
> > > partial accumulation of data is incomplete relative to migration.  If
> > > userspace wants to initiate a new migration stream, they can simply
> > > toggle logging.  How the vendor driver provides the data during the
> > > session is not defined, but beginning the session with a snapshot
> > > followed by repeated iterations of dirtied data is certainly a valid
> > > approach.
> > >   
> > > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > > save in pre-copy phase.
> > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > >   data in previous phase.  
> > > 
> > > First, I don't think the device has control of when QEMU switches from
> > > pre-copy to stop-and-copy, the protocol needs to support that
> > > transition at any point.  However, it seems a simply data available
> > > counter provides an indication of when it might be optimal to make such
> > > a transition.  If a vendor driver follows a scheme as above, the
> > > available data counter would indicate a large value, the entire initial
> > > snapshot of the device.  As the migration continues and pages are
> > > dirtied, the device would reach a steady state amount of data
> > > available, depending on the guest activity.  This could indicate to the
> > > user to stop the device.  The migration stream would not be considered
> > > completed until the available data counter reaches zero while the
> > > device is in the stopped|logging state.
> > >   
> > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > >    migrate-out is for source side when migration starts. together with
> > > >    state running and stopped, it can substitute state logging.
> > > >    migrate-in is for target side.  
> > > 
> > > In fact, Kirti's implementation specifies a data direction, but I think
> > > we still need logging to indicate sessions.  I'd also assume that
> > > logging implies some overhead for the vendor driver.
> > >  
> > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > migrate-out are more universal againt hardware requirement changes.
> > 
> > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:  
> > > > > hi Alex
> > > > > thanks for your reply.
> > > > > 
> > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > sequence is the right behavior for vendor driver to follow:
> > > > > 
> > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > vendor driver,  vendor driver should reject and return 0.  
> > > 
> > > What would this state mean otherwise?  If we're not logging then it
> > > should not be expected that we can construct dirtied data from a
> > > previous read of the state before logging was enabled (it would be
> > > outside of the "session").  So at best this is an incomplete segment of
> > > the initial snapshot of the device, but that presumes how the vendor
> > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > driver reject it, but I think we should consider it undefined and
> > > vendor specific relative to the migration interface.
> > >   
> > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > driver,
> > > > >    a. vendor driver shoud first query a whole snapshot of device memory
> > > > >    (let's use this term to represent device's standalone memory for now),
> > > > >    b. vendor driver returns a chunk of data just queried to userspace,
> > > > >    while recording current pos in data.
> > > > >    c. vendor driver finds all data just queried is finished transmitting to
> > > > >    userspace, and queries only dirty data in device memory now.
> > > > >    d. vendor driver returns a chunk of data just quered (this time is dirty
> > > > >    data )to userspace while recording current pos in data
> > > > >    e. if all data is transmited to usespace and still GET_BUFFERs come from
> > > > >    userspace, vendor driver starts another round of dirty data query.  
> > > 
> > > This is a valid vendor driver approach, but it's outside the scope of
> > > the interface definition.  A vendor driver could also decide to not
> > > provide any data until both stopped and logging are set and then
> > > provide a fixed, final snapshot.  The interface supports either
> > > approach by defining the protocol to interact with it.
> > >   
> > > > > 3. if LOGGING state is unset then, and userpace calls GET_BUFFER to vendor
> > > > > driver,
> > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > >    them until all transmitted.
> > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > >    c. at last, vendor driver queris device config data (which has to be
> > > > >    queried at last and sent once) and transmits them.  
> > > 
> > > This seems broken, the vendor driver is presuming the user intentions.
> > > If logging is unset, we return to bullet 1, reading data is undefined
> > > and vendor specific.  It's outside of the session.
> > >   
> > > > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > > > vendor driver has to know more qemu's migration state, like migration
> > > > > called and failed. Do you think that's acceptable?  
> > > 
> > > If migration aborts, logging is cleared and the device continues
> > > operation.  If a new migration is started, the session is initiated by
> > > enabling logging.  Sound reasonable?  Thanks,
> > >  
> > 
> > For the flow, I still have a question.
> > There are 2 approaches below, which one do you prefer?
> > 
> > Approach A, in precopy stage, the sequence is
> > 
> > (1)
> > .save_live_pending --> return whole snapshot size
> > .save_live_iterate --> save whole snapshot
> > 
> > (2)
> > .save_live_pending --> get dirty data, return dirty data size
> > .save_live_iterate --> save all dirty data
> > 
> > (3)
> > .save_live_pending --> get dirty data again, return dirty data size
> > .save_live_iterate --> save all dirty data
> > 
> > 
> > Approach B, in precopy stage, the sequence is
> > (1)
> > .save_live_pending --> return whole snapshot size
> > .save_live_iterate --> save part of snapshot
> > 
> > (2)
> > .save_live_pending --> return rest part of whole snapshot size +
> >                               current dirty data size
> > .save_live_iterate --> save part of snapshot 
> > 
> > (3) repeat (2) until whole snapshot saved.
> > 
> > (4) 
> > .save_live_pending --> get diryt data and return current dirty data size
> > .save_live_iterate --> save part of dirty data
> > 
> > (5)
> > .save_live_pending --> return reset part of dirty data size +
> > 			     delta size of dirty data
> > .save_live_iterate --> save part of dirty data
> > 
> > (6)
> > repeat (5) until precopy stops
> 
> I don't really understand the question here.  If the vendor driver's
> approach is to send a full snapshot followed by iterations of dirty
> data, then when the user enables logging and reads the counter for
> available data it should report the (size of the snapshot).  The next
> time the user reads the counter, it should report the size of the
> (size of the snapshot) - (what the user has already read) + (size of
> the dirty data since the snapshot).  As the user continues to read past
> the snapshot data, the available data counter transitions to reporting
> only the size of the remaining dirty data, which is monotonically
> increasing.  I guess this would be more similar to your approach B,
> which seems to suggest that the interface needs to continue providing
> data regardless of whether the user fully exhausted the available data
> from the previous cycle.  Thanks,
>

Right. But regarding the VFIO migration code in QEMU, rather than saving
one chunk each time, do you think it is better to exhaust all data
reported by .save_live_pending in each .save_live_iterate callback?
(Even though the vendor driver will handle the case where userspace
cannot exhaust all data, QEMU's VFIO code can still try to save as much
available data as it can each time.)
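The "exhaust everything reported" option can be sketched as below. The function name is hypothetical and the chunk copy is a stand-in for reading the device data region; only the draining loop itself is the point:

```c
#include <stddef.h>

/* Hedged sketch: one iterate call drains the amount reported by
 * .save_live_pending at entry, in region-sized chunks.  Returns the
 * number of passes; *saved receives the total bytes "copied". */
static int iterate_drain_all(size_t pending, size_t region_size, size_t *saved)
{
    int passes = 0;

    *saved = 0;
    while (*saved < pending) {
        size_t n = pending - *saved;

        if (n > region_size)
            n = region_size;  /* one GET_BUFFER-style read of the region */
        *saved += n;
        passes++;
    }
    return passes;
}
```

The alternative (one chunk per iterate call) simply returns after the first pass and relies on the next .save_live_pending to report the remainder.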

> Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-14 23:05                 ` Zhao Yan
@ 2019-03-15  2:24                   ` Alex Williamson
  2019-03-18  2:51                     ` Zhao Yan
  0 siblings, 1 reply; 133+ messages in thread
From: Alex Williamson @ 2019-03-15  2:24 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Ken.Xue, Tian, Kevin, dgilbert,
	intel-gvt-dev,

On Thu, 14 Mar 2019 19:05:06 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> > On Wed, 13 Mar 2019 21:12:22 -0400
> > Zhao Yan <yan.y.zhao@intel.com> wrote:
> >   
> > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:  
> > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > hi Alex
> > > > > Any comments to the sequence below?
> > > > > 
> > > > > Actually we have some concerns and suggestions about userspace-opaque migration
> > > > > data.
> > > > > 
> > > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > > migration. 
> > > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > > >    return any data, and state (running + logging) should return whole
> > > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > > >    to adjust its behavior.    
> > > > 
> > > > This all just sounds like defining the protocol we expect with the
> > > > interface.  For instance if we define a session as beginning when
> > > > logging is enabled and ending when the device is stopped and the
> > > > interface reports no more data is available, then we can state that any
> > > > partial accumulation of data is incomplete relative to migration.  If
> > > > userspace wants to initiate a new migration stream, they can simply
> > > > toggle logging.  How the vendor driver provides the data during the
> > > > session is not defined, but beginning the session with a snapshot
> > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > approach.
> > > >     
> > > > > 2. vendor driver cannot ensure userspace gets all the data it intends to
> > > > > save in pre-copy phase.
> > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > > >   data in previous phase.    
> > > > 
> > > > First, I don't think the device has control of when QEMU switches from
> > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > transition at any point.  However, it seems a simple data available
> > > > counter provides an indication of when it might be optimal to make such
> > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > available data counter would indicate a large value, the entire initial
> > > > snapshot of the device.  As the migration continues and pages are
> > > > dirtied, the device would reach a steady state amount of data
> > > > available, depending on the guest activity.  This could indicate to the
> > > > user to stop the device.  The migration stream would not be considered
> > > > completed until the available data counter reaches zero while the
> > > > device is in the stopped|logging state.
> > > >     
> > > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > >    migrate-out is for source side when migration starts. together with
> > > > >    state running and stopped, it can substitute state logging.
> > > > >    migrate-in is for target side.    
> > > > 
> > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > we still need logging to indicate sessions.  I'd also assume that
> > > > logging implies some overhead for the vendor driver.
> > > >    
> > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > migrate-out are more universal against hardware requirement changes.
> > >   
> > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:    
> > > > > > hi Alex
> > > > > > thanks for your reply.
> > > > > > 
> > > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > 
> > > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > > vendor driver,  vendor driver should reject and return 0.    
> > > > 
> > > > What would this state mean otherwise?  If we're not logging then it
> > > > should not be expected that we can construct dirtied data from a
> > > > previous read of the state before logging was enabled (it would be
> > > > outside of the "session").  So at best this is an incomplete segment of
> > > > the initial snapshot of the device, but that presumes how the vendor
> > > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > > driver reject it, but I think we should consider it undefined and
> > > > vendor specific relative to the migration interface.
> > > >     
> > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > > driver,
> > > > > >    a. vendor driver should first query a whole snapshot of device memory
> > > > > >    (let's use this term to represent the device's standalone memory for now),
> > > > > >    b. vendor driver returns a chunk of the data just queried to userspace,
> > > > > >    while recording the current position in the data.
> > > > > >    c. vendor driver finds all data just queried has finished transmitting to
> > > > > >    userspace, and queries only dirty data in device memory now.
> > > > > >    d. vendor driver returns a chunk of the data just queried (this time dirty
> > > > > >    data) to userspace while recording the current position in the data.
> > > > > >    e. if all data is transmitted to userspace and still GET_BUFFERs come from
> > > > > >    userspace, vendor driver starts another round of dirty data query.    
> > > > 
> > > > This is a valid vendor driver approach, but it's outside the scope of
> > > > the interface definition.  A vendor driver could also decide to not
> > > > provide any data until both stopped and logging are set and then
> > > > provide a fixed, final snapshot.  The interface supports either
> > > > approach by defining the protocol to interact with it.
> > > >     
> > > > > > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > > > > > driver,
> > > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > > >    them until all transmitted.
> > > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > > >    c. at last, vendor driver queries device config data (which has to be
> > > > > >    queried at last and sent once) and transmits them.    
> > > > 
> > > > This seems broken, the vendor driver is presuming the user intentions.
> > > > If logging is unset, we return to bullet 1, reading data is undefined
> > > > and vendor specific.  It's outside of the session.
> > > >     
> > > > > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > > > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > > > > vendor driver has to know more about qemu's migration state, like migration
> > > > > > called and failed. Do you think that's acceptable?    
> > > > 
> > > > If migration aborts, logging is cleared and the device continues
> > > > operation.  If a new migration is started, the session is initiated by
> > > > enabling logging.  Sound reasonable?  Thanks,
> > > >    
> > > 
> > > For the flow, I still have a question.
> > > There are 2 approaches below, which one do you prefer?
> > > 
> > > Approach A, in precopy stage, the sequence is
> > > 
> > > (1)
> > > .save_live_pending --> return whole snapshot size
> > > .save_live_iterate --> save whole snapshot
> > > 
> > > (2)
> > > .save_live_pending --> get dirty data, return dirty data size
> > > .save_live_iterate --> save all dirty data
> > > 
> > > (3)
> > > .save_live_pending --> get dirty data again, return dirty data size
> > > .save_live_iterate --> save all dirty data
> > > 
> > > 
> > > Approach B, in precopy stage, the sequence is
> > > (1)
> > > .save_live_pending --> return whole snapshot size
> > > .save_live_iterate --> save part of snapshot
> > > 
> > > (2)
> > > .save_live_pending --> return rest part of whole snapshot size +
> > >                               current dirty data size
> > > .save_live_iterate --> save part of snapshot 
> > > 
> > > (3) repeat (2) until whole snapshot saved.
> > > 
> > > (4) 
> > > .save_live_pending --> get dirty data and return current dirty data size
> > > .save_live_iterate --> save part of dirty data
> > > 
> > > (5)
> > > .save_live_pending --> return rest part of dirty data size +
> > > 			     delta size of dirty data
> > > .save_live_iterate --> save part of dirty data
> > > 
> > > (6)
> > > repeat (5) until precopy stops  
> > 
> > I don't really understand the question here.  If the vendor driver's
> > approach is to send a full snapshot followed by iterations of dirty
> > data, then when the user enables logging and reads the counter for
> > available data it should report the (size of the snapshot).  The next
> > time the user reads the counter, it should report
> > (size of the snapshot) - (what the user has already read) + (size of
> > the dirty data since the snapshot).  As the user continues to read past
> > the snapshot data, the available data counter transitions to reporting
> > only the size of the remaining dirty data, which is monotonically
> > increasing.  I guess this would be more similar to your approach B,
> > which seems to suggest that the interface needs to continue providing
> > data regardless of whether the user fully exhausted the available data
> > from the previous cycle.  Thanks,
> >  
> 
> Right. But regarding the VFIO migration code in QEMU, rather than saving
> one chunk each time, do you think it is better to exhaust all data reported
> by .save_live_pending in each .save_live_iterate callback? (Even though the
> vendor driver will handle the case where userspace cannot exhaust all data,
> the VFIO code in QEMU can still try to save as much available data as it
> can each time.)

Don't you suspect that some devices might have state that's too large
to process in each iteration?  I expect we'll need to use heuristics on
data size or time spent on each iteration round such that some devices
might be able to fully process their pending data while others will
require multiple passes or make up the balance once we've entered stop
and copy.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-15  2:24                   ` Alex Williamson
@ 2019-03-18  2:51                     ` Zhao Yan
  2019-03-18  3:09                       ` Alex Williamson
  0 siblings, 1 reply; 133+ messages in thread
From: Zhao Yan @ 2019-03-18  2:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Wang, Zhi A, Tian, Kevin, dgilbert,
	intel-gvt-dev,

On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> On Thu, 14 Mar 2019 19:05:06 -0400
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:  
> > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > hi Alex
> > > > > > Any comments to the sequence below?
> > > > > > 
> > > > > > Actually we have some concerns and suggestions about userspace-opaque migration
> > > > > > data.
> > > > > > 
> > > > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > > > migration. 
> > > > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > > > >    return any data, and state (running + logging) should return whole
> > > > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > > > >    to adjust its behavior.    
> > > > > 
> > > > > This all just sounds like defining the protocol we expect with the
> > > > > interface.  For instance if we define a session as beginning when
> > > > > logging is enabled and ending when the device is stopped and the
> > > > > interface reports no more data is available, then we can state that any
> > > > > partial accumulation of data is incomplete relative to migration.  If
> > > > > userspace wants to initiate a new migration stream, they can simply
> > > > > toggle logging.  How the vendor driver provides the data during the
> > > > > session is not defined, but beginning the session with a snapshot
> > > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > > approach.
> > > > >     
> > > > > > 2. vendor driver cannot ensure userspace gets all the data it intends to
> > > > > > save in pre-copy phase.
> > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > > > >   data in previous phase.    
> > > > > 
> > > > > First, I don't think the device has control of when QEMU switches from
> > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > transition at any point.  However, it seems a simple data available
> > > > > counter provides an indication of when it might be optimal to make such
> > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > available data counter would indicate a large value, the entire initial
> > > > > snapshot of the device.  As the migration continues and pages are
> > > > > dirtied, the device would reach a steady state amount of data
> > > > > available, depending on the guest activity.  This could indicate to the
> > > > > user to stop the device.  The migration stream would not be considered
> > > > > completed until the available data counter reaches zero while the
> > > > > device is in the stopped|logging state.
> > > > >     
> > > > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > > >    migrate-out is for source side when migration starts. together with
> > > > > >    state running and stopped, it can substitute state logging.
> > > > > >    migrate-in is for target side.    
> > > > > 
> > > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > > we still need logging to indicate sessions.  I'd also assume that
> > > > > logging implies some overhead for the vendor driver.
> > > > >    
> > > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > > migrate-out are more universal against hardware requirement changes.
> > > >   
> > > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:    
> > > > > > > hi Alex
> > > > > > > thanks for your reply.
> > > > > > > 
> > > > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > > 
> > > > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > > > vendor driver,  vendor driver should reject and return 0.    
> > > > > 
> > > > > What would this state mean otherwise?  If we're not logging then it
> > > > > should not be expected that we can construct dirtied data from a
> > > > > previous read of the state before logging was enabled (it would be
> > > > > outside of the "session").  So at best this is an incomplete segment of
> > > > > the initial snapshot of the device, but that presumes how the vendor
> > > > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > > > driver reject it, but I think we should consider it undefined and
> > > > > vendor specific relative to the migration interface.
> > > > >     
> > > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > > > driver,
> > > > > > >    a. vendor driver should first query a whole snapshot of device memory
> > > > > > >    (let's use this term to represent the device's standalone memory for now),
> > > > > > >    b. vendor driver returns a chunk of the data just queried to userspace,
> > > > > > >    while recording the current position in the data.
> > > > > > >    c. vendor driver finds all data just queried has finished transmitting to
> > > > > > >    userspace, and queries only dirty data in device memory now.
> > > > > > >    d. vendor driver returns a chunk of the data just queried (this time dirty
> > > > > > >    data) to userspace while recording the current position in the data.
> > > > > > >    e. if all data is transmitted to userspace and still GET_BUFFERs come from
> > > > > > >    userspace, vendor driver starts another round of dirty data query.    
> > > > > 
> > > > > This is a valid vendor driver approach, but it's outside the scope of
> > > > > the interface definition.  A vendor driver could also decide to not
> > > > > provide any data until both stopped and logging are set and then
> > > > > provide a fixed, final snapshot.  The interface supports either
> > > > > approach by defining the protocol to interact with it.
> > > > >     
> > > > > > > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > > > > > > driver,
> > > > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > > > >    them until all transmitted.
> > > > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > > > >    c. at last, vendor driver queries device config data (which has to be
> > > > > > >    queried at last and sent once) and transmits them.    
> > > > > 
> > > > > This seems broken, the vendor driver is presuming the user intentions.
> > > > > If logging is unset, we return to bullet 1, reading data is undefined
> > > > > and vendor specific.  It's outside of the session.
> > > > >     
> > > > > > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > > > > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > > > > > vendor driver has to know more about qemu's migration state, like migration
> > > > > > > called and failed. Do you think that's acceptable?    
> > > > > 
> > > > > If migration aborts, logging is cleared and the device continues
> > > > > operation.  If a new migration is started, the session is initiated by
> > > > > enabling logging.  Sound reasonable?  Thanks,
> > > > >    
> > > > 
> > > > For the flow, I still have a question.
> > > > There are 2 approaches below, which one do you prefer?
> > > > 
> > > > Approach A, in precopy stage, the sequence is
> > > > 
> > > > (1)
> > > > .save_live_pending --> return whole snapshot size
> > > > .save_live_iterate --> save whole snapshot
> > > > 
> > > > (2)
> > > > .save_live_pending --> get dirty data, return dirty data size
> > > > .save_live_iterate --> save all dirty data
> > > > 
> > > > (3)
> > > > .save_live_pending --> get dirty data again, return dirty data size
> > > > .save_live_iterate --> save all dirty data
> > > > 
> > > > 
> > > > Approach B, in precopy stage, the sequence is
> > > > (1)
> > > > .save_live_pending --> return whole snapshot size
> > > > .save_live_iterate --> save part of snapshot
> > > > 
> > > > (2)
> > > > .save_live_pending --> return rest part of whole snapshot size +
> > > >                               current dirty data size
> > > > .save_live_iterate --> save part of snapshot 
> > > > 
> > > > (3) repeat (2) until whole snapshot saved.
> > > > 
> > > > (4) 
> > > > .save_live_pending --> get dirty data and return current dirty data size
> > > > .save_live_iterate --> save part of dirty data
> > > > 
> > > > (5)
> > > > .save_live_pending --> return rest part of dirty data size +
> > > > 			     delta size of dirty data
> > > > .save_live_iterate --> save part of dirty data
> > > > 
> > > > (6)
> > > > repeat (5) until precopy stops  
> > > 
> > > I don't really understand the question here.  If the vendor driver's
> > > approach is to send a full snapshot followed by iterations of dirty
> > > data, then when the user enables logging and reads the counter for
> > > available data it should report the (size of the snapshot).  The next
> > > time the user reads the counter, it should report
> > > (size of the snapshot) - (what the user has already read) + (size of
> > > the dirty data since the snapshot).  As the user continues to read past
> > > the snapshot data, the available data counter transitions to reporting
> > > only the size of the remaining dirty data, which is monotonically
> > > increasing.  I guess this would be more similar to your approach B,
> > > which seems to suggest that the interface needs to continue providing
> > > data regardless of whether the user fully exhausted the available data
> > > from the previous cycle.  Thanks,
> > >  
> > 
> > Right. But regarding the VFIO migration code in QEMU, rather than saving
> > one chunk each time, do you think it is better to exhaust all data reported
> > by .save_live_pending in each .save_live_iterate callback? (Even though the
> > vendor driver will handle the case where userspace cannot exhaust all data,
> > the VFIO code in QEMU can still try to save as much available data as it
> > can each time.)
> 
> Don't you suspect that some devices might have state that's too large
> to process in each iteration?  I expect we'll need to use heuristics on
> data size or time spent on each iteration round such that some devices
> might be able to fully process their pending data while others will
> require multiple passes or make up the balance once we've entered stop
> and copy.  Thanks,
>
hi Alex
What about looping and draining the pending data in each iteration? :)

Thanks
Yan
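Yan's "loop and drain" idea can be sketched as below. This is illustrative only (invented names, not QEMU code): each .save_live_iterate keeps reading until the amount reported by .save_live_pending on entry has been exhausted, with the caveat Alex raises that a device with very large state makes a single iteration arbitrarily long.

```python
# Minimal sketch of the "drain everything reported" alternative.
# Illustrative names only; this is not QEMU or vendor-driver code.

def save_live_iterate_drain(read_chunk, pending_at_entry, chunk=4096):
    """Keep reading region-sized chunks until the amount reported by
    .save_live_pending at entry to this iteration is exhausted."""
    saved = 0
    while saved < pending_at_entry:
        n = read_chunk(min(chunk, pending_at_entry - saved))
        if n == 0:
            break            # vendor driver has nothing more right now
        saved += n
    return saved

# Draining 10 KB reported at entry takes a single iteration,
# however many chunk-sized reads that requires:
buf = {"left": 10_240}

def read_chunk(n):
    n = min(n, buf["left"])
    buf["left"] -= n
    return n

assert save_live_iterate_drain(read_chunk, 10_240) == 10_240
assert buf["left"] == 0
```

The zero-length-read check lets the loop terminate early if the driver under-delivers relative to what it reported, rather than spinning.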



> Alex
> _______________________________________________
> intel-gvt-dev mailing list
> intel-gvt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-18  2:51                     ` Zhao Yan
@ 2019-03-18  3:09                       ` Alex Williamson
  2019-03-18  3:27                         ` Zhao Yan
  0 siblings, 1 reply; 133+ messages in thread
From: Alex Williamson @ 2019-03-18  3:09 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Wang, Zhi A, Tian, Kevin, dgilbert,
	intel-gvt-dev

On Sun, 17 Mar 2019 22:51:27 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> > On Thu, 14 Mar 2019 19:05:06 -0400
> > Zhao Yan <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:  
> > > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:    
> > > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > > > >       
> > > > > > > hi Alex
> > > > > > > Any comments to the sequence below?
> > > > > > > 
> > > > > > > Actually we have some concerns and suggestions about userspace-opaque migration
> > > > > > > data.
> > > > > > > 
> > > > > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > > > > migration. 
> > > > > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > > > > >    return any data, and state (running + logging) should return whole
> > > > > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > > > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > > > > >    to adjust its behavior.      
> > > > > > 
> > > > > > This all just sounds like defining the protocol we expect with the
> > > > > > interface.  For instance if we define a session as beginning when
> > > > > > logging is enabled and ending when the device is stopped and the
> > > > > > interface reports no more data is available, then we can state that any
> > > > > > partial accumulation of data is incomplete relative to migration.  If
> > > > > > userspace wants to initiate a new migration stream, they can simply
> > > > > > toggle logging.  How the vendor driver provides the data during the
> > > > > > session is not defined, but beginning the session with a snapshot
> > > > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > > > approach.
> > > > > >       
> > > > > > > 2. vendor driver cannot ensure userspace gets all the data it intends to
> > > > > > > save in pre-copy phase.
> > > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > > > > >   data in previous phase.      
> > > > > > 
> > > > > > First, I don't think the device has control of when QEMU switches from
> > > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > > transition at any point.  However, it seems a simple data available
> > > > > > counter provides an indication of when it might be optimal to make such
> > > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > > available data counter would indicate a large value, the entire initial
> > > > > > snapshot of the device.  As the migration continues and pages are
> > > > > > dirtied, the device would reach a steady state amount of data
> > > > > > available, depending on the guest activity.  This could indicate to the
> > > > > > user to stop the device.  The migration stream would not be considered
> > > > > > completed until the available data counter reaches zero while the
> > > > > > device is in the stopped|logging state.
> > > > > >       
> > > > > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > > > >    migrate-out is for source side when migration starts. together with
> > > > > > >    state running and stopped, it can substitute state logging.
> > > > > > >    migrate-in is for target side.      
> > > > > > 
> > > > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > > > we still need logging to indicate sessions.  I'd also assume that
> > > > > > logging implies some overhead for the vendor driver.
> > > > > >      
> > > > > ok. If you prefer logging, I'm ok with it. just found migrate-in and
> > > > > migrate-out are more universal against hardware requirement changes.
> > > > >     
> > > > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:      
> > > > > > > > hi Alex
> > > > > > > > thanks for your reply.
> > > > > > > > 
> > > > > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > > > 
> > > > > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > > > > vendor driver,  vendor driver should reject and return 0.      
> > > > > > 
> > > > > > What would this state mean otherwise?  If we're not logging then it
> > > > > > should not be expected that we can construct dirtied data from a
> > > > > > previous read of the state before logging was enabled (it would be
> > > > > > outside of the "session").  So at best this is an incomplete segment of
> > > > > > the initial snapshot of the device, but that presumes how the vendor
> > > > > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > > > > driver reject it, but I think we should consider it undefined and
> > > > > > vendor specific relative to the migration interface.
> > > > > >       
> > > > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > > > > driver,
> > > > > > > >    a. vendor driver should first query a whole snapshot of device memory
> > > > > > > >    (let's use this term to represent the device's standalone memory for now),
> > > > > > > >    b. vendor driver returns a chunk of the data just queried to userspace,
> > > > > > > >    while recording the current position in the data.
> > > > > > > >    c. vendor driver finds all data just queried has finished transmitting to
> > > > > > > >    userspace, and queries only dirty data in device memory now.
> > > > > > > >    d. vendor driver returns a chunk of the data just queried (this time dirty
> > > > > > > >    data) to userspace while recording the current position in the data.
> > > > > > > >    e. if all data is transmitted to userspace and still GET_BUFFERs come from
> > > > > > > >    userspace, vendor driver starts another round of dirty data query.      
> > > > > > 
> > > > > > This is a valid vendor driver approach, but it's outside the scope of
> > > > > > the interface definition.  A vendor driver could also decide to not
> > > > > > provide any data until both stopped and logging are set and then
> > > > > > provide a fixed, final snapshot.  The interface supports either
> > > > > > approach by defining the protocol to interact with it.
> > > > > >       
> > > > > > > > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > > > > > > > driver,
> > > > > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > > > > >    them until all transmitted.
> > > > > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > > > > >    c. at last, vendor driver queries device config data (which has to be
> > > > > > > >    queried at last and sent once) and transmits them.      
> > > > > > 
> > > > > > This seems broken, the vendor driver is presuming the user intentions.
> > > > > > If logging is unset, we return to bullet 1, reading data is undefined
> > > > > > and vendor specific.  It's outside of the session.
> > > > > >       
> > > > > > > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > > > > > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > > > > > > vendor driver has to know more about qemu's migration state, like migration
> > > > > > > > called and failed. Do you think that's acceptable?      
> > > > > > 
> > > > > > If migration aborts, logging is cleared and the device continues
> > > > > > operation.  If a new migration is started, the session is initiated by
> > > > > > enabling logging.  Sound reasonable?  Thanks,
> > > > > >      
> > > > > 
> > > > > For the flow, I still have a question.
> > > > > There are 2 approaches below, which one do you prefer?
> > > > > 
> > > > > Approach A, in precopy stage, the sequence is
> > > > > 
> > > > > (1)
> > > > > .save_live_pending --> return whole snapshot size
> > > > > .save_live_iterate --> save whole snapshot
> > > > > 
> > > > > (2)
> > > > > .save_live_pending --> get dirty data, return dirty data size
> > > > > .save_live_iterate --> save all dirty data
> > > > > 
> > > > > (3)
> > > > > .save_live_pending --> get dirty data again, return dirty data size
> > > > > .save_live_iterate --> save all dirty data
> > > > > 
> > > > > 
> > > > > Approach B, in precopy stage, the sequence is
> > > > > (1)
> > > > > .save_live_pending --> return whole snapshot size
> > > > > .save_live_iterate --> save part of snapshot
> > > > > 
> > > > > (2)
> > > > > .save_live_pending --> return rest part of whole snapshot size +
> > > > >                               current dirty data size
> > > > > .save_live_iterate --> save part of snapshot 
> > > > > 
> > > > > (3) repeat (2) until whole snapshot saved.
> > > > > 
> > > > > (4) 
> > > > > .save_live_pending --> get dirty data and return current dirty data size
> > > > > .save_live_iterate --> save part of dirty data
> > > > > 
> > > > > (5)
> > > > > .save_live_pending --> return rest part of dirty data size +
> > > > > 			     delta size of dirty data
> > > > > .save_live_iterate --> save part of dirty data
> > > > > 
> > > > > (6)
> > > > > repeat (5) until precopy stops    
> > > > 
> > > > I don't really understand the question here.  If the vendor driver's
> > > > approach is to send a full snapshot followed by iterations of dirty
> > > > data, then when the user enables logging and reads the counter for
> > > > available data it should report the (size of the snapshot).  The next
> > > > time the user reads the counter, it should report the size of the
> > > > (size of the snapshot) - (what the user has already read) + (size of
> > > > the dirty data since the snapshot).  As the user continues to read past
> > > > the snapshot data, the available data counter transitions to reporting
> > > > only the size of the remaining dirty data, which is monotonically
> > > > increasing.  I guess this would be more similar to your approach B,
> > > > which seems to suggest that the interface needs to continue providing
> > > > data regardless of whether the user fully exhausted the available data
> > > > from the previous cycle.  Thanks,
> > > >    
> > > 
> > > Right. But regarding the VFIO migration code in QEMU, rather than saving
> > > one chunk each time, do you think it is better to exhaust all the data
> > > reported by .save_live_pending in each .save_live_iterate callback? (Even
> > > though the vendor driver will handle the case where userspace cannot
> > > exhaust all data, QEMU's VFIO code can still try to save as much available
> > > data as it can each time.)
> > 
> > Don't you suspect that some devices might have state that's too large
> > to process in each iteration?  I expect we'll need to use heuristics on
> > data size or time spent on each iteration round such that some devices
> > might be able to fully process their pending data while others will
> > require multiple passes or make up the balance once we've entered stop
> > and copy.  Thanks,
> >  
> hi Alex
> What about looping and draining the pending data in each iteration? :)

How is this question different than your previous question?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-18  3:09                       ` Alex Williamson
@ 2019-03-18  3:27                         ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-03-18  3:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, arei.gonglei, felipe, Wang, Zhi A, Tian, Kevin, dgilbert,
	intel-gvt-dev,

On Mon, Mar 18, 2019 at 11:09:04AM +0800, Alex Williamson wrote:
> On Sun, 17 Mar 2019 22:51:27 -0400
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> > > On Thu, 14 Mar 2019 19:05:06 -0400
> > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:  
> > > > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:    
> > > > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > > > > >       
> > > > > > > > hi Alex
> > > > > > > > Any comments to the sequence below?
> > > > > > > > 
> > > > > > > > Actually we have some concerns and suggestions about userspace-opaque migration
> > > > > > > > data.
> > > > > > > > 
> > > > > > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > > > > > migration. 
> > > > > > > >    e.g. vendor driver has to know state (running + not logging) should not
> > > > > > > >    return any data, and state (running + logging) should return whole
> > > > > > > >    snapshot first and dirty later. it also has to know qemu migration will
> > > > > > > >    not call GET_BUFFER in state (running + not logging), otherwise, it has
> > > > > > > >    to adjust its behavior.      
> > > > > > > 
> > > > > > > This all just sounds like defining the protocol we expect with the
> > > > > > > interface.  For instance if we define a session as beginning when
> > > > > > > logging is enabled and ending when the device is stopped and the
> > > > > > > interface reports no more data is available, then we can state that any
> > > > > > > partial accumulation of data is incomplete relative to migration.  If
> > > > > > > userspace wants to initiate a new migration stream, they can simply
> > > > > > > toggle logging.  How the vendor driver provides the data during the
> > > > > > > session is not defined, but beginning the session with a snapshot
> > > > > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > > > > approach.
> > > > > > >       
> > > > > > > > 2. vendor driver cannot ensure userspace get all the data it intends to
> > > > > > > > save in pre-copy phase.
> > > > > > > >   e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > > > > > >   data in previous phase.      
> > > > > > > 
> > > > > > > First, I don't think the device has control of when QEMU switches from
> > > > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > > > transition at any point.  However, it seems a simply data available
> > > > > > > counter provides an indication of when it might be optimal to make such
> > > > > > > a transition.  If a vendor driver follows a scheme as above, the
> > > > > > > available data counter would indicate a large value, the entire initial
> > > > > > > snapshot of the device.  As the migration continues and pages are
> > > > > > > dirtied, the device would reach a steady state amount of data
> > > > > > > available, depending on the guest activity.  This could indicate to the
> > > > > > > user to stop the device.  The migration stream would not be considered
> > > > > > > completed until the available data counter reaches zero while the
> > > > > > > device is in the stopped|logging state.
> > > > > > >       
> > > > > > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > > > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > > > > >    migrate-out is for source side when migration starts. together with
> > > > > > > >    state running and stopped, it can substitute state logging.
> > > > > > > >    migrate-in is for target side.      
> > > > > > > 
> > > > > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > > > > we still need logging to indicate sessions.  I'd also assume that
> > > > > > > logging implies some overhead for the vendor driver.
> > > > > > >      
> > > > > > OK. If you prefer logging, I'm OK with it. I just found migrate-in and
> > > > > > migrate-out more universal against hardware requirement changes.
> > > > > >     
> > > > > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:      
> > > > > > > > > hi Alex
> > > > > > > > > thanks for your reply.
> > > > > > > > > 
> > > > > > > > > So, if we choose migration data to be userspace opaque, do you think below
> > > > > > > > > sequence is the right behavior for vendor driver to follow:
> > > > > > > > > 
> > > > > > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > > > > > vendor driver,  vendor driver should reject and return 0.      
> > > > > > > 
> > > > > > > What would this state mean otherwise?  If we're not logging then it
> > > > > > > should not be expected that we can construct dirtied data from a
> > > > > > > previous read of the state before logging was enabled (it would be
> > > > > > > outside of the "session").  So at best this is an incomplete segment of
> > > > > > > the initial snapshot of the device, but that presumes how the vendor
> > > > > > > driver constructs the data.  I wouldn't necessarily mandate the vendor
> > > > > > > driver reject it, but I think we should consider it undefined and
> > > > > > > vendor specific relative to the migration interface.
> > > > > > >       
> > > > > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > > > > > driver,
> > > > > > > > >    a. vendor driver should first query a whole snapshot of device memory
> > > > > > > > >    (let's use this term to represent device's standalone memory for now),
> > > > > > > > >    b. vendor driver returns a chunk of data just queried to userspace,
> > > > > > > > >    while recording current pos in data.
> > > > > > > > >    c. vendor driver finds all data just queried is finished transmitting to
> > > > > > > > >    userspace, and queries only dirty data in device memory now.
> > > > > > > > >    d. vendor driver returns a chunk of data just queried (this time dirty
> > > > > > > > >    data) to userspace while recording current pos in data
> > > > > > > > >    e. if all data is transmitted to userspace and still GET_BUFFERs come from
> > > > > > > > >    userspace, vendor driver starts another round of dirty data query.      
> > > > > > > 
> > > > > > > This is a valid vendor driver approach, but it's outside the scope of
> > > > > > > the interface definition.  A vendor driver could also decide to not
> > > > > > > provide any data until both stopped and logging are set and then
> > > > > > > provide a fixed, final snapshot.  The interface supports either
> > > > > > > approach by defining the protocol to interact with it.
> > > > > > >       
> > > > > > > > > 3. if LOGGING state is unset then, and userpace calls GET_BUFFER to vendor
> > > > > > > > > driver,
> > > > > > > > >    a. if vendor driver finds there's previously untransmitted data, returns
> > > > > > > > >    them until all transmitted.
> > > > > > > > >    b. vendor driver then queries dirty data again and transmits them.
> > > > > > > > >    c. at last, vendor driver queries device config data (which has to be
> > > > > > > > >    queried at last and sent once) and transmits them.      
> > > > > > > 
> > > > > > > This seems broken, the vendor driver is presuming the user intentions.
> > > > > > > If logging is unset, we return to bullet 1, reading data is undefined
> > > > > > > and vendor specific.  It's outside of the session.
> > > > > > >       
> > > > > > > > > for the 1 bullet, if LOGGING state is firstly set and migration aborts
> > > > > > > > > then,  vendor driver has to be able to detect that condition. so seemingly,
> > > > > > > > > vendor driver has to know more qemu's migration state, like migration
> > > > > > > > > called and failed. Do you think that's acceptable?      
> > > > > > > 
> > > > > > > If migration aborts, logging is cleared and the device continues
> > > > > > > operation.  If a new migration is started, the session is initiated by
> > > > > > > enabling logging.  Sound reasonable?  Thanks,
> > > > > > >      
> > > > > > 
> > > > > > For the flow, I still have a question.
> > > > > > There are 2 approaches below, which one do you prefer?
> > > > > > 
> > > > > > Approach A, in precopy stage, the sequence is
> > > > > > 
> > > > > > (1)
> > > > > > .save_live_pending --> return whole snapshot size
> > > > > > .save_live_iterate --> save whole snapshot
> > > > > > 
> > > > > > (2)
> > > > > > .save_live_pending --> get dirty data, return dirty data size
> > > > > > .save_live_iterate --> save all dirty data
> > > > > > 
> > > > > > (3)
> > > > > > .save_live_pending --> get dirty data again, return dirty data size
> > > > > > .save_live_iterate --> save all dirty data
> > > > > > 
> > > > > > 
> > > > > > Approach B, in precopy stage, the sequence is
> > > > > > (1)
> > > > > > .save_live_pending --> return whole snapshot size
> > > > > > .save_live_iterate --> save part of snapshot
> > > > > > 
> > > > > > (2)
> > > > > > .save_live_pending --> return rest part of whole snapshot size +
> > > > > >                               current dirty data size
> > > > > > .save_live_iterate --> save part of snapshot 
> > > > > > 
> > > > > > (3) repeat (2) until whole snapshot saved.
> > > > > > 
> > > > > > (4) 
> > > > > > .save_live_pending --> get dirty data and return current dirty data size
> > > > > > .save_live_iterate --> save part of dirty data
> > > > > > 
> > > > > > (5)
> > > > > > .save_live_pending --> return rest part of dirty data size +
> > > > > > 			     delta size of dirty data
> > > > > > .save_live_iterate --> save part of dirty data
> > > > > > 
> > > > > > (6)
> > > > > > repeat (5) until precopy stops    
> > > > > 
> > > > > I don't really understand the question here.  If the vendor driver's
> > > > > approach is to send a full snapshot followed by iterations of dirty
> > > > > data, then when the user enables logging and reads the counter for
> > > > > available data it should report the (size of the snapshot).  The next
> > > > > time the user reads the counter, it should report the size of the
> > > > > (size of the snapshot) - (what the user has already read) + (size of
> > > > > the dirty data since the snapshot).  As the user continues to read past
> > > > > the snapshot data, the available data counter transitions to reporting
> > > > > only the size of the remaining dirty data, which is monotonically
> > > > > increasing.  I guess this would be more similar to your approach B,
> > > > > which seems to suggest that the interface needs to continue providing
> > > > > data regardless of whether the user fully exhausted the available data
> > > > > from the previous cycle.  Thanks,
> > > > >    
> > > > 
> > > > Right. But regarding the VFIO migration code in QEMU, rather than saving
> > > > one chunk each time, do you think it is better to exhaust all the data
> > > > reported by .save_live_pending in each .save_live_iterate callback? (Even
> > > > though the vendor driver will handle the case where userspace cannot
> > > > exhaust all data, QEMU's VFIO code can still try to save as much available
> > > > data as it can each time.)
> > > 
> > > Don't you suspect that some devices might have state that's too large
> > > to process in each iteration?  I expect we'll need to use heuristics on
> > > data size or time spent on each iteration round such that some devices
> > > might be able to fully process their pending data while others will
> > > require multiple passes or make up the balance once we've entered stop
> > > and copy.  Thanks,
> > >  
> > hi Alex
> > What about looping and draining the pending data in each iteration? :)
> 
> How is this question different than your previous question?  Thanks,
> 
Sorry, I misunderstood your meaning in the last mail.
You are right: sometimes one device may have too much pending data to save
in each iteration.
Although draining the pending data in each iteration is feasible, because the
pre-copy phase is allowed to be slow, using a heuristic max size in each
iteration is also reasonable.
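Such a per-iteration cap could be sketched as below; the function name, the 64MB budget, and the 4MB chunk size are illustrative assumptions, not QEMU's actual migration code:

```c
#include <stdint.h>

#define ITER_BUDGET   (64ULL << 20)   /* heuristic max bytes per iteration */
#define REGION_CHUNK  (4ULL << 20)    /* size of the device state region   */

/* Drain at most ITER_BUDGET bytes of the vendor driver's pending data
 * per .save_live_iterate round, in region-sized chunks, and report the
 * leftover so the next round (or stop-and-copy) can pick it up. */
static uint64_t save_live_iterate_once(uint64_t pending)
{
    uint64_t budget = ITER_BUDGET;

    while (pending && budget) {
        uint64_t chunk = REGION_CHUNK;

        if (chunk > pending)
            chunk = pending;
        if (chunk > budget)
            chunk = budget;
        /* ... GET_BUFFER: read `chunk` bytes from the device state
         * region and forward them to the migration stream ... */
        pending -= chunk;
        budget  -= chunk;
    }
    return pending;
}
```

Whatever the exact heuristic, the key property is that leftover pending data is reported again by the next .save_live_pending call, so nothing is lost by stopping a round early.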

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-02-20 11:42           ` [Qemu-devel] " Cornelia Huck
  (?)
  (?)
@ 2019-03-27  6:35           ` Zhao Yan
  2019-03-27 20:18             ` Dr. David Alan Gilbert
  -1 siblings, 1 reply; 133+ messages in thread
From: Zhao Yan @ 2019-03-27  6:35 UTC (permalink / raw)
  To: Cornelia Huck, alex.williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	alex.williamson,

On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > a device that has less device memory ?  
> > > > Actually it's still an open for VFIO migration. Need to think about
> > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > along with version?).  
> > 
> > We must keep the hardware generation the same within one POD of public
> > cloud providers. But we are still thinking about live migration from a
> > lower generation of hardware to a higher generation.
> 
> Agreed, lower->higher is the one direction that might make sense to
> support.
> 
> But regardless of that, I think we need to make sure that incompatible
> devices/versions fail directly instead of failing in a subtle, hard to
> debug way. Might be useful to do some initial sanity checks in libvirt
> as well.
> 
> How easy is it to obtain that information in a form that can be
> consumed by higher layers? Can we find out the device type at least?
> What about some kind of revision?
hi Alex and Cornelia
for device compatibility, do you think it's a good idea to use "version"
and "device version" fields?

version field: identifies the live migration interface's version. It can
have a sort of backward compatibility, e.g. requiring target machine's
version >= source machine's version.

device_version field consists of two parts:
1. vendor id: it takes 32 bits, e.g. 0x8086.
2. vendor proprietary string: it can be any string that a vendor driver
thinks can identify a source device, e.g. pciid + mdev type.
The "vendor id" is to avoid overlap between vendors' proprietary strings.


struct vfio_device_state_ctl {
	__u32 version;                                  /* ro */
	__u8 device_version[MAX_DEVICE_VERSION_LEN];    /* ro */
	struct {
		__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE */
		...
	} data;
	...
};

Then, an action IS_COMPATIBLE is added to check device compatibility.

The flow to figure out whether a source device is migratable to a target
device is as follows:
1. in the source side's .save_setup, save the source device's device_version
string.
2. in the target side's .load_state, load the source device's device_version
string, write it to the data region, and call the IS_COMPATIBLE action to ask
the vendor driver whether the source device is compatible with it.

The advantage of adding an IS_COMPATIBLE action is that the vendor driver can
maintain a compatibility table and decide whether the source device is
compatible with the target device according to its proprietary table.
In the device_version string, the vendor driver only has to describe the
source device as precisely as possible, and it is up to the vendor driver on
the target side to figure out whether the two are compatible.

Thanks
Yan
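As a rough illustration of the flow above, the target-side check might look like this; the struct layout, helper names, and the trivial stand-in vendor table are all assumptions, not an existing interface:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the proposed device_version compatibility check.  The
 * field names and the proprietary string format are assumptions for
 * illustration; the real format would be defined by the vendor driver. */
struct device_version {
    uint32_t vendor_id;     /* e.g. 0x8086 */
    char detail[64];        /* vendor proprietary string, e.g. pciid + mdev type */
};

/* vendor-specific compatibility table lookup (stubbed here: only
 * identical strings are considered compatible) */
static int vendor_check(const char *src, const char *dst)
{
    return strcmp(src, dst) == 0;
}

/* target-side IS_COMPATIBLE: reject across vendors outright, then
 * defer to the vendor driver's own table for the proprietary part */
static int is_compatible(const struct device_version *src,
                         const struct device_version *dst)
{
    if (src->vendor_id != dst->vendor_id)
        return 0;
    return vendor_check(src->detail, dst->detail);
}
```

The point of the split is that userspace never interprets the proprietary string; it only ferries it from source to target, where the vendor driver renders the verdict.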

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-27  6:35           ` Zhao Yan
@ 2019-03-27 20:18             ` Dr. David Alan Gilbert
  2019-03-27 22:10               ` Alex Williamson
  0 siblings, 1 reply; 133+ messages in thread
From: Dr. David Alan Gilbert @ 2019-03-27 20:18 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, alex.williamson, intel-gvt-dev,
	Liu, Changpeng

* Zhao Yan (yan.y.zhao@intel.com) wrote:
> On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > a device that has less device memory ?  
> > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > along with version?).  
> > > 
> > > We must keep the hardware generation the same within one POD of public
> > > cloud providers. But we are still thinking about live migration from a
> > > lower generation of hardware to a higher generation.
> > 
> > Agreed, lower->higher is the one direction that might make sense to
> > support.
> > 
> > But regardless of that, I think we need to make sure that incompatible
> > devices/versions fail directly instead of failing in a subtle, hard to
> > debug way. Might be useful to do some initial sanity checks in libvirt
> > as well.
> > 
> > How easy is it to obtain that information in a form that can be
> > consumed by higher layers? Can we find out the device type at least?
> > What about some kind of revision?
> hi Alex and Cornelia
> for device compatibility, do you think it's a good idea to use "version"
> and "device version" fields?
> 
> version field: identify live migration interface's version. it can have a
> sort of backward compatibility, like target machine's version >= source
> machine's version. something like that.
> 
> device_version field consists of two parts:
> 1. vendor id : it takes 32 bits. e.g. 0x8086.
> 2. vendor proprietary string: it can be any string that a vendor driver
> thinks can identify a source device. e.g. pciid + mdev type.
> "vendor id" is to avoid overlap of "vendor proprietary string".
> 
> 
> struct vfio_device_state_ctl {
>      __u32 version;            /* ro */
>      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
>      struct {
>      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> 	...
>      }data;
>      ...
>  };
> 
> Then, an action IS_COMPATIBLE is added to check device compatibility.
> 
> The flow to figure out whether a source device is migratable to target device
> is like that:
> 1. in source side's .save_setup, save source device's device_version string
> 2. in target side's .load_state, load source device's device version string
> and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> to check whether the source device is compatible to it.
> 
> The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> maintain a compatibility table and decide whether source device is compatible
> to target device according to its proprietary table.
> In device_version string, vendor driver only has to describe the source
> device as elaborately as possible and resorts to vendor driver in target side
> to figure out whether they are compatible.

It would also be good if 'IS_COMPATIBLE' were somehow callable
externally - so we would be able to answer a question like 'can we
migrate this VM to this host?' from the management layer before it
actually starts the migration.

Dave

> Thanks
> Yan
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-27 20:18             ` Dr. David Alan Gilbert
@ 2019-03-27 22:10               ` Alex Williamson
  2019-03-28  8:36                 ` Zhao Yan
  2019-04-01  8:14                   ` [Qemu-devel] " Cornelia Huck
  0 siblings, 2 replies; 133+ messages in thread
From: Alex Williamson @ 2019-03-27 22:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Zhao Yan, intel-gvt-dev, Liu,
	Changpeng

On Wed, 27 Mar 2019 20:18:54 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > a device that has less device memory ?    
> > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > along with version?).    
> > > > 
> > > > We must keep the hardware generation the same within one POD of public
> > > > cloud providers. But we are still thinking about live migration from a
> > > > lower generation of hardware to a higher generation.  
> > > 
> > > Agreed, lower->higher is the one direction that might make sense to
> > > support.
> > > 
> > > But regardless of that, I think we need to make sure that incompatible
> > > devices/versions fail directly instead of failing in a subtle, hard to
> > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > as well.
> > > 
> > > How easy is it to obtain that information in a form that can be
> > > consumed by higher layers? Can we find out the device type at least?
> > > What about some kind of revision?  
> > hi Alex and Cornelia
> > for device compatibility, do you think it's a good idea to use "version"
> > and "device version" fields?
> > 
> > version field: identify live migration interface's version. it can have a
> > sort of backward compatibility, like target machine's version >= source
> > machine's version. something like that.

Don't we essentially already have this via the device specific region?
The struct vfio_info_cap_header includes id and version fields, so we
can declare a migration id and increment the version for any
incompatible changes to the protocol.

> > 
> > device_version field consists of two parts:
> > 1. vendor id : it takes 32 bits. e.g. 0x8086.

Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
suggest we use a bit to flag it as such so we can reserve that portion
of the 32bit address space.  See for example:

#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
#define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)

For vendor specific regions.
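For illustration, the flag-and-mask encoding works like this; the two macros are the existing VFIO uapi definitions quoted above, while the helper functions are just a sketch:

```c
#include <stdint.h>

/* From the VFIO uapi: the high bit flags the region type as carrying
 * a PCI vendor ID, and the low 16 bits carry the ID itself, reserving
 * that slice of the 32-bit type space. */
#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1U << 31)
#define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)

/* encode a PCI vendor ID (e.g. Intel, 0x8086) as a region type */
static uint32_t pci_vendor_region_type(uint16_t vendor_id)
{
    return VFIO_REGION_TYPE_PCI_VENDOR_TYPE | vendor_id;
}

/* recover the vendor ID from a region type */
static uint16_t region_type_to_vendor(uint32_t type)
{
    return type & VFIO_REGION_TYPE_PCI_VENDOR_MASK;
}
```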

> > 2. vendor proprietary string: it can be any string that a vendor driver
> > thinks can identify a source device. e.g. pciid + mdev type.
> > "vendor id" is to avoid overlap of "vendor proprietary string".
> > 
> > 
> > struct vfio_device_state_ctl {
> >      __u32 version;            /* ro */
> >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> >      struct {
> >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > 	...
> >      }data;
> >      ...
> >  };

We have a buffer area where we can read and write data from the vendor
driver, why would we have this IS_COMPATIBLE protocol to write the
device version string but use a static fixed length version string in
the control header to read it?  IOW, let's use GET_VERSION,
CHECK_VERSION actions that make use of the buffer area and allow vendor
specific version information length.
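A GET_VERSION / CHECK_VERSION exchange through the buffer area might be sketched as below; the action values, struct layout, buffer size, and version string format are assumptions for illustration only:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: exchange a variable-length version string through the data
 * buffer instead of a fixed-size field in the control header.  All
 * names and values here are illustrative assumptions. */
enum { GET_VERSION = 10, CHECK_VERSION = 11 };

struct state_ctl {
    uint32_t action;
    uint32_t size;       /* bytes valid in buf */
    char     buf[256];   /* stands in for the data buffer area */
};

static const char local_version[] = "0x8086:pciid-1234:mdev-type-2";

/* vendor-driver side of the protocol */
static int handle_action(struct state_ctl *ctl)
{
    switch (ctl->action) {
    case GET_VERSION:    /* driver fills the buffer, any length */
        ctl->size = sizeof(local_version);
        memcpy(ctl->buf, local_version, ctl->size);
        return 0;
    case CHECK_VERSION:  /* driver judges the peer's string itself */
        return strncmp(ctl->buf, local_version, ctl->size) == 0 ? 0 : -1;
    default:
        return -1;
    }
}
```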

> > 
> > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > 
> > The flow to figure out whether a source device is migratable to target device
> > is like that:
> > 1. in source side's .save_setup, save source device's device_version string
> > 2. in target side's .load_state, load source device's device version string
> > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > to check whether the source device is compatible to it.
> > 
> > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > maintain a compatibility table and decide whether source device is compatible
> > to target device according to its proprietary table.
> > In device_version string, vendor driver only has to describe the source
> > device as elaborately as possible and resorts to vendor driver in target side
> > to figure out whether they are compatible.  

I agree, it's too complicated and restrictive to try to create an
interface for the user to determine compatibility, let the driver
declare it compatible or not.

> It would also be good if the 'IS_COMPATIBLE' was somehow callable
> externally - so we could be able to answer a question like 'can we
> migrate this VM to this host' - from the management layer before it
> actually starts the migration.

I think we'd need to mirror this capability in sysfs to support that,
or create a qmp interface through QEMU that the device owner could make
the request on behalf of the management layer.  Getting access to the
vfio device requires an iommu context that's already in use by the
device owner, we have no intention of supporting a model that allows
independent tasks access to a device.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-27 22:10               ` Alex Williamson
@ 2019-03-28  8:36                 ` Zhao Yan
  2019-03-28  9:21                   ` Erik Skultety
  2019-04-01  8:14                   ` [Qemu-devel] " Cornelia Huck
  1 sibling, 1 reply; 133+ messages in thread
From: Zhao Yan @ 2019-03-28  8:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	intel-gvt-dev,

hi Alex and Dave,
Thanks for your replies.
Please see my comments inline.

On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> On Wed, 27 Mar 2019 20:18:54 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > a device that has less device memory ?    
> > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > along with verion ?).    
> > > > > 
> > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > providers. But we still think about the live migration between from the the lower
> > > > > generation of hardware migrated to the higher generation.  
> > > > 
> > > > Agreed, lower->higher is the one direction that might make sense to
> > > > support.
> > > > 
> > > > But regardless of that, I think we need to make sure that incompatible
> > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > as well.
> > > > 
> > > > How easy is it to obtain that information in a form that can be
> > > > consumed by higher layers? Can we find out the device type at least?
> > > > What about some kind of revision?  
> > > hi Alex and Cornelia
> > > for device compatibility, do you think it's a good idea to use "version"
> > > and "device version" fields?
> > > 
> > > version field: identify live migration interface's version. it can have a
> > > sort of backward compatibility, like target machine's version >= source
> > > machine's version. something like that.
> 
> Don't we essentially already have this via the device specific region?
> The struct vfio_info_cap_header includes id and version fields, so we
> can declare a migration id and increment the version for any
> incompatible changes to the protocol.
Yes, good idea!
So, what about declaring a new cap like the below?
    #define VFIO_REGION_INFO_CAP_MIGRATION 4
    struct vfio_region_info_cap_migration {
        struct vfio_info_cap_header header;
        __u32 device_version_len;
        __u8  device_version[];
    };


> > > 
> > > device_version field consists two parts:
> > > 1. vendor id : it takes 32 bits. e.g. 0x8086.
> 
> Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> suggest we use a bit to flag it as such so we can reserve that portion
> of the 32bit address space.  See for example:
> 
> #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> 
> For vendor specific regions.
Yes, use the PCI vendor ID.
You are right, we need to use the highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
to identify it as a PCI vendor ID.
Thanks for pointing it out.
But I have a question: what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
Why is it 0xffff? I searched the QEMU and kernel code and did not find
anywhere that uses it.


> > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > thinks can identify a source device. e.g. pciid + mdev type.
> > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > 
> > > 
> > > struct vfio_device_state_ctl {
> > >      __u32 version;            /* ro */
> > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > >      struct {
> > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > 	...
> > >      }data;
> > >      ...
> > >  };
> 
> We have a buffer area where we can read and write data from the vendor
> driver, why would we have this IS_COMPATIBLE protocol to write the
> device version string but use a static fixed length version string in
> the control header to read it?  IOW, let's use GET_VERSION,
> CHECK_VERSION actions that make use of the buffer area and allow vendor
> specific version information length.
You are right, such a static fixed-length version string is bad :)
To get the device version, which approach below do you think is better?
1. use a GET_VERSION action, and read from the region buffer
2. get it when querying the cap VFIO_REGION_INFO_CAP_MIGRATION

It seems approach 1 is better, and the cap VFIO_REGION_INFO_CAP_MIGRATION is
only for checking the migration interface's version?

> > > 
> > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > 
> > > The flow to figure out whether a source device is migratable to target device
> > > is like that:
> > > 1. in source side's .save_setup, save source device's device_version string
> > > 2. in target side's .load_state, load source device's device version string
> > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > to check whether the source device is compatible to it.
> > > 
> > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > maintain a compatibility table and decide whether source device is compatible
> > > to target device according to its proprietary table.
> > > In device_version string, vendor driver only has to describe the source
> > > device as elaborately as possible and resorts to vendor driver in target side
> > > to figure out whether they are compatible.  
> 
> I agree, it's too complicated and restrictive to try to create an
> interface for the user to determine compatibility, let the driver
> declare it compatible or not.
:)

> > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > externally - so we could be able to answer a question like 'can we
> > migrate this VM to this host' - from the management layer before it
> > actually starts the migration.

So QEMU needs to expose two QMP/sysfs interfaces: GET_VERSION and CHECK_VERSION.
GET_VERSION returns a VM device's version string.
CHECK_VERSION takes a device version string as input and returns
compatible/non-compatible.
Do you think that's good?

> I think we'd need to mirror this capability in sysfs to support that,
> or create a qmp interface through QEMU that the device owner could make
> the request on behalf of the management layer.  Getting access to the
> vfio device requires an iommu context that's already in use by the
> device owner, we have no intention of supporting a model that allows
> independent tasks access to a device.  Thanks,
> Alex
>
Do you think two sysfs nodes under a device node are OK?
e.g.
/sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
/sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version

Thanks
Yan

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-28  8:36                 ` Zhao Yan
@ 2019-03-28  9:21                   ` Erik Skultety
  2019-03-28 16:04                     ` Alex Williamson
  0 siblings, 1 reply; 133+ messages in thread
From: Erik Skultety @ 2019-03-28  9:21 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, Yang, Ziye, mlevitsk, pasic,
	Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	Alex Williamson, intel-gvt-dev,

On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> hi Alex and Dave,
> Thanks for your replies.
> Please see my comments inline.
>
> On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:
> > On Wed, 27 Mar 2019 20:18:54 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >
> > > * Zhao Yan (yan.y.zhao@intel.com) wrote:
> > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > a device that has less device memory ?
> > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > along with verion ?).
> > > > > >
> > > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > > providers. But we still think about the live migration between from the the lower
> > > > > > generation of hardware migrated to the higher generation.
> > > > >
> > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > support.
> > > > >
> > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > as well.
> > > > >
> > > > > How easy is it to obtain that information in a form that can be
> > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > What about some kind of revision?
> > > > hi Alex and Cornelia
> > > > for device compatibility, do you think it's a good idea to use "version"
> > > > and "device version" fields?
> > > >
> > > > version field: identify live migration interface's version. it can have a
> > > > sort of backward compatibility, like target machine's version >= source
> > > > machine's version. something like that.
> >
> > Don't we essentially already have this via the device specific region?
> > The struct vfio_info_cap_header includes id and version fields, so we
> > can declare a migration id and increment the version for any
> > incompatible changes to the protocol.
> yes, good idea!
> so, what about declaring below new cap?
>     #define VFIO_REGION_INFO_CAP_MIGRATION 4
>     struct vfio_region_info_cap_migration {
>         struct vfio_info_cap_header header;
>         __u32 device_version_len;
>         __u8  device_version[];
>     };
>
>
> > > >
> > > > device_version field consists two parts:
> > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.
> >
> > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > suggest we use a bit to flag it as such so we can reserve that portion
> > of the 32bit address space.  See for example:
> >
> > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> >
> > For vendor specific regions.
> Yes, use PCI vendor ID.
> you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> to identify it's a PCI ID.
> Thanks for pointing it out.
> But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> uses it.
>
>
> > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > >
> > > >
> > > > struct vfio_device_state_ctl {
> > > >      __u32 version;            /* ro */
> > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > >      struct {
> > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > 	...
> > > >      }data;
> > > >      ...
> > > >  };
> >
> > We have a buffer area where we can read and write data from the vendor
> > driver, why would we have this IS_COMPATIBLE protocol to write the
> > device version string but use a static fixed length version string in
> > the control header to read it?  IOW, let's use GET_VERSION,
> > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > specific version information length.
> you are right, such static fixed length version string is bad :)
> To get device version, do you think which approach below is better?
> 1. use GET_VERSION action, and read from region buffer
> 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
>
> seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> for checking migration interface's version?
>
> > > >
> > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > >
> > > > The flow to figure out whether a source device is migratable to target device
> > > > is like that:
> > > > 1. in source side's .save_setup, save source device's device_version string
> > > > 2. in target side's .load_state, load source device's device version string
> > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > to check whether the source device is compatible to it.
> > > >
> > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > maintain a compatibility table and decide whether source device is compatible
> > > > to target device according to its proprietary table.
> > > > In device_version string, vendor driver only has to describe the source
> > > > device as elaborately as possible and resorts to vendor driver in target side
> > > > to figure out whether they are compatible.
> >
> > I agree, it's too complicated and restrictive to try to create an
> > interface for the user to determine compatibility, let the driver
> > declare it compatible or not.
> :)
>
> > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > externally - so we could be able to answer a question like 'can we
> > > migrate this VM to this host' - from the management layer before it
> > > actually starts the migration.
>
> so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> GET_VERSION returns a vm's device's version string.
> CHECK_VERSION's input is device version string and return
> compatible/non-compatible.
> Do you think it's good?
>
> > I think we'd need to mirror this capability in sysfs to support that,
> > or create a qmp interface through QEMU that the device owner could make
> > the request on behalf of the management layer.  Getting access to the
> > vfio device requires an iommu context that's already in use by the
> > device owner, we have no intention of supporting a model that allows
> > independent tasks access to a device.  Thanks,
> > Alex
> >
> do you think two sysfs nodes under a device node is ok?
> e.g.
> /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version

Why do you need both sysfs and QMP at the same time? I can see it gives us some
flexibility, but is there something more to that?

Normally, I'd prefer a QMP interface from libvirt's perspective (with an
appropriate capability that libvirt can check for QEMU support), because I
imagine large nodes having a bunch of GPUs with different revisions which
might not be backwards compatible. Libvirt might query the version string on
the source and check it on the destination via QMP, such that QEMU, talking
to the driver, would return either a list of physical devices or a single
physical device to which we can migrate; neither QEMU nor libvirt knows
that, only the driver does, so that's important information, much better
than looping through all the devices and trying to find one that is
compatible. However, you might have a hard time making all the necessary
changes introspectable in QMP: a new command would be fine, but if you also
wanted to extend, say, vfio-pci options, IIRC those would not appear in the
QAPI schema and libvirt would not be able to detect support for them.

On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
much, as it still carries the burden of being able to check this only at the
time of migration, which e.g. OpenStack would like to know long before that.

So, having sysfs attributes would work for both libvirt (even though libvirt
would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
have to figure out how to create the mappings between compatible devices across
several nodes which are non-uniform.

Regards,
Erik


* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-28  9:21                   ` Erik Skultety
@ 2019-03-28 16:04                     ` Alex Williamson
  2019-03-29  2:47                       ` Zhao Yan
  0 siblings, 1 reply; 133+ messages in thread
From: Alex Williamson @ 2019-03-28 16:04 UTC (permalink / raw)
  To: Erik Skultety
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, Yang, Ziye, mlevitsk, pasic,
	Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Zhao Yan, Dr. David Alan Gilbert,
	intel-gvt-dev, Liu, Changpeng

On Thu, 28 Mar 2019 10:21:38 +0100
Erik Skultety <eskultet@redhat.com> wrote:

> On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > hi Alex and Dave,
> > Thanks for your replies.
> > Please see my comments inline.
> >
> > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:  
> > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > >  
> > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > a device that has less device memory ?  
> > > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > along with verion ?).  
> > > > > > >
> > > > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > > > providers. But we still think about the live migration between from the the lower
> > > > > > > generation of hardware migrated to the higher generation.  
> > > > > >
> > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > support.
> > > > > >
> > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > as well.
> > > > > >
> > > > > > How easy is it to obtain that information in a form that can be
> > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > What about some kind of revision?  
> > > > > hi Alex and Cornelia
> > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > and "device version" fields?
> > > > >
> > > > > version field: identify live migration interface's version. it can have a
> > > > > sort of backward compatibility, like target machine's version >= source
> > > > > machine's version. something like that.  
> > >
> > > Don't we essentially already have this via the device specific region?
> > > The struct vfio_info_cap_header includes id and version fields, so we
> > > can declare a migration id and increment the version for any
> > > incompatible changes to the protocol.  
> > yes, good idea!
> > so, what about declaring below new cap?
> >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> >     struct vfio_region_info_cap_migration {
> >         struct vfio_info_cap_header header;
> >         __u32 device_version_len;
> >         __u8  device_version[];
> >     };

I'm not sure why we need a new region for everything, it seems this
could fit within the protocol of a single region.  This could simply be
a new action to retrieve the version where the protocol would report
the number of bytes available, just like the migration stream itself.

> > > > > device_version field consists two parts:
> > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.  
> > >
> > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > suggest we use a bit to flag it as such so we can reserve that portion
> > > of the 32bit address space.  See for example:
> > >
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > >
> > > For vendor specific regions.  
> > Yes, use PCI vendor ID.
> > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > to identify it's a PCI ID.
> > Thanks for pointing it out.
> > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> > uses it.

PCI vendor IDs are 16bits, it's just indicating that when the
PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.

> > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > >
> > > > >
> > > > > struct vfio_device_state_ctl {
> > > > >      __u32 version;            /* ro */
> > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > >      struct {
> > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > 	...
> > > > >      }data;
> > > > >      ...
> > > > >  };  
> > >
> > > We have a buffer area where we can read and write data from the vendor
> > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > device version string but use a static fixed length version string in
> > > the control header to read it?  IOW, let's use GET_VERSION,
> > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > specific version information length.  
> > you are right, such static fixed length version string is bad :)
> > To get device version, do you think which approach below is better?
> > 1. use GET_VERSION action, and read from region buffer
> > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> >
> > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > for checking migration interface's version?

I think 1 provides the most flexibility to the vendor driver.

> > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > >
> > > > > The flow to figure out whether a source device is migratable to target device
> > > > > is like that:
> > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > 2. in target side's .load_state, load source device's device version string
> > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > to check whether the source device is compatible to it.
> > > > >
> > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > to target device according to its proprietary table.
> > > > > In device_version string, vendor driver only has to describe the source
> > > > > device as elaborately as possible and resorts to vendor driver in target side
> > > > > to figure out whether they are compatible.  
> > >
> > > I agree, it's too complicated and restrictive to try to create an
> > > interface for the user to determine compatibility, let the driver
> > > declare it compatible or not.  
> > :)
> >  
> > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > externally - so we could be able to answer a question like 'can we
> > > > migrate this VM to this host' - from the management layer before it
> > > > actually starts the migration.  
> >
> > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > GET_VERSION returns a vm's device's version string.
> > CHECK_VERSION's input is device version string and return
> > compatible/non-compatible.
> > Do you think it's good?

That's the idea, but note that QEMU can only provide the QMP interface;
the sysfs interface would of course be provided as more of a direct
path from the vendor driver or mdev kernel layer.

> > > I think we'd need to mirror this capability in sysfs to support that,
> > > or create a qmp interface through QEMU that the device owner could make
> > > the request on behalf of the management layer.  Getting access to the
> > > vfio device requires an iommu context that's already in use by the
> > > device owner, we have no intention of supporting a model that allows
> > > independent tasks access to a device.  Thanks,
> > > Alex
> > >  
> > do you think two sysfs nodes under a device node is ok?
> > e.g.
> > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version  

I'd think it might live more in the mdev_supported_types area; wouldn't
we ideally like to know if a device is compatible even before it's
created?  For example, maybe:

/sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version

Where reading the sysfs attribute returns the version string and
writing a string into the attribute returns an errno for incompatibility.

> Why do you need both sysfs and QMP at the same time? I can see it gives us some
> flexibility, but is there something more to that?
>
> Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> bunch of GPUs with different revisions which might not be backwards compatible.
> Libvirt might query the version string on source and check it on dest via the
> QMP in a way that QEMU, talking to the driver, would return either a list or a
> single physical device to which we can migrate, because neither QEMU nor
> libvirt know that, only the driver does, so that's an important information
> rather than looping through all the devices and trying to find one that is
> compatible. However, you might have a hard time making all the necessary
> changes in QMP introspectable, a new command would be fine, but if you also
> wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> schema and libvirt would not be able to detect support for it.
> 
> On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> much, as it still carries the burden of being able to check this only at the
> time of migration, which e.g. OpenStack would like to know long before that.
> 
> So, having sysfs attributes would work for both libvirt (even though libvirt
> would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> have to figure out how to create the mappings between compatible devices across
> several nodes which are non-uniform.

Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
utility than a QMP interface.  For instance, we couldn't predetermine
whether an mdev type on a host is compatible if we need to first create the
device and launch a QEMU instance on it to get access to QMP.  So maybe
the question is whether we should bother with any sort of VFIO API to
do this comparison; perhaps only a sysfs interface is sufficient for a
complete solution.  The downside of not having a version API in the
user interface might be that QEMU on its own can only try a migration
and see if it fails; it wouldn't have the ability to test expected
compatibility without access to sysfs.  And maybe that's fine.  Thanks,

Alex


* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-28 16:04                     ` Alex Williamson
@ 2019-03-29  2:47                       ` Zhao Yan
  2019-03-29 14:26                         ` Alex Williamson
  0 siblings, 1 reply; 133+ messages in thread
From: Zhao Yan @ 2019-03-29  2:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, Erik Skultety, Yang, Ziye,
	mlevitsk, pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	intel-gvt-dev,

On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> On Thu, 28 Mar 2019 10:21:38 +0100
> Erik Skultety <eskultet@redhat.com> wrote:
> 
> > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > > hi Alex and Dave,
> > > Thanks for your replies.
> > > Please see my comments inline.
> > >
> > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:  
> > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > >  
> > > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > > a device that has less device memory ?  
> > > > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > > along with verion ?).  
> > > > > > > >
> > > > > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > > > > providers. But we still think about the live migration between from the the lower
> > > > > > > > generation of hardware migrated to the higher generation.  
> > > > > > >
> > > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > > support.
> > > > > > >
> > > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > > as well.
> > > > > > >
> > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > > What about some kind of revision?  
> > > > > > hi Alex and Cornelia
> > > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > > and "device version" fields?
> > > > > >
> > > > > > version field: identify live migration interface's version. it can have a
> > > > > > sort of backward compatibility, like target machine's version >= source
> > > > > > machine's version. something like that.  
> > > >
> > > > Don't we essentially already have this via the device specific region?
> > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > can declare a migration id and increment the version for any
> > > > incompatible changes to the protocol.  
> > > yes, good idea!
> > > so, what about declaring below new cap?
> > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > >     struct vfio_region_info_cap_migration {
> > >         struct vfio_info_cap_header header;
> > >         __u32 device_version_len;
> > >         __u8  device_version[];
> > >     };
> 
> I'm not sure why we need a new region for everything, it seems this
> could fit within the protocol of a single region.  This could simply be
> a new action to retrieve the version where the protocol would report
> the number of bytes available, just like the migration stream itself.
So, to get the version of the VFIO live migration device state interface
(let's simply call it the migration interface?), a new cap would look like
this:
#define VFIO_REGION_INFO_CAP_MIGRATION 4
It contains struct vfio_info_cap_header only.
When getting region info of the migration region, we query this cap and get
the migration interface's version. Right?

Or is it fine to just directly use VFIO_REGION_INFO_CAP_TYPE?


> > > > > > device_version field consists two parts:
> > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.  
> > > >
> > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > of the 32bit address space.  See for example:
> > > >
> > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > >
> > > > For vendor specific regions.  
> > > Yes, use PCI vendor ID.
> > > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > to identify it's a PCI ID.
> > > Thanks for pointing it out.
> > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > > why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> > > uses it.
> 
> PCI vendor IDs are 16bits, it's just indicating that when the
> PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.

thanks:)

> > > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > > >
> > > > > >
> > > > > > struct vfio_device_state_ctl {
> > > > > >      __u32 version;            /* ro */
> > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > > >      struct {
> > > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > 	...
> > > > > >      }data;
> > > > > >      ...
> > > > > >  };  
> > > >
> > > > We have a buffer area where we can read and write data from the vendor
> > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > device version string but use a static fixed length version string in
> > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > > specific version information length.  
> > > you are right, such static fixed length version string is bad :)
> > > To get device version, do you think which approach below is better?
> > > 1. use GET_VERSION action, and read from region buffer
> > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > >
> > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > > for checking migration interface's version?
> 
> I think 1 provides the most flexibility to the vendor driver.

Got it.
For VFIO live migration, compared to reuse device state region (which takes
GET_BUFFER/SET_BUFFER actions),
could we create a new region for GET_VERSION & CHECK_VERSION ?

> > > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > > >
> > > > > > The flow to figure out whether a source device is migratable to target device
> > > > > > is as follows:
> > > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > > 2. in target side's .load_state, load source device's device version string
> > > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > > to check whether the source device is compatible to it.
> > > > > >
> > > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > > to target device according to its proprietary table.
> > > > > > In the device_version string, the vendor driver only has to describe the
> > > > > > source device as elaborately as possible and resort to the vendor driver
> > > > > > on the target side to figure out whether they are compatible.
> > > >
> > > > I agree, it's too complicated and restrictive to try to create an
> > > > interface for the user to determine compatibility, let the driver
> > > > declare it compatible or not.  
> > > :)
> > >  
> > > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > > externally - so we could be able to answer a question like 'can we
> > > > > migrate this VM to this host' - from the management layer before it
> > > > > actually starts the migration.  
> > >
> > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > > GET_VERSION returns a vm's device's version string.
> > > CHECK_VERSION's input is a device version string and it returns
> > > compatible/non-compatible.
> > > Do you think it's good?
> 
> That's the idea, but note that QEMU can only provide the QMP interface,
> the sysfs interface would of course be provided as more of a direct
> path from the vendor driver or mdev kernel layer.
> 
> > > > I think we'd need to mirror this capability in sysfs to support that,
> > > > or create a qmp interface through QEMU that the device owner could make
> > > > the request on behalf of the management layer.  Getting access to the
> > > > vfio device requires an iommu context that's already in use by the
> > > > device owner, we have no intention of supporting a model that allows
> > > > independent tasks access to a device.  Thanks,
> > > > Alex
> > > >  
> > > do you think two sysfs nodes under a device node is ok?
> > > e.g.
> > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version  
> 
> I'd think it might live more in the mdev_supported_types area, wouldn't
> we ideally like to know if a device is compatible even before it's
> created?  For example maybe:
> 
> /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> 
> Where reading the sysfs attribute returns the version string and
> writing a string into the attribute returns an errno for incompatibility.
yes, knowing if a device is compatible before it's created is good.
but do you think checking whether a device is compatible after it's created
is also required? For live migration, the user usually only queries this
information when it's really needed, i.e. when a device has already been
created. maybe we can add this version get/check in both places?
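
[Editor's note: the write-to-check protocol proposed above can be exercised from userspace with plain file I/O. A minimal sketch follows; the function name and error mapping are illustrative, and the `version` attribute itself is the proposal under discussion, not an existing ABI.]

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a source device's version string into a target type's "version"
 * attribute.  Per the proposed protocol, the vendor driver accepts the
 * write iff the devices are migration-compatible.  Returns 0 when
 * compatible, a negative errno otherwise. */
int mdev_check_compat(const char *version_attr, const char *src_version)
{
    int fd = open(version_attr, O_WRONLY);
    if (fd < 0)
        return -errno;

    ssize_t len = (ssize_t)strlen(src_version);
    ssize_t ret = write(fd, src_version, len);
    int saved = (ret < 0) ? errno : 0;
    close(fd);

    if (ret < 0)
        return -saved;   /* vendor driver rejected the string: incompatible */
    if (ret != len)
        return -EIO;     /* short write: treat as failure */
    return 0;
}
```

A management layer would call this with a path like /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version before deciding where to migrate, well ahead of starting the migration itself.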


> 
> > Why do you need both sysfs and QMP at the same time? I can see it gives us some
> > flexibility, but is there something more to that?
> >
> > Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> > appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> > bunch of GPUs with different revisions which might not be backwards compatible.
> > Libvirt might query the version string on source and check it on dest via the
> > QMP in a way that QEMU, talking to the driver, would return either a list or a
> > single physical device to which we can migrate, because neither QEMU nor
> > libvirt know that, only the driver does, so that's an important information
> > rather than looping through all the devices and trying to find one that is
> > compatible. However, you might have a hard time making all the necessary
> > changes in QMP introspectable, a new command would be fine, but if you also
> > wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> > schema and libvirt would not be able to detect support for it.
> > 
> > On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> > much, as it still carries the burden of being able to check this only at the
> > time of migration, which e.g. OpenStack would like to know long before that.
> > 
> > So, having sysfs attributes would work for both libvirt (even though libvirt
> > would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> > have to figure out how to create the mappings between compatible devices across
> > several nodes which are non-uniform.
> 
> Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> utility than a QMP interface.  For instance we couldn't predetermine if
> an mdev type on a host is compatible if we need to first create the
> device and launch a QEMU instance on it to get access to QMP.  So maybe
> the question is whether we should bother with any sort of VFIO API to
> do this comparison, perhaps only a sysfs interface is sufficient for a
> complete solution.  The downside of not having a version API in the
> user interface might be that QEMU on its own can only try a migration
> and see if it fails, it wouldn't have the ability to test expected
> compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> 
So QEMU vfio uses sysfs to check device compatibility in migration's
save_setup phase?

Thanks
Yan

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-29  2:47                       ` Zhao Yan
@ 2019-03-29 14:26                         ` Alex Williamson
  2019-03-29 23:10                           ` Zhao Yan
  0 siblings, 1 reply; 133+ messages in thread
From: Alex Williamson @ 2019-03-29 14:26 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, Erik Skultety, Yang, Ziye,
	mlevitsk, pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	intel-gvt-dev,

On Thu, 28 Mar 2019 22:47:04 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> > On Thu, 28 Mar 2019 10:21:38 +0100
> > Erik Skultety <eskultet@redhat.com> wrote:
> >   
> > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:  
> > > > hi Alex and Dave,
> > > > Thanks for your replies.
> > > > Please see my comments inline.
> > > >
> > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:    
> > > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > > >    
> > > > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:    
> > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:    
> > > > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > > > a device that has less device memory ?    
> > > > > > > > > > > > Actually it's still an open question for VFIO migration. Need to think about
> > > > > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > > > > along with version?).
> > > > > > > > >
> > > > > > > > > We must keep the hardware generation the same within one POD of public cloud
> > > > > > > > > providers. But we are still thinking about live migration from a lower
> > > > > > > > > generation of hardware to a higher generation.
> > > > > > > >
> > > > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > > > support.
> > > > > > > >
> > > > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > > > as well.
> > > > > > > >
> > > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > > > What about some kind of revision?    
> > > > > > > hi Alex and Cornelia
> > > > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > > > and "device version" fields?
> > > > > > >
> > > > > > > version field: identify live migration interface's version. it can have a
> > > > > > > sort of backward compatibility, like target machine's version >= source
> > > > > > > machine's version. something like that.    
> > > > >
> > > > > Don't we essentially already have this via the device specific region?
> > > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > > can declare a migration id and increment the version for any
> > > > > incompatible changes to the protocol.    
> > > > yes, good idea!
> > > > so, what about declaring below new cap?
> > > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > >     struct vfio_region_info_cap_migration {
> > > >         struct vfio_info_cap_header header;
> > > >         __u32 device_version_len;
> > > >         __u8  device_version[];
> > > >     };  
> > 
> > I'm not sure why we need a new region for everything, it seems this
> > could fit within the protocol of a single region.  This could simply be
> > a new action to retrieve the version where the protocol would report
> > the number of bytes available, just like the migration stream itself.  
> so, to get version of VFIO live migration device state interface (simply
> call it migration interface?),
> a new cap looks like this:
> #define VFIO_REGION_INFO_CAP_MIGRATION 4
> it contains struct vfio_info_cap_header only.
> when getting region info of the migration region, we query this cap and get
> migration interface's version. right?
> 
> or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?

Again, why a new region.  I'm imagining we have one region and this is
just asking for a slightly different thing from it.  But TBH, I'm not
sure we need it at all vs the sysfs interface.

> > > > > > > device_version field consists of two parts:
> > > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.    
> > > > >
> > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > > of the 32bit address space.  See for example:
> > > > >
> > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > > >
> > > > > For vendor specific regions.    
> > > > Yes, use PCI vendor ID.
> > > > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > > to identify it's a PCI ID.
> > > > Thanks for pointing it out.
> > > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > > > why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> > > > uses it.  
> > 
> > PCI vendor IDs are 16bits, it's just indicating that when the
> > PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.  
> 
> thanks:)
> 
> > > > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > > > >
> > > > > > >
> > > > > > > struct vfio_device_state_ctl {
> > > > > > >      __u32 version;            /* ro */
> > > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > > > >      struct {
> > > > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > > 	...
> > > > > > >      }data;
> > > > > > >      ...
> > > > > > >  };    
> > > > >
> > > > > We have a buffer area where we can read and write data from the vendor
> > > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > > device version string but use a static fixed length version string in
> > > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > > > specific version information length.    
> > > > you are right, such static fixed length version string is bad :)
> > > > To get device version, do you think which approach below is better?
> > > > 1. use GET_VERSION action, and read from region buffer
> > > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > > >
> > > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > > > for checking migration interface's version?  
> > 
> > I think 1 provides the most flexibility to the vendor driver.  
> 
> Got it.
> For VFIO live migration, instead of reusing the device state region (which
> takes GET_BUFFER/SET_BUFFER actions), could we create a new region for
> GET_VERSION & CHECK_VERSION?

Why?

> > > > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > > > >
> > > > > > > The flow to figure out whether a source device is migratable to target device
> > > > > > > is as follows:
> > > > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > > > 2. in target side's .load_state, load source device's device version string
> > > > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > > > to check whether the source device is compatible to it.
> > > > > > >
> > > > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > > > to target device according to its proprietary table.
> > > > > > > In the device_version string, the vendor driver only has to describe the
> > > > > > > source device as elaborately as possible and resort to the vendor driver
> > > > > > > on the target side to figure out whether they are compatible.
> > > > >
> > > > > I agree, it's too complicated and restrictive to try to create an
> > > > > interface for the user to determine compatibility, let the driver
> > > > > declare it compatible or not.    
> > > > :)
> > > >    
> > > > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > > > externally - so we could be able to answer a question like 'can we
> > > > > > migrate this VM to this host' - from the management layer before it
> > > > > > actually starts the migration.    
> > > >
> > > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > > > GET_VERSION returns a vm's device's version string.
> > > > CHECK_VERSION's input is a device version string and it returns
> > > > compatible/non-compatible.
> > > > Do you think it's good?  
> > 
> > That's the idea, but note that QEMU can only provide the QMP interface,
> > the sysfs interface would of course be provided as more of a direct
> > path from the vendor driver or mdev kernel layer.
> >   
> > > > > I think we'd need to mirror this capability in sysfs to support that,
> > > > > or create a qmp interface through QEMU that the device owner could make
> > > > > the request on behalf of the management layer.  Getting access to the
> > > > > vfio device requires an iommu context that's already in use by the
> > > > > device owner, we have no intention of supporting a model that allows
> > > > > independent tasks access to a device.  Thanks,
> > > > > Alex
> > > > >    
> > > > do you think two sysfs nodes under a device node is ok?
> > > > e.g.
> > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version    
> > 
> > I'd think it might live more in the mdev_supported_types area, wouldn't
> > we ideally like to know if a device is compatible even before it's
> > created?  For example maybe:
> > 
> > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> > 
> > Where reading the sysfs attribute returns the version string and
> > writing a string into the attribute returns an errno for incompatibility.
> yes, knowing if a device is compatible before it's created is good.
> but do you think checking whether a device is compatible after it's created
> is also required? For live migration, the user usually only queries this
> information when it's really needed, i.e. when a device has already been
> created. maybe we can add this version get/check in both places?

Why does an instantiated device suddenly not follow the version and
compatibility rules of an uninstantiated device?  IOW, if the version
and compatibility check are on the mdev type, why can't we trace back
from the device to the mdev type and make use of that same interface?
Seems the only question is whether we require an interface through the
vfio API directly or if sysfs is sufficient.
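
[Editor's note: tracing back from an instantiated device to its type's attribute, as Alex suggests, would be straightforward given the `mdev_type` symlink each mdev device already carries in sysfs. A sketch, assuming the per-type `version` attribute from this proposal existed:]

```c
#include <stdio.h>
#include <stddef.h>

/* An instantiated mdev device exposes an "mdev_type" symlink pointing
 * back at the mdev_supported_types entry that created it, so the same
 * per-type "version" attribute would serve both the created and the
 * not-yet-created case.  Returns 0 on success, -1 if the buffer is
 * too small. */
int mdev_version_attr_path(const char *device_sysfs, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%s/mdev_type/version", device_sysfs);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, starting from /sys/bus/mdev/devices/882cc4da-dede-11e7-9180-078a62063ab1 this resolves to the same attribute as the mdev_supported_types path quoted above, so no second interface is needed.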

> > > Why do you need both sysfs and QMP at the same time? I can see it gives us some
> > > flexibility, but is there something more to that?
> > >
> > > Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> > > appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> > > bunch of GPUs with different revisions which might not be backwards compatible.
> > > Libvirt might query the version string on source and check it on dest via the
> > > QMP in a way that QEMU, talking to the driver, would return either a list or a
> > > single physical device to which we can migrate, because neither QEMU nor
> > > libvirt know that, only the driver does, so that's an important information
> > > rather than looping through all the devices and trying to find one that is
> > > compatible. However, you might have a hard time making all the necessary
> > > changes in QMP introspectable, a new command would be fine, but if you also
> > > wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> > > schema and libvirt would not be able to detect support for it.
> > > 
> > > On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> > > much, as it still carries the burden of being able to check this only at the
> > > time of migration, which e.g. OpenStack would like to know long before that.
> > > 
> > > So, having sysfs attributes would work for both libvirt (even though libvirt
> > > would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> > > have to figure out how to create the mappings between compatible devices across
> > > several nodes which are non-uniform.  
> > 
> > Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> > utility than a QMP interface.  For instance we couldn't predetermine if
> > an mdev type on a host is compatible if we need to first create the
> > device and launch a QEMU instance on it to get access to QMP.  So maybe
> > the question is whether we should bother with any sort of VFIO API to
> > do this comparison, perhaps only a sysfs interface is sufficient for a
> > complete solution.  The downside of not having a version API in the
> > user interface might be that QEMU on its own can only try a migration
> > and see if it fails, it wouldn't have the ability to test expected
> > compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> >   
> So QEMU vfio uses sysfs to check device compatibility in migration's
> save_setup phase?

The migration stream between source and target device are the ultimate
test of compatibility, the vendor driver should never rely on userspace
validating compatibility of the migration.  At the point it could do so, the
migration has already begun, so we're only testing how quickly we can
fail the migration.  The management layer setting up the migration can
test via sysfs for compatibility and the migration stream itself needs
to be self validating, so what value is added for QEMU to perform a
version compatibility test?  Thanks,

Alex
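
[Editor's note: the "self-validating stream" Alex argues for could look like this inside a vendor driver. The header layout, magic value, and names below are all invented for illustration; the point is that the target rejects the very first chunk of an incompatible migration rather than failing subtly later.]

```c
#include <stdint.h>

/* Hypothetical preamble emitted by the source vendor driver at the head
 * of its device state.  Nothing here is a defined ABI; it illustrates
 * fail-fast validation carried in the migration data itself. */
struct mig_stream_hdr {
    uint32_t magic;        /* fixed value identifying this stream format */
    uint32_t version_len;  /* bytes of version string following the header */
};

#define MIG_STREAM_MAGIC 0x4d494756u  /* arbitrary example constant */

/* Target-side check on the first incoming chunk: reject before touching
 * any device state if the stream cannot possibly be ours. */
int mig_hdr_validate(const struct mig_stream_hdr *hdr,
                     uint32_t bytes_after_hdr)
{
    if (hdr->magic != MIG_STREAM_MAGIC)
        return -1;                      /* not this vendor's stream */
    if (hdr->version_len > bytes_after_hdr)
        return -1;                      /* truncated or corrupt stream */
    return 0;
}
```

With a check like this in the vendor driver's SET_BUFFER path, userspace version queries become an optimization for management layers, not a correctness requirement, which is exactly the division of labor described above.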


* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-29 14:26                         ` Alex Williamson
@ 2019-03-29 23:10                           ` Zhao Yan
  2019-03-30 14:14                             ` Alex Williamson
  0 siblings, 1 reply; 133+ messages in thread
From: Zhao Yan @ 2019-03-29 23:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, Erik Skultety, Yang, Ziye,
	mlevitsk, pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	intel-gvt-dev,

On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> On Thu, 28 Mar 2019 22:47:04 -0400
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > Erik Skultety <eskultet@redhat.com> wrote:
> > >   
> > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:  
> > > > > hi Alex and Dave,
> > > > > Thanks for your replies.
> > > > > Please see my comments inline.
> > > > >
> > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:    
> > > > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > > > >    
> > > > > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:    
> > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:    
> > > > > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > > > > a device that has less device memory ?    
> > > > > > > > > > > > > Actually it's still an open question for VFIO migration. Need to think about
> > > > > > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > > > > > along with version?).
> > > > > > > > > >
> > > > > > > > > > We must keep the hardware generation the same within one POD of public cloud
> > > > > > > > > > providers. But we are still thinking about live migration from a lower
> > > > > > > > > > generation of hardware to a higher generation.
> > > > > > > > >
> > > > > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > > > > support.
> > > > > > > > >
> > > > > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > > > > as well.
> > > > > > > > >
> > > > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > > > > What about some kind of revision?    
> > > > > > > > hi Alex and Cornelia
> > > > > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > > > > and "device version" fields?
> > > > > > > >
> > > > > > > > version field: identify live migration interface's version. it can have a
> > > > > > > > sort of backward compatibility, like target machine's version >= source
> > > > > > > > machine's version. something like that.    
> > > > > >
> > > > > > Don't we essentially already have this via the device specific region?
> > > > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > > > can declare a migration id and increment the version for any
> > > > > > incompatible changes to the protocol.    
> > > > > yes, good idea!
> > > > > so, what about declaring below new cap?
> > > > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > >     struct vfio_region_info_cap_migration {
> > > > >         struct vfio_info_cap_header header;
> > > > >         __u32 device_version_len;
> > > > >         __u8  device_version[];
> > > > >     };  
> > > 
> > > I'm not sure why we need a new region for everything, it seems this
> > > could fit within the protocol of a single region.  This could simply be
> > > a new action to retrieve the version where the protocol would report
> > > the number of bytes available, just like the migration stream itself.  
> > so, to get version of VFIO live migration device state interface (simply
> > call it migration interface?),
> > a new cap looks like this:
> > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > it contains struct vfio_info_cap_header only.
> > when getting region info of the migration region, we query this cap and get
> > migration interface's version. right?
> > 
> > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?
> 
> Again, why a new region.  I'm imagining we have one region and this is
> just asking for a slightly different thing from it.  But TBH, I'm not
> sure we need it at all vs the sysfs interface.
> 
> > > > > > > > device_version field consists of two parts:
> > > > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.    
> > > > > >
> > > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > > > of the 32bit address space.  See for example:
> > > > > >
> > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > > > >
> > > > > > For vendor specific regions.    
> > > > > Yes, use PCI vendor ID.
> > > > > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > > > to identify it's a PCI ID.
> > > > > Thanks for pointing it out.
> > > > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > > > > why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> > > > > uses it.  
> > > 
> > > PCI vendor IDs are 16bits, it's just indicating that when the
> > > PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.  
> > 
> > thanks:)
> > 
> > > > > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > > > > >
> > > > > > > >
> > > > > > > > struct vfio_device_state_ctl {
> > > > > > > >      __u32 version;            /* ro */
> > > > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > > > > >      struct {
> > > > > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > > > 	...
> > > > > > > >      }data;
> > > > > > > >      ...
> > > > > > > >  };    
> > > > > >
> > > > > > We have a buffer area where we can read and write data from the vendor
> > > > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > > > device version string but use a static fixed length version string in
> > > > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > > > > specific version information length.    
> > > > > you are right, such static fixed length version string is bad :)
> > > > > To get device version, do you think which approach below is better?
> > > > > 1. use GET_VERSION action, and read from region buffer
> > > > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > > > >
> > > > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > > > > for checking migration interface's version?  
> > > 
> > > I think 1 provides the most flexibility to the vendor driver.  
> > 
> > Got it.
> > For VFIO live migration, instead of reusing the device state region (which
> > takes GET_BUFFER/SET_BUFFER actions), could we create a new region for
> > GET_VERSION & CHECK_VERSION?
> 
> Why?
> 
> > > > > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > > > > >
> > > > > > > > The flow to figure out whether a source device is migratable to target device
> > > > > > > > is as follows:
> > > > > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > > > > 2. in target side's .load_state, load source device's device version string
> > > > > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > > > > to check whether the source device is compatible to it.
> > > > > > > >
> > > > > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > > > > to target device according to its proprietary table.
> > > > > > > > In the device_version string, the vendor driver only has to describe the
> > > > > > > > source device as elaborately as possible and resort to the vendor driver
> > > > > > > > on the target side to figure out whether they are compatible.
> > > > > >
> > > > > > I agree, it's too complicated and restrictive to try to create an
> > > > > > interface for the user to determine compatibility, let the driver
> > > > > > declare it compatible or not.    
> > > > > :)
> > > > >    
> > > > > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > > > > externally - so we could be able to answer a question like 'can we
> > > > > > > migrate this VM to this host' - from the management layer before it
> > > > > > > actually starts the migration.    
> > > > >
> > > > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > > > > GET_VERSION returns a vm's device's version string.
> > > > > CHECK_VERSION's input is a device version string and it returns
> > > > > compatible/non-compatible.
> > > > > Do you think it's good?  
> > > 
> > > That's the idea, but note that QEMU can only provide the QMP interface,
> > > the sysfs interface would of course be provided as more of a direct
> > > path from the vendor driver or mdev kernel layer.
> > >   
> > > > > > I think we'd need to mirror this capability in sysfs to support that,
> > > > > > or create a qmp interface through QEMU that the device owner could make
> > > > > > the request on behalf of the management layer.  Getting access to the
> > > > > > vfio device requires an iommu context that's already in use by the
> > > > > > device owner, we have no intention of supporting a model that allows
> > > > > > independent tasks access to a device.  Thanks,
> > > > > > Alex
> > > > > >    
> > > > > do you think two sysfs nodes under a device node is ok?
> > > > > e.g.
> > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version    
> > > 
> > > I'd think it might live more in the mdev_supported_types area, wouldn't
> > > we ideally like to know if a device is compatible even before it's
> > > created?  For example maybe:
> > > 
> > > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> > > 
> > > Where reading the sysfs attribute returns the version string and
> > > writing a string into the attribute return an errno for incompatibility.  
> > yes, knowing if a device is compatible before it's created is good.
> > but do you think checking whether a device is compatible after it's created is
> > also required? For live migration, the user usually only queries this information
> > when it's really required, i.e. when a device has already been created.
> > maybe we can add this version get/check at both places?
> 
> Why does an instantiated device suddenly not follow the version and
> compatibility rules of an uninstantiated device?  IOW, if the version
> and compatibility check are on the mdev type, why can't we trace back
> from the device to the mdev type and make use of that same interface?
> Seems the only question is whether we require an interface through the
> vfio API directly or if sysfs is sufficient.
ok. got it.

> > > > Why do you need both sysfs and QMP at the same time? I can see it gives us some
> > > > flexibility, but is there something more to that?
> > > >
> > > > Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> > > > appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> > > > bunch of GPUs with different revisions which might not be backwards compatible.
> > > > Libvirt might query the version string on source and check it on dest via the
> > > > QMP in a way that QEMU, talking to the driver, would return either a list or a
> > > > single physical device to which we can migrate, because neither QEMU nor
> > > > libvirt know that, only the driver does, so that's an important information
> > > > rather than looping through all the devices and trying to find one that is
> > > > compatible. However, you might have a hard time making all the necessary
> > > > changes in QMP introspectable, a new command would be fine, but if you also
> > > > wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> > > > schema and libvirt would not be able to detect support for it.
> > > > 
> > > > On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> > > > much, as it still carries the burden of being able to check this only at the
> > > > time of migration, which e.g. OpenStack would like to know long before that.
> > > > 
> > > > So, having sysfs attributes would work for both libvirt (even though libvirt
> > > > would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> > > > have to figure out how to create the mappings between compatible devices across
> > > > several nodes which are non-uniform.  
> > > 
> > > Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> > > utility than a QMP interface.  For instance we couldn't predetermine if
> > > an mdev type on a host is compatible if we need to first create the
> > > device and launch a QEMU instance on it to get access to QMP.  So maybe
> > > the question is whether we should bother with any sort of VFIO API to
> > > do this comparison, perhaps only a sysfs interface is sufficient for a
> > > complete solution.  The downside of not having a version API in the
> > > user interface might be that QEMU on its own can only try a migration
> > > and see if it fails, it wouldn't have the ability to test expected
> > > compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> > >   
> > So QEMU vfio uses sysfs to check device compatibility in migration's save_setup
> > phase?
> 
> The migration stream between source and target devices is the ultimate
> test of compatibility, the vendor driver should never rely on userspace
> validating compatibility of the migration.  At the point it could do so, the
> migration has already begun, so we're only testing how quickly we can
> fail the migration.  The management layer setting up the migration can
> test via sysfs for compatibility and the migration stream itself needs
> to be self validating, so what value is added for QEMU to perform a
> version compatibility test?  Thanks,
oh, do you mean the vendor driver should embed the source device's version in the
migration stream, which is opaque to qemu?
otherwise, I can't think of a quick way for the vendor driver to determine whether
the source device is incompatible.



Thanks
Yan

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-29 23:10                           ` Zhao Yan
@ 2019-03-30 14:14                             ` Alex Williamson
  2019-04-01  2:17                               ` Zhao Yan
  0 siblings, 1 reply; 133+ messages in thread
From: Alex Williamson @ 2019-03-30 14:14 UTC (permalink / raw)
  To: Zhao Yan
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, Erik Skultety, Yang, Ziye,
	mlevitsk, pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	intel-gvt-dev,

On Fri, 29 Mar 2019 19:10:50 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:

> On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> > On Thu, 28 Mar 2019 22:47:04 -0400
> > Zhao Yan <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:  
> > > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > > Erik Skultety <eskultet@redhat.com> wrote:
> > > >     
> > > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:    
> > > > > > hi Alex and Dave,
> > > > > > Thanks for your replies.
> > > > > > Please see my comments inline.
> > > > > >
> > > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:      
> > > > > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > > > > >      
> > > > > > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:      
> > > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:      
> > > > > > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > > > > > a device that has less device memory ?      
> > > > > > > > > > > > > Actually it's still an open question for VFIO migration. We need to think about
> > > > > > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > > > > > along with a version ?).      
> > > > > > > > > > >
> > > > > > > > > > > We must keep the hardware generation the same within one POD of a public cloud
> > > > > > > > > > > provider. But we are still thinking about live migration from a lower
> > > > > > > > > > > generation of hardware to a higher generation.      
> > > > > > > > > >
> > > > > > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > > > > > support.
> > > > > > > > > >
> > > > > > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > > > > > as well.
> > > > > > > > > >
> > > > > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > > > > > What about some kind of revision?      
> > > > > > > > > hi Alex and Cornelia
> > > > > > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > > > > > and "device version" fields?
> > > > > > > > >
> > > > > > > > > version field: identify live migration interface's version. it can have a
> > > > > > > > > sort of backward compatibility, like target machine's version >= source
> > > > > > > > > machine's version. something like that.      
> > > > > > >
> > > > > > > Don't we essentially already have this via the device specific region?
> > > > > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > > > > can declare a migration id and increment the version for any
> > > > > > > incompatible changes to the protocol.      
> > > > > > yes, good idea!
> > > > > > so, what about declaring below new cap?
> > > > > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > > >     struct vfio_region_info_cap_migration {
> > > > > >         struct vfio_info_cap_header header;
> > > > > >         __u32 device_version_len;
> > > > > >         __u8  device_version[];
> > > > > >     };    
> > > > 
> > > > I'm not sure why we need a new region for everything, it seems this
> > > > could fit within the protocol of a single region.  This could simply be
> > > > a new action to retrieve the version where the protocol would report
> > > > the number of bytes available, just like the migration stream itself.    
> > > so, to get version of VFIO live migration device state interface (simply
> > > call it migration interface?),
> > > a new cap looks like this:
> > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > it contains struct vfio_info_cap_header only.
> > > when get region info of the migration region, we query this cap and get
> > > migration interface's version. right?
> > > 
> > > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?  
> > 
> > Again, why a new region.  I'm imagining we have one region and this is
> > just asking for a slightly different thing from it.  But TBH, I'm not
> > sure we need it at all vs the sysfs interface.
> >   
> > > > > > > > > device_version field consists of two parts:
> > > > > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.      
> > > > > > >
> > > > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > > > > of the 32bit address space.  See for example:
> > > > > > >
> > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > > > > >
> > > > > > > For vendor specific regions.      
> > > > > > Yes, use PCI vendor ID.
> > > > > > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > > > > to identify it's a PCI ID.
> > > > > > Thanks for pointing it out.
> > > > > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > > > > > Why is it 0xffff? I searched the QEMU and kernel code and did not find
> > > > > > anywhere that uses it.    
> > > > 
> > > > PCI vendor IDs are 16bits, it's just indicating that when the
> > > > PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.    
> > > 
> > > thanks:)
> > >   
> > > > > > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > struct vfio_device_state_ctl {
> > > > > > > > >      __u32 version;            /* ro */
> > > > > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > > > > > >      struct {
> > > > > > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > > > > 	...
> > > > > > > > >      }data;
> > > > > > > > >      ...
> > > > > > > > >  };      
> > > > > > >
> > > > > > > We have a buffer area where we can read and write data from the vendor
> > > > > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > > > > device version string but use a static fixed length version string in
> > > > > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > > > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > > > > > specific version information length.      
> > > > > > you are right, such static fixed length version string is bad :)
> > > > > > To get device version, do you think which approach below is better?
> > > > > > 1. use GET_VERSION action, and read from region buffer
> > > > > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > > > > >
> > > > > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > > > > > for checking migration interface's version?    
> > > > 
> > > > I think 1 provides the most flexibility to the vendor driver.    
> > > 
> > > Got it.
> > > For VFIO live migration, compared to reuse device state region (which takes
> > > GET_BUFFER/SET_BUFFER actions),
> > > could we create a new region for GET_VERSION & CHECK_VERSION ?  
> > 
> > Why?
> >   
> > > > > > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > > > > > >
> > > > > > > > > The flow to figure out whether a source device is migratable to target device
> > > > > > > > > is like that:
> > > > > > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > > > > > 2. in target side's .load_state, load source device's device version string
> > > > > > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > > > > > to check whether the source device is compatible to it.
> > > > > > > > >
> > > > > > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > > > > > to target device according to its proprietary table.
> > > > > > > > > In device_version string, vendor driver only has to describe the source
> > > > > > > > > device as elaborately as possible and resorts to vendor driver in target side
> > > > > > > > > to figure out whether they are compatible.      
> > > > > > >
> > > > > > > I agree, it's too complicated and restrictive to try to create an
> > > > > > > interface for the user to determine compatibility, let the driver
> > > > > > > declare it compatible or not.      
> > > > > > :)
> > > > > >      
> > > > > > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > > > > > externally - so we could be able to answer a question like 'can we
> > > > > > > > migrate this VM to this host' - from the management layer before it
> > > > > > > > actually starts the migration.      
> > > > > >
> > > > > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > > > > > GET_VERSION returns a vm's device's version string.
> > > > > > CHECK_VERSION's input is device version string and return
> > > > > > compatible/non-compatible.
> > > > > > Do you think it's good?    
> > > > 
> > > > That's the idea, but note that QEMU can only provide the QMP interface,
> > > > the sysfs interface would of course be provided as more of a direct
> > > > path from the vendor driver or mdev kernel layer.
> > > >     
> > > > > > > I think we'd need to mirror this capability in sysfs to support that,
> > > > > > > or create a qmp interface through QEMU that the device owner could make
> > > > > > > the request on behalf of the management layer.  Getting access to the
> > > > > > > vfio device requires an iommu context that's already in use by the
> > > > > > > device owner, we have no intention of supporting a model that allows
> > > > > > > independent tasks access to a device.  Thanks,
> > > > > > > Alex
> > > > > > >      
> > > > > > do you think two sysfs nodes under a device node is ok?
> > > > > > e.g.
> > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version      
> > > > 
> > > > I'd think it might live more in the mdev_support_types area, wouldn't
> > > > we ideally like to know if a device is compatible even before it's
> > > > created?  For example maybe:
> > > > 
> > > > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> > > > 
> > > > Where reading the sysfs attribute returns the version string and
> > > > writing a string into the attribute returns an errno for incompatibility.    
> > > yes, knowing if a device is compatible before it's created is good.
> > > but do you think checking whether a device is compatible after it's created is
> > > also required? For live migration, the user usually only queries this information
> > > when it's really required, i.e. when a device has already been created.  
> > > maybe we can add this version get/check at both places?  
> > 
> > Why does an instantiated device suddenly not follow the version and
> > compatibility rules of an uninstantiated device?  IOW, if the version
> > and compatibility check are on the mdev type, why can't we trace back
> > from the device to the mdev type and make use of that same interface?
> > Seems the only question is whether we require an interface through the
> > vfio API directly or if sysfs is sufficient.  
> ok. got it.
> 
> > > > > Why do you need both sysfs and QMP at the same time? I can see it gives us some
> > > > > flexibility, but is there something more to that?
> > > > >
> > > > > Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> > > > > appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> > > > > bunch of GPUs with different revisions which might not be backwards compatible.
> > > > > Libvirt might query the version string on source and check it on dest via the
> > > > > QMP in a way that QEMU, talking to the driver, would return either a list or a
> > > > > single physical device to which we can migrate, because neither QEMU nor
> > > > > libvirt know that, only the driver does, so that's an important information
> > > > > rather than looping through all the devices and trying to find one that is
> > > > > compatible. However, you might have a hard time making all the necessary
> > > > > changes in QMP introspectable, a new command would be fine, but if you also
> > > > > wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> > > > > schema and libvirt would not be able to detect support for it.
> > > > > 
> > > > > On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> > > > > much, as it still carries the burden of being able to check this only at the
> > > > > time of migration, which e.g. OpenStack would like to know long before that.
> > > > > 
> > > > > So, having sysfs attributes would work for both libvirt (even though libvirt
> > > > > would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> > > > > have to figure out how to create the mappings between compatible devices across
> > > > > several nodes which are non-uniform.    
> > > > 
> > > > Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> > > > utility than a QMP interface.  For instance we couldn't predetermine if
> > > > an mdev type on a host is compatible if we need to first create the
> > > > device and launch a QEMU instance on it to get access to QMP.  So maybe
> > > > the question is whether we should bother with any sort of VFIO API to
> > > > do this comparison, perhaps only a sysfs interface is sufficient for a
> > > > complete solution.  The downside of not having a version API in the
> > > > user interface might be that QEMU on its own can only try a migration
> > > > and see if it fails, it wouldn't have the ability to test expected
> > > > compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> > > >     
> > > So QEMU vfio uses sysfs to check device compatibility in migration's save_setup
> > > phase?  
> > 
> > The migration stream between source and target devices is the ultimate
> > test of compatibility, the vendor driver should never rely on userspace
> > validating compatibility of the migration.  At the point it could do so, the
> > migration has already begun, so we're only testing how quickly we can
> > fail the migration.  The management layer setting up the migration can
> > test via sysfs for compatibility and the migration stream itself needs
> > to be self validating, so what value is added for QEMU to perform a
> > version compatibility test?  Thanks,  
> oh, do you mean the vendor driver should embed the source device's version in the
> migration stream, which is opaque to qemu?
> otherwise, I can't think of a quick way for the vendor driver to determine whether
> the source device is incompatible.  

Yes, the vendor driver cannot rely on the user to make sure the
incoming migration stream is compatible, the vendor driver must take
responsibility for this.  Therefore, regardless of what other
interfaces we have for the user to test the compatibility between
devices, the vendor driver must make no assumptions about the validity
or integrity of the data stream.  Plan for and protect against a
malicious or incompetent user.  Thanks,

Alex


* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-30 14:14                             ` Alex Williamson
@ 2019-04-01  2:17                               ` Zhao Yan
  0 siblings, 0 replies; 133+ messages in thread
From: Zhao Yan @ 2019-04-01  2:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, Erik Skultety, Yang, Ziye,
	mlevitsk, pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	intel-gvt-dev,

On Sat, Mar 30, 2019 at 10:14:07PM +0800, Alex Williamson wrote:
> On Fri, 29 Mar 2019 19:10:50 -0400
> Zhao Yan <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> > > On Thu, 28 Mar 2019 22:47:04 -0400
> > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:  
> > > > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > > > Erik Skultety <eskultet@redhat.com> wrote:
> > > > >     
> > > > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:    
> > > > > > > hi Alex and Dave,
> > > > > > > Thanks for your replies.
> > > > > > > Please see my comments inline.
> > > > > > >
> > > > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:      
> > > > > > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > > > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > > > > > >      
> > > > > > > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:      
> > > > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:      
> > > > > > > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > > > > > > a device that has less device memory ?      
> > > > > > > > > > > > > > Actually it's still an open question for VFIO migration. We need to think about
> > > > > > > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > > > > > > along with a version ?).      
> > > > > > > > > > > >
> > > > > > > > > > > > We must keep the hardware generation the same within one POD of a public cloud
> > > > > > > > > > > > provider. But we are still thinking about live migration from a lower
> > > > > > > > > > > > generation of hardware to a higher generation.      
> > > > > > > > > > >
> > > > > > > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > > > > > > support.
> > > > > > > > > > >
> > > > > > > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > > > > > > as well.
> > > > > > > > > > >
> > > > > > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > > > > > > What about some kind of revision?      
> > > > > > > > > > hi Alex and Cornelia
> > > > > > > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > > > > > > and "device version" fields?
> > > > > > > > > >
> > > > > > > > > > version field: identify live migration interface's version. it can have a
> > > > > > > > > > sort of backward compatibility, like target machine's version >= source
> > > > > > > > > > machine's version. something like that.      
> > > > > > > >
> > > > > > > > Don't we essentially already have this via the device specific region?
> > > > > > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > > > > > can declare a migration id and increment the version for any
> > > > > > > > incompatible changes to the protocol.      
> > > > > > > yes, good idea!
> > > > > > > so, what about declaring below new cap?
> > > > > > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > > > >     struct vfio_region_info_cap_migration {
> > > > > > >         struct vfio_info_cap_header header;
> > > > > > >         __u32 device_version_len;
> > > > > > >         __u8  device_version[];
> > > > > > >     };    
> > > > > 
> > > > > I'm not sure why we need a new region for everything, it seems this
> > > > > could fit within the protocol of a single region.  This could simply be
> > > > > a new action to retrieve the version where the protocol would report
> > > > > the number of bytes available, just like the migration stream itself.    
> > > > so, to get version of VFIO live migration device state interface (simply
> > > > call it migration interface?),
> > > > a new cap looks like this:
> > > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > it contains struct vfio_info_cap_header only.
> > > > when get region info of the migration region, we query this cap and get
> > > > migration interface's version. right?
> > > > 
> > > > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?  
> > > 
> > > Again, why a new region.  I'm imagining we have one region and this is
> > > just asking for a slightly different thing from it.  But TBH, I'm not
> > > sure we need it at all vs the sysfs interface.
> > >   
> > > > > > > > > > device_version field consists of two parts:
> > > > > > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.      
> > > > > > > >
> > > > > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > > > > > of the 32bit address space.  See for example:
> > > > > > > >
> > > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > > > > > >
> > > > > > > > For vendor specific regions.      
> > > > > > > Yes, use PCI vendor ID.
> > > > > > > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > > > > > to identify it's a PCI ID.
> > > > > > > Thanks for pointing it out.
> > > > > > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > > > > > > why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> > > > > > > uses it.    
> > > > > 
> > > > > PCI vendor IDs are 16bits, it's just indicating that when the
> > > > > PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.    
> > > > 
> > > > thanks:)
> > > >   
> > > > > > > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > struct vfio_device_state_ctl {
> > > > > > > > > >      __u32 version;            /* ro */
> > > > > > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > > > > > > >      struct {
> > > > > > > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > > > > > 	...
> > > > > > > > > >      }data;
> > > > > > > > > >      ...
> > > > > > > > > >  };      
> > > > > > > >
> > > > > > > > We have a buffer area where we can read and write data from the vendor
> > > > > > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > > > > > device version string but use a static fixed length version string in
> > > > > > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > > > > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > > > > > > specific version information length.      
> > > > > > > you are right, such static fixed length version string is bad :)
> > > > > > > To get device version, do you think which approach below is better?
> > > > > > > 1. use GET_VERSION action, and read from region buffer
> > > > > > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > > > > > >
> > > > > > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > > > > > > for checking migration interface's version?    
> > > > > 
> > > > > I think 1 provides the most flexibility to the vendor driver.    
> > > > 
> > > > Got it.
> > > > For VFIO live migration, compared to reuse device state region (which takes
> > > > GET_BUFFER/SET_BUFFER actions),
> > > > could we create a new region for GET_VERSION & CHECK_VERSION ?  
> > > 
> > > Why?
> > >   
> > > > > > > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > > > > > > >
> > > > > > > > > > The flow to figure out whether a source device is migratable to target device
> > > > > > > > > > is like that:
> > > > > > > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > > > > > > 2. in target side's .load_state, load source device's device version string
> > > > > > > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > > > > > > to check whether the source device is compatible to it.
> > > > > > > > > >
> > > > > > > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > > > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > > > > > > to target device according to its proprietary table.
> > > > > > > > > > In device_version string, vendor driver only has to describe the source
> > > > > > > > > > device as elaborately as possible and resorts to vendor driver in target side
> > > > > > > > > > to figure out whether they are compatible.      
> > > > > > > >
> > > > > > > > I agree, it's too complicated and restrictive to try to create an
> > > > > > > > interface for the user to determine compatibility, let the driver
> > > > > > > > declare it compatible or not.      
> > > > > > > :)
> > > > > > >      
> > > > > > > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > > > > > > externally - so we could be able to answer a question like 'can we
> > > > > > > > > migrate this VM to this host' - from the management layer before it
> > > > > > > > > actually starts the migration.      
> > > > > > >
> > > > > > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > > > > > > GET_VERSION returns a vm's device's version string.
> > > > > > > CHECK_VERSION's input is device version string and return
> > > > > > > compatible/non-compatible.
> > > > > > > Do you think it's good?    
> > > > > 
> > > > > That's the idea, but note that QEMU can only provide the QMP interface,
> > > > > the sysfs interface would of course be provided as more of a direct
> > > > > path from the vendor driver or mdev kernel layer.
> > > > >     
> > > > > > > > I think we'd need to mirror this capability in sysfs to support that,
> > > > > > > > or create a qmp interface through QEMU that the device owner could make
> > > > > > > > the request on behalf of the management layer.  Getting access to the
> > > > > > > > vfio device requires an iommu context that's already in use by the
> > > > > > > > device owner, we have no intention of supporting a model that allows
> > > > > > > > independent tasks access to a device.  Thanks,
> > > > > > > > Alex
> > > > > > > >      
> > > > > > > do you think two sysfs nodes under a device node are ok?
> > > > > > > e.g.
> > > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version      
> > > > > 
> > > > > I'd think it might live more in the mdev_supported_types area, wouldn't
> > > > > we ideally like to know if a device is compatible even before it's
> > > > > created?  For example maybe:
> > > > > 
> > > > > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> > > > > 
> > > > > Where reading the sysfs attribute returns the version string and
> > > > > writing a string into the attribute returns an errno for incompatibility.    
> > > > yes, knowing if a device is compatible before it's created is good.
> > > > but do you think checking whether a device is compatible after it's created
> > > > is also required? For live migration, the user usually only queries this
> > > > information when it's really needed, i.e. once a device has been created.
> > > > maybe we can add this version get/check in both places?  
> > > 
> > > Why does an instantiated device suddenly not follow the version and
> > > compatibility rules of an uninstantiated device?  IOW, if the version
> > > and compatibility check are on the mdev type, why can't we trace back
> > > from the device to the mdev type and make use of that same interface?
> > > Seems the only question is whether we require an interface through the
> > > vfio API directly or if sysfs is sufficient.  
> > ok. got it.
> > 
> > > > > > Why do you need both sysfs and QMP at the same time? I can see it gives us some
> > > > > > flexibility, but is there something more to that?
> > > > > >
> > > > > > Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> > > > > > appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> > > > > > bunch of GPUs with different revisions which might not be backwards compatible.
> > > > > > Libvirt might query the version string on source and check it on dest via the
> > > > > > QMP in a way that QEMU, talking to the driver, would return either a list or a
> > > > > > single physical device to which we can migrate, because neither QEMU nor
> > > > > > libvirt know that, only the driver does, so that's an important information
> > > > > > rather than looping through all the devices and trying to find one that is
> > > > > > compatible. However, you might have a hard time making all the necessary
> > > > > > changes in QMP introspectable, a new command would be fine, but if you also
> > > > > > wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> > > > > > schema and libvirt would not be able to detect support for it.
> > > > > > 
> > > > > > On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> > > > > > much, as it still carries the burden of being able to check this only at the
> > > > > > time of migration, which e.g. OpenStack would like to know long before that.
> > > > > > 
> > > > > > So, having sysfs attributes would work for both libvirt (even though libvirt
> > > > > > would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> > > > > > have to figure out how to create the mappings between compatible devices across
> > > > > > several nodes which are non-uniform.    
> > > > > 
> > > > > Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> > > > > utility than a QMP interface.  For instance we couldn't predetermine if
> > > > > an mdev type on a host is compatible if we need to first create the
> > > > > device and launch a QEMU instance on it to get access to QMP.  So maybe
> > > > > the question is whether we should bother with any sort of VFIO API to
> > > > > do this comparison, perhaps only a sysfs interface is sufficient for a
> > > > > complete solution.  The downside of not having a version API in the
> > > > > user interface might be that QEMU on its own can only try a migration
> > > > > and see if it fails, it wouldn't have the ability to test expected
> > > > > compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> > > > >     
> > > > So QEMU vfio uses sysfs to check device compatibility in migration's
> > > > save_setup phase?  
> > > 
> > > The migration stream between the source and target devices is the ultimate
> > > test of compatibility; the vendor driver should never rely on userspace
> > > validating the compatibility of the migration.  At the point it could do so, the
> > > migration has already begun, so we're only testing how quickly we can
> > > fail the migration.  The management layer setting up the migration can
> > > test via sysfs for compatibility and the migration stream itself needs
> > > to be self validating, so what value is added for QEMU to perform a
> > > version compatibility test?  Thanks,  
> > oh, do you mean the vendor driver should embed the source device's version in
> > the migration stream, which is opaque to qemu?
> > otherwise, I can't think of a quick way for the vendor driver to determine
> > whether the source device is incompatible.  
> 
> Yes, the vendor driver cannot rely on the user to make sure the
> incoming migration stream is compatible, the vendor driver must take
> responsibility for this.  Therefore, regardless of what other
> interfaces we have for the user to test the compatibility between
> devices, the vendor driver must make no assumptions about the validity
> or integrity of the data stream.  Plan for and protect against a
> malicious or incompetent user.  Thanks,
>
ok. got it.
Thank you :)

Yan

^ permalink raw reply	[flat|nested] 133+ messages in thread
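Alex's sysfs proposal above can be sketched in a few lines of C. The "reading the attribute yields a version string, writing a source version string returns an errno when incompatible" semantics come from the thread; the attribute paths, helper names, and error handling below are purely illustrative, not a merged kernel ABI:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the proposed sysfs handshake (hypothetical attribute path):
 *   /sys/class/mdev_bus/<dev>/mdev_supported_types/<type>/version
 */

/* Read and trim the version string exposed by the attribute. */
static int read_version(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -errno;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -EIO;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Write the source version into the target's attribute; a failed write
 * (or a deferred error reported at close) means "incompatible". */
static int check_compat(const char *target_path, const char *src_version)
{
    FILE *f = fopen(target_path, "w");
    int ret = 0;

    if (!f)
        return -errno;
    if (fputs(src_version, f) < 0)
        ret = -errno;
    if (fclose(f) != 0 && ret == 0)
        ret = -errno;
    return ret;
}
```

A management layer would run read_version() against the source type and check_compat() against each candidate target type, before any migration is started and without needing a running QEMU instance.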

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-03-27 22:10               ` Alex Williamson
@ 2019-04-01  8:14                   ` Cornelia Huck
  2019-04-01  8:14                   ` [Qemu-devel] " Cornelia Huck
  1 sibling, 0 replies; 133+ messages in thread
From: Cornelia Huck @ 2019-04-01  8:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Zhao Yan, Dr. David Alan Gilbert,
	intel-gvt-dev@lists.freedesktop.org

On Wed, 27 Mar 2019 16:10:20 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 27 Mar 2019 20:18:54 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:    
> > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > a device that has less device memory ?      
> > > > > > > Actually it's still an open question for VFIO migration. We need to think
> > > > > > > about whether it's better to check that in libvirt or QEMU (e.g. a device
> > > > > > > magic along with a version?).      
> > > > > 
> > > > > We must keep the hardware generation the same within one POD of a public
> > > > > cloud provider. But we are still thinking about live migration from a lower
> > > > > hardware generation to a higher one.    
> > > > 
> > > > Agreed, lower->higher is the one direction that might make sense to
> > > > support.
> > > > 
> > > > But regardless of that, I think we need to make sure that incompatible
> > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > as well.
> > > > 
> > > > How easy is it to obtain that information in a form that can be
> > > > consumed by higher layers? Can we find out the device type at least?
> > > > What about some kind of revision?    
> > > hi Alex and Cornelia
> > > for device compatibility, do you think it's a good idea to use "version"
> > > and "device version" fields?
> > > 
> > > version field: identifies the live migration interface's version. It can have
> > > a sort of backward compatibility, e.g. target machine's version >= source
> > > machine's version, something like that.  
> 
> Don't we essentially already have this via the device specific region?
> The struct vfio_info_cap_header includes id and version fields, so we
> can declare a migration id and increment the version for any
> incompatible changes to the protocol.
> 
> > > 
> > > The device_version field consists of two parts:
> > > 1. vendor id: it takes 32 bits, e.g. 0x8086.  
> 
> Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> suggest we use a bit to flag it as such so we can reserve that portion
> of the 32bit address space.  See for example:
> 
> #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> 
> For vendor specific regions.

Just browsing through the thread... if I don't misunderstand, we could
use a vfio-ccw region type id here for ccw, couldn't we? Just to make
sure that this is not pci-specific.

^ permalink raw reply	[flat|nested] 133+ messages in thread
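To make the bit layout under discussion concrete, here is a small standalone sketch of how such a vendor-namespaced 32-bit value could be composed and decoded. Only the two #defines appear in the thread; the helper functions are invented for illustration:

```c
#include <stdint.h>

/* From the quoted uapi excerpt: bit 31 flags a PCI-vendor namespace,
 * and the low 16 bits carry the PCI vendor ID. */
#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE  (1U << 31)
#define VFIO_REGION_TYPE_PCI_VENDOR_MASK  (0xffffU)

/* Compose a type value inside a vendor's private namespace. */
static uint32_t pci_vendor_type(uint16_t vendor_id)
{
    return VFIO_REGION_TYPE_PCI_VENDOR_TYPE | vendor_id;
}

/* Recover the vendor ID, or -1 if the value is not PCI-namespaced. */
static int pci_vendor_of(uint32_t type)
{
    if (!(type & VFIO_REGION_TYPE_PCI_VENDOR_TYPE))
        return -1;
    return (int)(type & VFIO_REGION_TYPE_PCI_VENDOR_MASK);
}
```

With this scheme, e.g. Intel's namespace is (1 << 31) | 0x8086 == 0x80008086, and values with bit 31 clear remain available for other registrars.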

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-04-01  8:14                   ` [Qemu-devel] " Cornelia Huck
@ 2019-04-01  8:40                     ` Yan Zhao
  -1 siblings, 0 replies; 133+ messages in thread
From: Yan Zhao @ 2019-04-01  8:40 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	Alex Williamson, intel-gvt-dev@lists.freedesktop.org

On Mon, Apr 01, 2019 at 04:14:30PM +0800, Cornelia Huck wrote:
> On Wed, 27 Mar 2019 16:10:20 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 27 Mar 2019 20:18:54 +0000
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > 
> > > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:    
> > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > a device that has less device memory ?      
> > > > > > > > Actually it's still an open question for VFIO migration. We need to think
> > > > > > > > about whether it's better to check that in libvirt or QEMU (e.g. a device
> > > > > > > > magic along with a version?).      
> > > > > > 
> > > > > > We must keep the hardware generation the same within one POD of a public
> > > > > > cloud provider. But we are still thinking about live migration from a lower
> > > > > > hardware generation to a higher one.    
> > > > > 
> > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > support.
> > > > > 
> > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > as well.
> > > > > 
> > > > > How easy is it to obtain that information in a form that can be
> > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > What about some kind of revision?    
> > > > hi Alex and Cornelia
> > > > for device compatibility, do you think it's a good idea to use "version"
> > > > and "device version" fields?
> > > > 
> > > > version field: identifies the live migration interface's version. It can have
> > > > a sort of backward compatibility, e.g. target machine's version >= source
> > > > machine's version, something like that.  
> > 
> > Don't we essentially already have this via the device specific region?
> > The struct vfio_info_cap_header includes id and version fields, so we
> > can declare a migration id and increment the version for any
> > incompatible changes to the protocol.
> > 
> > > > 
> > > > The device_version field consists of two parts:
> > > > 1. vendor id: it takes 32 bits, e.g. 0x8086.  
> > 
> > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > suggest we use a bit to flag it as such so we can reserve that portion
> > of the 32bit address space.  See for example:
> > 
> > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > 
> > For vendor specific regions.
> 
> Just browsing through the thread... if I don't misunderstand, we could
> use a vfio-ccw region type id here for ccw, couldn't we? Just to make
> sure that this is not pci-specific.
Could CCW use another bit instead of bit 31?
e.g.
#define VFIO_REGION_TYPE_CCW_VENDOR_TYPE        (1 << 30)
then a ccw device would use (VFIO_REGION_TYPE_CCW_VENDOR_TYPE | vendor id) as
the first 32 bits of its device version string.

But as Alex said, we won't provide an extra region to get the device version;
since the device version is only exported in sysfs, we should probably define
them as below:
#define VFIO_DEVICE_VERSION_TYPE_PCI (1<<31)
#define VFIO_DEVICE_VERSION_TYPE_CCW (1<<30)

Do you think it's ok?

Thanks
Yan

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/5] QEMU VFIO live migration
  2019-04-01  8:40                     ` [Qemu-devel] " Yan Zhao
@ 2019-04-01 14:15                       ` Alex Williamson
  -1 siblings, 0 replies; 133+ messages in thread
From: Alex Williamson @ 2019-04-01 14:15 UTC (permalink / raw)
  To: Yan Zhao
  Cc: cjia, kvm, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	kwankhede, eauger, Liu, Yi L, eskultet, Yang, Ziye, mlevitsk,
	pasic, Gonglei (Arei),
	felipe, Ken.Xue, Tian, Kevin, Dr. David Alan Gilbert,
	intel-gvt-dev,

On Mon, 1 Apr 2019 04:40:03 -0400
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Mon, Apr 01, 2019 at 04:14:30PM +0800, Cornelia Huck wrote:
> > On Wed, 27 Mar 2019 16:10:20 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > >   
> > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:    
> > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:      
> > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > a device that has less device memory ?        
> > > > > > > > > Actually it's still an open question for VFIO migration. We need to think
> > > > > > > > > about whether it's better to check that in libvirt or QEMU (e.g. a device
> > > > > > > > > magic along with a version?).        
> > > > > > > 
> > > > > > > We must keep the hardware generation the same within one POD of a public
> > > > > > > cloud provider. But we are still thinking about live migration from a lower
> > > > > > > hardware generation to a higher one.      
> > > > > > 
> > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > support.
> > > > > > 
> > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > as well.
> > > > > > 
> > > > > > How easy is it to obtain that information in a form that can be
> > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > What about some kind of revision?      
> > > > > hi Alex and Cornelia
> > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > and "device version" fields?
> > > > > 
> > > > > version field: identifies the live migration interface's version. It can have
> > > > > a sort of backward compatibility, e.g. target machine's version >= source
> > > > > machine's version, something like that.    
> > > 
> > > Don't we essentially already have this via the device specific region?
> > > The struct vfio_info_cap_header includes id and version fields, so we
> > > can declare a migration id and increment the version for any
> > > incompatible changes to the protocol.
> > >   
> > > > > 
> > > > > The device_version field consists of two parts:
> > > > > 1. vendor id: it takes 32 bits, e.g. 0x8086.    
> > > 
> > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > suggest we use a bit to flag it as such so we can reserve that portion
> > > of the 32bit address space.  See for example:
> > > 
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > 
> > > For vendor specific regions.  
> > 
> > Just browsing through the thread... if I don't misunderstand, we could
> > use a vfio-ccw region type id here for ccw, couldn't we? Just to make
> > sure that this is not pci-specific.  
> CCW could use another bit other than bit 31?
> e.g.
> #define VFIO_REGION_TYPE_CCW_VENDOR_TYPE        (1 << 30)
> then ccw device use (VFIO_REGION_TYPE_CCW_VENDOR_TYPE | vendor id) as its
> first 32 bit for device version string.
> 
> But as Alex said we'll not provide an extra region to get device version,
> and device version is only exported in sysfs, probably we should define them as
> below:
> #define VFIO_DEVICE_VERSION_TYPE_PCI (1<<31)
> #define VFIO_DEVICE_VERSION_TYPE_CCW (1<<30)
> 
> Do you think it's ok?

We already had this discussion for device specific regions and decided
that CCW doesn't have enough vendors to justify a full subset of the
available address space.  Also, this doesn't need to imply the device
interface; we're simply specifying a vendor registrar so that we can
give each vendor their own namespace, so I don't think it would be a
problem for a CCW device to specify a namespace using a PCI vendor ID.
Finally, if we do have such a need in the future (I'm not sure where we
stand with this in the current proposals), maybe we should make use of
an IEEE OUI rather than the PCI database, to avoid this sort of
confusion and mis-association.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 133+ messages in thread
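As Alex stressed earlier in the thread, the vendor driver must validate the incoming migration stream itself rather than rely on userspace. A minimal sketch of that idea follows; the structure layout and all constants here are invented, since the thread only requires that the check happens in the vendor driver:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative header a vendor driver might prepend to its device state. */
struct mig_hdr {
    uint32_t magic;     /* constant identifying this vendor's stream */
    uint32_t version;   /* bumped on incompatible layout changes */
};

#define MIG_MAGIC   0x56464d49u   /* arbitrary example value */
#define MIG_VERSION 2u

/* Source side: emit the header at the head of the device state. */
static size_t save_header(uint8_t *buf)
{
    struct mig_hdr h = { MIG_MAGIC, MIG_VERSION };

    memcpy(buf, &h, sizeof(h));
    return sizeof(h);
}

/* Target side: never trust the incoming buffer; check length and contents
 * before restoring any further state.  Returns 0 if restore may proceed. */
static int load_header(const uint8_t *buf, size_t len)
{
    struct mig_hdr h;

    if (len < sizeof(h))
        return -1;
    memcpy(&h, buf, sizeof(h));
    if (h.magic != MIG_MAGIC || h.version != MIG_VERSION)
        return -1;
    return 0;
}
```

Any sysfs or QMP compatibility query then becomes an optimization for failing early; the stream check above remains the authoritative gate against a malicious or incompetent user.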


end of thread, other threads:[~2019-04-01 14:23 UTC | newest]

Thread overview: 133+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-19  8:50 [PATCH 0/5] QEMU VFIO live migration Yan Zhao
2019-02-19  8:50 ` [Qemu-devel] " Yan Zhao
2019-02-19  8:52 ` [PATCH 1/5] vfio/migration: define kernel interfaces Yan Zhao
2019-02-19  8:52   ` [Qemu-devel] " Yan Zhao
2019-02-19 13:09   ` Cornelia Huck
2019-02-19 13:09     ` [Qemu-devel] " Cornelia Huck
2019-02-20  7:36     ` Zhao Yan
2019-02-20  7:36       ` [Qemu-devel] " Zhao Yan
2019-02-20 17:08       ` Cornelia Huck
2019-02-20 17:08         ` [Qemu-devel] " Cornelia Huck
2019-02-21  1:47         ` Zhao Yan
2019-02-21  1:47           ` [Qemu-devel] " Zhao Yan
2019-02-19  8:52 ` [PATCH 2/5] vfio/migration: support device of device config capability Yan Zhao
2019-02-19  8:52   ` [Qemu-devel] " Yan Zhao
2019-02-19 11:01   ` Dr. David Alan Gilbert
2019-02-19 11:01     ` [Qemu-devel] " Dr. David Alan Gilbert
2019-02-20  5:12     ` Zhao Yan
2019-02-20  5:12       ` [Qemu-devel] " Zhao Yan
2019-02-20 10:57       ` Dr. David Alan Gilbert
2019-02-20 10:57         ` [Qemu-devel] " Dr. David Alan Gilbert
2019-02-19 14:37   ` Cornelia Huck
2019-02-19 14:37     ` [Qemu-devel] " Cornelia Huck
2019-02-20 22:54     ` Zhao Yan
2019-02-20 22:54       ` [Qemu-devel] " Zhao Yan
2019-02-21 10:56       ` Cornelia Huck
2019-02-21 10:56         ` [Qemu-devel] " Cornelia Huck
2019-02-19  8:52 ` [PATCH 3/5] vfio/migration: tracking of dirty page in system memory Yan Zhao
2019-02-19  8:52   ` [Qemu-devel] " Yan Zhao
2019-02-19  8:52 ` [PATCH 4/5] vfio/migration: turn on migration Yan Zhao
2019-02-19  8:52   ` [Qemu-devel] " Yan Zhao
2019-02-19  8:53 ` [PATCH 5/5] vfio/migration: support device memory capability Yan Zhao
2019-02-19  8:53   ` [Qemu-devel] " Yan Zhao
2019-02-19 11:25   ` Dr. David Alan Gilbert
2019-02-19 11:25     ` [Qemu-devel] " Dr. David Alan Gilbert
2019-02-20  5:17     ` Zhao Yan
2019-02-20  5:17       ` [Qemu-devel] " Zhao Yan
2019-02-19 14:42   ` Christophe de Dinechin
2019-02-19 14:42     ` [Qemu-devel] " Christophe de Dinechin
2019-02-20  7:58     ` Zhao Yan
2019-02-20  7:58       ` [Qemu-devel] " Zhao Yan
2019-02-20 10:14       ` Christophe de Dinechin
2019-02-20 10:14         ` [Qemu-devel] " Christophe de Dinechin
2019-02-21  0:07         ` Zhao Yan
2019-02-21  0:07           ` [Qemu-devel] " Zhao Yan
2019-02-19 11:32 ` [PATCH 0/5] QEMU VFIO live migration Dr. David Alan Gilbert
2019-02-19 11:32   ` [Qemu-devel] " Dr. David Alan Gilbert
2019-02-20  5:28   ` Zhao Yan
2019-02-20  5:28     ` [Qemu-devel] " Zhao Yan
2019-02-20 11:01     ` Dr. David Alan Gilbert
2019-02-20 11:01       ` [Qemu-devel] " Dr. David Alan Gilbert
2019-02-20 11:28       ` Gonglei (Arei)
2019-02-20 11:28         ` [Qemu-devel] " Gonglei (Arei)
2019-02-20 11:42         ` Cornelia Huck
2019-02-20 11:42           ` [Qemu-devel] " Cornelia Huck
2019-02-20 12:07           ` Gonglei (Arei)
2019-02-20 12:07             ` [Qemu-devel] " Gonglei (Arei)
2019-03-27  6:35           ` Zhao Yan
2019-03-27 20:18             ` Dr. David Alan Gilbert
2019-03-27 22:10               ` Alex Williamson
2019-03-28  8:36                 ` Zhao Yan
2019-03-28  9:21                   ` Erik Skultety
2019-03-28 16:04                     ` Alex Williamson
2019-03-29  2:47                       ` Zhao Yan
2019-03-29 14:26                         ` Alex Williamson
2019-03-29 23:10                           ` Zhao Yan
2019-03-30 14:14                             ` Alex Williamson
2019-04-01  2:17                               ` Zhao Yan
2019-04-01  8:14                 ` Cornelia Huck
2019-04-01  8:14                   ` [Qemu-devel] " Cornelia Huck
2019-04-01  8:40                   ` Yan Zhao
2019-04-01  8:40                     ` [Qemu-devel] " Yan Zhao
2019-04-01 14:15                     ` Alex Williamson
2019-04-01 14:15                       ` [Qemu-devel] " Alex Williamson
2019-02-21  0:31       ` Zhao Yan
2019-02-21  0:31         ` [Qemu-devel] " Zhao Yan
2019-02-21  9:15         ` Dr. David Alan Gilbert
2019-02-21  9:15           ` [Qemu-devel] " Dr. David Alan Gilbert
2019-02-20 11:56 ` Gonglei (Arei)
2019-02-20 11:56   ` [Qemu-devel] " Gonglei (Arei)
2019-02-21  0:24   ` Zhao Yan
2019-02-21  0:24     ` [Qemu-devel] " Zhao Yan
2019-02-21  1:35     ` Gonglei (Arei)
2019-02-21  1:35       ` [Qemu-devel] " Gonglei (Arei)
2019-02-21  1:58       ` Zhao Yan
2019-02-21  1:58         ` [Qemu-devel] " Zhao Yan
2019-02-21  3:33         ` Gonglei (Arei)
2019-02-21  3:33           ` [Qemu-devel] " Gonglei (Arei)
2019-02-21  4:08           ` Zhao Yan
2019-02-21  4:08             ` [Qemu-devel] " Zhao Yan
2019-02-21  5:46             ` Gonglei (Arei)
2019-02-21  5:46               ` [Qemu-devel] " Gonglei (Arei)
2019-02-21  2:04       ` Zhao Yan
2019-02-21  2:04         ` [Qemu-devel] " Zhao Yan
2019-02-21  3:16         ` Gonglei (Arei)
2019-02-21  3:16           ` [Qemu-devel] " Gonglei (Arei)
2019-02-21  4:21           ` Zhao Yan
2019-02-21  4:21             ` [Qemu-devel] " Zhao Yan
2019-02-21  5:56             ` Gonglei (Arei)
2019-02-21  5:56               ` [Qemu-devel] " Gonglei (Arei)
2019-02-21 20:40 ` Alex Williamson
2019-02-21 20:40   ` [Qemu-devel] " Alex Williamson
2019-02-25  2:22   ` Zhao Yan
2019-02-25  2:22     ` [Qemu-devel] " Zhao Yan
2019-03-06  0:22     ` Zhao Yan
2019-03-06  0:22       ` [Qemu-devel] " Zhao Yan
2019-03-07 17:44     ` Alex Williamson
2019-03-07 17:44       ` [Qemu-devel] " Alex Williamson
2019-03-07 23:20       ` Tian, Kevin
2019-03-07 23:20         ` [Qemu-devel] " Tian, Kevin
2019-03-08 16:11         ` Alex Williamson
2019-03-08 16:11           ` [Qemu-devel] " Alex Williamson
2019-03-08 16:21           ` Dr. David Alan Gilbert
2019-03-08 16:21             ` [Qemu-devel] " Dr. David Alan Gilbert
2019-03-08 22:02             ` Alex Williamson
2019-03-08 22:02               ` [Qemu-devel] " Alex Williamson
2019-03-11  2:33               ` Tian, Kevin
2019-03-11  2:33                 ` [Qemu-devel] " Tian, Kevin
2019-03-11 20:19                 ` Alex Williamson
2019-03-11 20:19                   ` [Qemu-devel] " Alex Williamson
2019-03-12  2:48                   ` Tian, Kevin
2019-03-12  2:48                     ` [Qemu-devel] " Tian, Kevin
2019-03-13 19:57                     ` Alex Williamson
2019-03-12  2:57       ` Zhao Yan
2019-03-12  2:57         ` [Qemu-devel] " Zhao Yan
2019-03-13  1:13         ` Zhao Yan
2019-03-13 19:14           ` Alex Williamson
2019-03-14  1:12             ` Zhao Yan
2019-03-14 22:44               ` Alex Williamson
2019-03-14 23:05                 ` Zhao Yan
2019-03-15  2:24                   ` Alex Williamson
2019-03-18  2:51                     ` Zhao Yan
2019-03-18  3:09                       ` Alex Williamson
2019-03-18  3:27                         ` Zhao Yan
