* [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2
@ 2022-05-12 15:43 Avihai Horon
  2022-05-12 15:43 ` [PATCH 1/9] linux-headers: Update headers to v5.18-rc6 Avihai Horon
                   ` (9 more replies)
  0 siblings, 10 replies; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

Hello,

Following the acceptance of VFIO migration protocol v2 in the kernel,
this series implements VFIO migration according to the new v2 protocol
and replaces the now-deprecated v1 implementation.

The main differences between the v1 and v2 migration protocols are:
1. VFIO device state is represented as a finite state machine instead of
   a bitmap.

2. The migration interface with the kernel now uses the
   VFIO_DEVICE_FEATURE ioctl and plain read() and write() instead of
   the migration region used in v1.

3. Migration protocol v2 currently doesn't support the pre-copy phase of
   migration.

A full description of the v2 protocol and its differences from v1 can
be found here [1].
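
To make point 2 concrete, here is a minimal userspace sketch
(illustration only, with our own helper name and simplified error
handling; not code from this series) of probing v2 migration support
through the new feature ioctl from the updated linux/vfio.h:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Sketch: ask a VFIO device fd whether it supports migration v2. */
    static int vfio_probe_migration_v2(int device_fd)
    {
        uint64_t buf[(sizeof(struct vfio_device_feature) +
                      sizeof(struct vfio_device_feature_migration) + 7) / 8];
        struct vfio_device_feature *feature = (void *)buf;
        struct vfio_device_feature_migration *mig = (void *)feature->data;

        memset(buf, 0, sizeof(buf));
        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_GET |
                         VFIO_DEVICE_FEATURE_MIGRATION;

        if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
            return -1; /* v2 migration not supported */
        }

        /* A successful GET must advertise at least STOP_COPY. */
        return (mig->flags & VFIO_MIGRATION_STOP_COPY) ? 0 : -1;
    }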

Patches 1-5 are preparation patches that import the new uAPI headers,
fix bugs and add a QEMUFile function that will be used later.

Patches 6-9 are the main patches that replace the v1 implementation with
v2.

Thanks.

[1]
https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/

Avihai Horon (9):
  linux-headers: Update headers to v5.18-rc6
  vfio: Fix compilation errors caused by VFIO migration v1 deprecation
  vfio/migration: Fix NULL pointer dereference bug
  vfio/migration: Skip pre-copy if dirty page tracking is not supported
  migration/qemu-file: Add qemu_file_get_to_fd()
  vfio/migration: Implement VFIO migration protocol v2
  vfio/migration: Reset device if setting recover state fails
  vfio: Alphabetize migration section of VFIO trace-events file
  docs/devel: Align vfio-migration docs to VFIO migration v2

 docs/devel/vfio-migration.rst                 |  77 +--
 hw/vfio/common.c                              |  21 +-
 hw/vfio/migration.c                           | 640 ++++--------------
 hw/vfio/trace-events                          |  25 +-
 include/hw/vfio/vfio-common.h                 |   8 +-
 .../linux/input-event-codes.h                 |  25 +-
 .../standard-headers/linux/virtio_config.h    |   6 +
 .../standard-headers/linux/virtio_crypto.h    |  82 ++-
 linux-headers/asm-arm64/kvm.h                 |  16 +
 linux-headers/asm-generic/mman-common.h       |   2 +
 linux-headers/asm-mips/mman.h                 |   2 +
 linux-headers/linux/kvm.h                     |  27 +-
 linux-headers/linux/psci.h                    |   4 +
 linux-headers/linux/userfaultfd.h             |   8 +-
 linux-headers/linux/vfio.h                    | 406 ++++++-----
 linux-headers/linux/vhost.h                   |   7 +
 migration/migration.c                         |   5 +
 migration/migration.h                         |   3 +
 migration/qemu-file.c                         |  34 +
 migration/qemu-file.h                         |   1 +
 20 files changed, 618 insertions(+), 781 deletions(-)

-- 
2.21.3




* [PATCH 1/9] linux-headers: Update headers to v5.18-rc6
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
@ 2022-05-12 15:43 ` Avihai Horon
  2022-05-12 15:43 ` [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation Avihai Horon
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 .../linux/input-event-codes.h                 |  25 +-
 .../standard-headers/linux/virtio_config.h    |   6 +
 .../standard-headers/linux/virtio_crypto.h    |  82 +++-
 linux-headers/asm-arm64/kvm.h                 |  16 +
 linux-headers/asm-generic/mman-common.h       |   2 +
 linux-headers/asm-mips/mman.h                 |   2 +
 linux-headers/linux/kvm.h                     |  27 +-
 linux-headers/linux/psci.h                    |   4 +
 linux-headers/linux/userfaultfd.h             |   8 +-
 linux-headers/linux/vfio.h                    | 406 +++++++++---------
 linux-headers/linux/vhost.h                   |   7 +
 11 files changed, 365 insertions(+), 220 deletions(-)

diff --git a/include/standard-headers/linux/input-event-codes.h b/include/standard-headers/linux/input-event-codes.h
index b5e86b40ab..50790aee5a 100644
--- a/include/standard-headers/linux/input-event-codes.h
+++ b/include/standard-headers/linux/input-event-codes.h
@@ -278,7 +278,8 @@
 #define KEY_PAUSECD		201
 #define KEY_PROG3		202
 #define KEY_PROG4		203
-#define KEY_DASHBOARD		204	/* AL Dashboard */
+#define KEY_ALL_APPLICATIONS	204	/* AC Desktop Show All Applications */
+#define KEY_DASHBOARD		KEY_ALL_APPLICATIONS
 #define KEY_SUSPEND		205
 #define KEY_CLOSE		206	/* AC Close */
 #define KEY_PLAY		207
@@ -612,6 +613,7 @@
 #define KEY_ASSISTANT		0x247	/* AL Context-aware desktop assistant */
 #define KEY_KBD_LAYOUT_NEXT	0x248	/* AC Next Keyboard Layout Select */
 #define KEY_EMOJI_PICKER	0x249	/* Show/hide emoji picker (HUTRR101) */
+#define KEY_DICTATE		0x24a	/* Start or Stop Voice Dictation Session (HUTRR99) */
 
 #define KEY_BRIGHTNESS_MIN		0x250	/* Set Brightness to Minimum */
 #define KEY_BRIGHTNESS_MAX		0x251	/* Set Brightness to Maximum */
@@ -660,6 +662,27 @@
 /* Select an area of screen to be copied */
 #define KEY_SELECTIVE_SCREENSHOT	0x27a
 
+/* Move the focus to the next or previous user controllable element within a UI container */
+#define KEY_NEXT_ELEMENT               0x27b
+#define KEY_PREVIOUS_ELEMENT           0x27c
+
+/* Toggle Autopilot engagement */
+#define KEY_AUTOPILOT_ENGAGE_TOGGLE    0x27d
+
+/* Shortcut Keys */
+#define KEY_MARK_WAYPOINT              0x27e
+#define KEY_SOS                                0x27f
+#define KEY_NAV_CHART                  0x280
+#define KEY_FISHING_CHART              0x281
+#define KEY_SINGLE_RANGE_RADAR         0x282
+#define KEY_DUAL_RANGE_RADAR           0x283
+#define KEY_RADAR_OVERLAY              0x284
+#define KEY_TRADITIONAL_SONAR          0x285
+#define KEY_CLEARVU_SONAR              0x286
+#define KEY_SIDEVU_SONAR               0x287
+#define KEY_NAV_INFO                   0x288
+#define KEY_BRIGHTNESS_MENU            0x289
+
 /*
  * Some keyboards have keys which do not have a defined meaning, these keys
  * are intended to be programmed / bound to macros by the user. For most
diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index 22e3a85f67..7acd8d4abc 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -80,6 +80,12 @@
 /* This feature indicates support for the packed virtqueue layout. */
 #define VIRTIO_F_RING_PACKED		34
 
+/*
+ * Inorder feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER		35
+
 /*
  * This feature indicates that memory accesses by the driver and the
  * device are ordered in a way described by the platform.
diff --git a/include/standard-headers/linux/virtio_crypto.h b/include/standard-headers/linux/virtio_crypto.h
index 5ff0b4ee59..68066dafb6 100644
--- a/include/standard-headers/linux/virtio_crypto.h
+++ b/include/standard-headers/linux/virtio_crypto.h
@@ -37,6 +37,7 @@
 #define VIRTIO_CRYPTO_SERVICE_HASH   1
 #define VIRTIO_CRYPTO_SERVICE_MAC    2
 #define VIRTIO_CRYPTO_SERVICE_AEAD   3
+#define VIRTIO_CRYPTO_SERVICE_AKCIPHER 4
 
 #define VIRTIO_CRYPTO_OPCODE(service, op)   (((service) << 8) | (op))
 
@@ -57,6 +58,10 @@ struct virtio_crypto_ctrl_header {
 	   VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AEAD, 0x02)
 #define VIRTIO_CRYPTO_AEAD_DESTROY_SESSION \
 	   VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AEAD, 0x03)
+#define VIRTIO_CRYPTO_AKCIPHER_CREATE_SESSION \
+	   VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x04)
+#define VIRTIO_CRYPTO_AKCIPHER_DESTROY_SESSION \
+	   VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x05)
 	uint32_t opcode;
 	uint32_t algo;
 	uint32_t flag;
@@ -180,6 +185,58 @@ struct virtio_crypto_aead_create_session_req {
 	uint8_t padding[32];
 };
 
+struct virtio_crypto_rsa_session_para {
+#define VIRTIO_CRYPTO_RSA_RAW_PADDING   0
+#define VIRTIO_CRYPTO_RSA_PKCS1_PADDING 1
+	uint32_t padding_algo;
+
+#define VIRTIO_CRYPTO_RSA_NO_HASH   0
+#define VIRTIO_CRYPTO_RSA_MD2       1
+#define VIRTIO_CRYPTO_RSA_MD3       2
+#define VIRTIO_CRYPTO_RSA_MD4       3
+#define VIRTIO_CRYPTO_RSA_MD5       4
+#define VIRTIO_CRYPTO_RSA_SHA1      5
+#define VIRTIO_CRYPTO_RSA_SHA256    6
+#define VIRTIO_CRYPTO_RSA_SHA384    7
+#define VIRTIO_CRYPTO_RSA_SHA512    8
+#define VIRTIO_CRYPTO_RSA_SHA224    9
+	uint32_t hash_algo;
+};
+
+struct virtio_crypto_ecdsa_session_para {
+#define VIRTIO_CRYPTO_CURVE_UNKNOWN   0
+#define VIRTIO_CRYPTO_CURVE_NIST_P192 1
+#define VIRTIO_CRYPTO_CURVE_NIST_P224 2
+#define VIRTIO_CRYPTO_CURVE_NIST_P256 3
+#define VIRTIO_CRYPTO_CURVE_NIST_P384 4
+#define VIRTIO_CRYPTO_CURVE_NIST_P521 5
+	uint32_t curve_id;
+	uint32_t padding;
+};
+
+struct virtio_crypto_akcipher_session_para {
+#define VIRTIO_CRYPTO_NO_AKCIPHER    0
+#define VIRTIO_CRYPTO_AKCIPHER_RSA   1
+#define VIRTIO_CRYPTO_AKCIPHER_DSA   2
+#define VIRTIO_CRYPTO_AKCIPHER_ECDSA 3
+	uint32_t algo;
+
+#define VIRTIO_CRYPTO_AKCIPHER_KEY_TYPE_PUBLIC  1
+#define VIRTIO_CRYPTO_AKCIPHER_KEY_TYPE_PRIVATE 2
+	uint32_t keytype;
+	uint32_t keylen;
+
+	union {
+		struct virtio_crypto_rsa_session_para rsa;
+		struct virtio_crypto_ecdsa_session_para ecdsa;
+	} u;
+};
+
+struct virtio_crypto_akcipher_create_session_req {
+	struct virtio_crypto_akcipher_session_para para;
+	uint8_t padding[36];
+};
+
 struct virtio_crypto_alg_chain_session_para {
 #define VIRTIO_CRYPTO_SYM_ALG_CHAIN_ORDER_HASH_THEN_CIPHER  1
 #define VIRTIO_CRYPTO_SYM_ALG_CHAIN_ORDER_CIPHER_THEN_HASH  2
@@ -247,6 +304,8 @@ struct virtio_crypto_op_ctrl_req {
 			mac_create_session;
 		struct virtio_crypto_aead_create_session_req
 			aead_create_session;
+		struct virtio_crypto_akcipher_create_session_req
+			akcipher_create_session;
 		struct virtio_crypto_destroy_session_req
 			destroy_session;
 		uint8_t padding[56];
@@ -266,6 +325,14 @@ struct virtio_crypto_op_header {
 	VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AEAD, 0x00)
 #define VIRTIO_CRYPTO_AEAD_DECRYPT \
 	VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AEAD, 0x01)
+#define VIRTIO_CRYPTO_AKCIPHER_ENCRYPT \
+	VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x00)
+#define VIRTIO_CRYPTO_AKCIPHER_DECRYPT \
+	VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x01)
+#define VIRTIO_CRYPTO_AKCIPHER_SIGN \
+	VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x02)
+#define VIRTIO_CRYPTO_AKCIPHER_VERIFY \
+	VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x03)
 	uint32_t opcode;
 	/* algo should be service-specific algorithms */
 	uint32_t algo;
@@ -390,6 +457,16 @@ struct virtio_crypto_aead_data_req {
 	uint8_t padding[32];
 };
 
+struct virtio_crypto_akcipher_para {
+	uint32_t src_data_len;
+	uint32_t dst_data_len;
+};
+
+struct virtio_crypto_akcipher_data_req {
+	struct virtio_crypto_akcipher_para para;
+	uint8_t padding[40];
+};
+
 /* The request of the data virtqueue's packet */
 struct virtio_crypto_op_data_req {
 	struct virtio_crypto_op_header header;
@@ -399,6 +476,7 @@ struct virtio_crypto_op_data_req {
 		struct virtio_crypto_hash_data_req hash_req;
 		struct virtio_crypto_mac_data_req mac_req;
 		struct virtio_crypto_aead_data_req aead_req;
+		struct virtio_crypto_akcipher_data_req akcipher_req;
 		uint8_t padding[48];
 	} u;
 };
@@ -408,6 +486,8 @@ struct virtio_crypto_op_data_req {
 #define VIRTIO_CRYPTO_BADMSG    2
 #define VIRTIO_CRYPTO_NOTSUPP   3
 #define VIRTIO_CRYPTO_INVSESS   4 /* Invalid session id */
+#define VIRTIO_CRYPTO_NOSPC     5 /* no free session ID */
+#define VIRTIO_CRYPTO_KEY_REJECTED 6 /* Signature verification failed */
 
 /* The accelerator hardware is ready */
 #define VIRTIO_CRYPTO_S_HW_READY  (1 << 0)
@@ -438,7 +518,7 @@ struct virtio_crypto_config {
 	uint32_t max_cipher_key_len;
 	/* Maximum length of authenticated key */
 	uint32_t max_auth_key_len;
-	uint32_t reserve;
+	uint32_t akcipher_algo;
 	/* Maximum size of each crypto request's content */
 	uint64_t max_size;
 };
diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
index 3d2ce9912d..5c28a9737a 100644
--- a/linux-headers/asm-arm64/kvm.h
+++ b/linux-headers/asm-arm64/kvm.h
@@ -281,6 +281,11 @@ struct kvm_arm_copy_mte_tags {
 #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED	3
 #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED     	(1U << 4)
 
+#define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3	KVM_REG_ARM_FW_REG(3)
+#define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_NOT_AVAIL		0
+#define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_AVAIL		1
+#define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_3_NOT_REQUIRED	2
+
 /* SVE registers */
 #define KVM_REG_ARM64_SVE		(0x15 << KVM_REG_ARM_COPROC_SHIFT)
 
@@ -362,6 +367,7 @@ struct kvm_arm_copy_mte_tags {
 #define   KVM_ARM_VCPU_PMU_V3_IRQ	0
 #define   KVM_ARM_VCPU_PMU_V3_INIT	1
 #define   KVM_ARM_VCPU_PMU_V3_FILTER	2
+#define   KVM_ARM_VCPU_PMU_V3_SET_PMU	3
 #define KVM_ARM_VCPU_TIMER_CTRL		1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
@@ -411,6 +417,16 @@ struct kvm_arm_copy_mte_tags {
 #define KVM_PSCI_RET_INVAL		PSCI_RET_INVALID_PARAMS
 #define KVM_PSCI_RET_DENIED		PSCI_RET_DENIED
 
+/* arm64-specific kvm_run::system_event flags */
+/*
+ * Reset caused by a PSCI v1.1 SYSTEM_RESET2 call.
+ * Valid only when the system event has a type of KVM_SYSTEM_EVENT_RESET.
+ */
+#define KVM_SYSTEM_EVENT_RESET_FLAG_PSCI_RESET2	(1ULL << 0)
+
+/* run->fail_entry.hardware_entry_failure_reason codes. */
+#define KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED	(1ULL << 0)
+
 #endif
 
 #endif /* __ARM_KVM_H__ */
diff --git a/linux-headers/asm-generic/mman-common.h b/linux-headers/asm-generic/mman-common.h
index 1567a3294c..6c1aa92a92 100644
--- a/linux-headers/asm-generic/mman-common.h
+++ b/linux-headers/asm-generic/mman-common.h
@@ -75,6 +75,8 @@
 #define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
 #define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
 
+#define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/linux-headers/asm-mips/mman.h b/linux-headers/asm-mips/mman.h
index 40b210c65a..1be428663c 100644
--- a/linux-headers/asm-mips/mman.h
+++ b/linux-headers/asm-mips/mman.h
@@ -101,6 +101,8 @@
 #define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
 #define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
 
+#define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index d232feaae9..0d05d02ee4 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -445,7 +445,11 @@ struct kvm_run {
 #define KVM_SYSTEM_EVENT_RESET          2
 #define KVM_SYSTEM_EVENT_CRASH          3
 			__u32 type;
-			__u64 flags;
+			__u32 ndata;
+			union {
+				__u64 flags;
+				__u64 data[16];
+			};
 		} system_event;
 		/* KVM_EXIT_S390_STSI */
 		struct {
@@ -562,9 +566,12 @@ struct kvm_s390_mem_op {
 	__u32 op;		/* type of operation */
 	__u64 buf;		/* buffer in userspace */
 	union {
-		__u8 ar;	/* the access register number */
+		struct {
+			__u8 ar;	/* the access register number */
+			__u8 key;	/* access key, ignored if flag unset */
+		};
 		__u32 sida_offset; /* offset into the sida */
-		__u8 reserved[32]; /* should be set to 0 */
+		__u8 reserved[32]; /* ignored */
 	};
 };
 /* types for kvm_s390_mem_op->op */
@@ -572,9 +579,12 @@ struct kvm_s390_mem_op {
 #define KVM_S390_MEMOP_LOGICAL_WRITE	1
 #define KVM_S390_MEMOP_SIDA_READ	2
 #define KVM_S390_MEMOP_SIDA_WRITE	3
+#define KVM_S390_MEMOP_ABSOLUTE_READ	4
+#define KVM_S390_MEMOP_ABSOLUTE_WRITE	5
 /* flags for kvm_s390_mem_op->flags */
 #define KVM_S390_MEMOP_F_CHECK_ONLY		(1ULL << 0)
 #define KVM_S390_MEMOP_F_INJECT_EXCEPTION	(1ULL << 1)
+#define KVM_S390_MEMOP_F_SKEY_PROTECTION	(1ULL << 2)
 
 /* for KVM_INTERRUPT */
 struct kvm_interrupt {
@@ -1134,6 +1144,12 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_VM_GPA_BITS 207
 #define KVM_CAP_XSAVE2 208
 #define KVM_CAP_SYS_ATTRIBUTES 209
+#define KVM_CAP_PPC_AIL_MODE_3 210
+#define KVM_CAP_S390_MEM_OP_EXTENSION 211
+#define KVM_CAP_PMU_CAPABILITY 212
+#define KVM_CAP_DISABLE_QUIRKS2 213
+/* #define KVM_CAP_VM_TSC_CONTROL 214 */
+#define KVM_CAP_SYSTEM_EVENT_DATA 215
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1624,9 +1640,6 @@ struct kvm_enc_region {
 #define KVM_S390_NORMAL_RESET	_IO(KVMIO,   0xc3)
 #define KVM_S390_CLEAR_RESET	_IO(KVMIO,   0xc4)
 
-/* Available with KVM_CAP_XSAVE2 */
-#define KVM_GET_XSAVE2		  _IOR(KVMIO,  0xcf, struct kvm_xsave)
-
 struct kvm_s390_pv_sec_parm {
 	__u64 origin;
 	__u64 length;
@@ -1973,6 +1986,8 @@ struct kvm_dirty_gfn {
 #define KVM_BUS_LOCK_DETECTION_OFF             (1 << 0)
 #define KVM_BUS_LOCK_DETECTION_EXIT            (1 << 1)
 
+#define KVM_PMU_CAP_DISABLE                    (1 << 0)
+
 /**
  * struct kvm_stats_header - Header of per vm/vcpu binary statistics data.
  * @flags: Some extra information for header, always 0 for now.
diff --git a/linux-headers/linux/psci.h b/linux-headers/linux/psci.h
index a6772d508b..213b2a0f70 100644
--- a/linux-headers/linux/psci.h
+++ b/linux-headers/linux/psci.h
@@ -82,6 +82,10 @@
 #define PSCI_0_2_TOS_UP_NO_MIGRATE		1
 #define PSCI_0_2_TOS_MP				2
 
+/* PSCI v1.1 reset type encoding for SYSTEM_RESET2 */
+#define PSCI_1_1_RESET_TYPE_SYSTEM_WARM_RESET	0
+#define PSCI_1_1_RESET_TYPE_VENDOR_START	0x80000000U
+
 /* PSCI version decoding (independent of PSCI version) */
 #define PSCI_VERSION_MAJOR_SHIFT		16
 #define PSCI_VERSION_MINOR_MASK			\
diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h
index 8479af5f4c..769b8379e4 100644
--- a/linux-headers/linux/userfaultfd.h
+++ b/linux-headers/linux/userfaultfd.h
@@ -32,7 +32,8 @@
 			   UFFD_FEATURE_SIGBUS |		\
 			   UFFD_FEATURE_THREAD_ID |		\
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
-			   UFFD_FEATURE_MINOR_SHMEM)
+			   UFFD_FEATURE_MINOR_SHMEM |		\
+			   UFFD_FEATURE_EXACT_ADDRESS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -189,6 +190,10 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_MINOR_SHMEM indicates the same support as
 	 * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
+	 *
+	 * UFFD_FEATURE_EXACT_ADDRESS indicates that the exact address of page
+	 * faults would be provided and the offset within the page would not be
+	 * masked.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -201,6 +206,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_THREAD_ID			(1<<8)
 #define UFFD_FEATURE_MINOR_HUGETLBFS		(1<<9)
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
+#define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index e680594f27..e9f7795c39 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -323,7 +323,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
-#define VFIO_REGION_TYPE_MIGRATION              (3)
+#define VFIO_REGION_TYPE_MIGRATION_DEPRECATED   (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -405,225 +405,29 @@ struct vfio_region_gfx_edid {
 #define VFIO_REGION_SUBTYPE_CCW_CRW		(3)
 
 /* sub-types for VFIO_REGION_TYPE_MIGRATION */
-#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
-
-/*
- * The structure vfio_device_migration_info is placed at the 0th offset of
- * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
- * migration information. Field accesses from this structure are only supported
- * at their native width and alignment. Otherwise, the result is undefined and
- * vendor drivers should return an error.
- *
- * device_state: (read/write)
- *      - The user application writes to this field to inform the vendor driver
- *        about the device state to be transitioned to.
- *      - The vendor driver should take the necessary actions to change the
- *        device state. After successful transition to a given state, the
- *        vendor driver should return success on write(device_state, state)
- *        system call. If the device state transition fails, the vendor driver
- *        should return an appropriate -errno for the fault condition.
- *      - On the user application side, if the device state transition fails,
- *	  that is, if write(device_state, state) returns an error, read
- *	  device_state again to determine the current state of the device from
- *	  the vendor driver.
- *      - The vendor driver should return previous state of the device unless
- *        the vendor driver has encountered an internal error, in which case
- *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
- *      - The user application must use the device reset ioctl to recover the
- *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
- *        indicated to be in a valid device state by reading device_state, the
- *        user application may attempt to transition the device to any valid
- *        state reachable from the current state or terminate itself.
- *
- *      device_state consists of 3 bits:
- *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
- *        it indicates the _STOP state. When the device state is changed to
- *        _STOP, driver should stop the device before write() returns.
- *      - If bit 1 is set, it indicates the _SAVING state, which means that the
- *        driver should start gathering device state information that will be
- *        provided to the VFIO user application to save the device's state.
- *      - If bit 2 is set, it indicates the _RESUMING state, which means that
- *        the driver should prepare to resume the device. Data provided through
- *        the migration region should be used to resume the device.
- *      Bits 3 - 31 are reserved for future use. To preserve them, the user
- *      application should perform a read-modify-write operation on this
- *      field when modifying the specified bits.
- *
- *  +------- _RESUMING
- *  |+------ _SAVING
- *  ||+----- _RUNNING
- *  |||
- *  000b => Device Stopped, not saving or resuming
- *  001b => Device running, which is the default state
- *  010b => Stop the device & save the device state, stop-and-copy state
- *  011b => Device running and save the device state, pre-copy state
- *  100b => Device stopped and the device state is resuming
- *  101b => Invalid state
- *  110b => Error state
- *  111b => Invalid state
- *
- * State transitions:
- *
- *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
- *                (100b)     (001b)     (011b)        (010b)       (000b)
- * 0. Running or default state
- *                             |
- *
- * 1. Normal Shutdown (optional)
- *                             |------------------------------------->|
- *
- * 2. Save the state or suspend
- *                             |------------------------->|---------->|
- *
- * 3. Save the state during live migration
- *                             |----------->|------------>|---------->|
- *
- * 4. Resuming
- *                  |<---------|
- *
- * 5. Resumed
- *                  |--------->|
- *
- * 0. Default state of VFIO device is _RUNNING when the user application starts.
- * 1. During normal shutdown of the user application, the user application may
- *    optionally change the VFIO device state from _RUNNING to _STOP. This
- *    transition is optional. The vendor driver must support this transition but
- *    must not require it.
- * 2. When the user application saves state or suspends the application, the
- *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
- *    On state transition from _RUNNING to stop-and-copy, driver must stop the
- *    device, save the device state and send it to the application through the
- *    migration region. The sequence to be followed for such transition is given
- *    below.
- * 3. In live migration of user application, the state transitions from _RUNNING
- *    to pre-copy, to stop-and-copy, and to _STOP.
- *    On state transition from _RUNNING to pre-copy, the driver should start
- *    gathering the device state while the application is still running and send
- *    the device state data to application through the migration region.
- *    On state transition from pre-copy to stop-and-copy, the driver must stop
- *    the device, save the device state and send it to the user application
- *    through the migration region.
- *    Vendor drivers must support the pre-copy state even for implementations
- *    where no data is provided to the user before the stop-and-copy state. The
- *    user must not be required to consume all migration data before the device
- *    transitions to a new state, including the stop-and-copy state.
- *    The sequence to be followed for above two transitions is given below.
- * 4. To start the resuming phase, the device state should be transitioned from
- *    the _RUNNING to the _RESUMING state.
- *    In the _RESUMING state, the driver should use the device state data
- *    received through the migration region to resume the device.
- * 5. After providing saved device data to the driver, the application should
- *    change the state from _RESUMING to _RUNNING.
- *
- * reserved:
- *      Reads on this field return zero and writes are ignored.
- *
- * pending_bytes: (read only)
- *      The number of pending bytes still to be migrated from the vendor driver.
- *
- * data_offset: (read only)
- *      The user application should read data_offset field from the migration
- *      region. The user application should read the device data from this
- *      offset within the migration region during the _SAVING state or write
- *      the device data during the _RESUMING state. See below for details of
- *      sequence to be followed.
- *
- * data_size: (read/write)
- *      The user application should read data_size to get the size in bytes of
- *      the data copied in the migration region during the _SAVING state and
- *      write the size in bytes of the data copied in the migration region
- *      during the _RESUMING state.
- *
- * The format of the migration region is as follows:
- *  ------------------------------------------------------------------
- * |vfio_device_migration_info|    data section                      |
- * |                          |     ///////////////////////////////  |
- * ------------------------------------------------------------------
- *   ^                              ^
- *  offset 0-trapped part        data_offset
- *
- * The structure vfio_device_migration_info is always followed by the data
- * section in the region, so data_offset will always be nonzero. The offset
- * from where the data is copied is decided by the kernel driver. The data
- * section can be trapped, mmapped, or partitioned, depending on how the kernel
- * driver defines the data section. The data section partition can be defined
- * as mapped by the sparse mmap capability. If mmapped, data_offset must be
- * page aligned, whereas initial section which contains the
- * vfio_device_migration_info structure, might not end at the offset, which is
- * page aligned. The user is not required to access through mmap regardless
- * of the capabilities of the region mmap.
- * The vendor driver should determine whether and how to partition the data
- * section. The vendor driver should return data_offset accordingly.
- *
- * The sequence to be followed while in pre-copy state and stop-and-copy state
- * is as follows:
- * a. Read pending_bytes, indicating the start of a new iteration to get device
- *    data. Repeated read on pending_bytes at this stage should have no side
- *    effects.
- *    If pending_bytes == 0, the user application should not iterate to get data
- *    for that device.
- *    If pending_bytes > 0, perform the following steps.
- * b. Read data_offset, indicating that the vendor driver should make data
- *    available through the data section. The vendor driver should return this
- *    read operation only after data is available from (region + data_offset)
- *    to (region + data_offset + data_size).
- * c. Read data_size, which is the amount of data in bytes available through
- *    the migration region.
- *    Read on data_offset and data_size should return the offset and size of
- *    the current buffer if the user application reads data_offset and
- *    data_size more than once here.
- * d. Read data_size bytes of data from (region + data_offset) from the
- *    migration region.
- * e. Process the data.
- * f. Read pending_bytes, which indicates that the data from the previous
- *    iteration has been read. If pending_bytes > 0, go to step b.
- *
- * The user application can transition from the _SAVING|_RUNNING
- * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
- * number of pending bytes. The user application should iterate in _SAVING
- * (stop-and-copy) until pending_bytes is 0.
- *
- * The sequence to be followed while _RESUMING device state is as follows:
- * While data for this device is available, repeat the following steps:
- * a. Read data_offset from where the user application should write data.
- * b. Write migration data starting at the migration region + data_offset for
- *    the length determined by data_size from the migration source.
- * c. Write data_size, which indicates to the vendor driver that data is
- *    written in the migration region. Vendor driver must return this write
- *    operations on consuming data. Vendor driver should apply the
- *    user-provided migration region data to the device resume state.
- *
- * If an error occurs during the above sequences, the vendor driver can return
- * an error code for next read() or write() operation, which will terminate the
- * loop. The user application should then take the next necessary action, for
- * example, failing migration or terminating the user application.
- *
- * For the user application, data is opaque. The user application should write
- * data in the same order as the data is received and the data should be of
- * same transaction size at the source.
- */
+#define VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED (1)
 
 struct vfio_device_migration_info {
 	__u32 device_state;         /* VFIO device state */
-#define VFIO_DEVICE_STATE_STOP      (0)
-#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
-#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
-#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
-#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
-				     VFIO_DEVICE_STATE_SAVING |  \
-				     VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_V1_STOP      (0)
+#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING | \
+				     VFIO_DEVICE_STATE_V1_SAVING |  \
+				     VFIO_DEVICE_STATE_V1_RESUMING)
 
 #define VFIO_DEVICE_STATE_VALID(state) \
-	(state & VFIO_DEVICE_STATE_RESUMING ? \
-	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+	(state & VFIO_DEVICE_STATE_V1_RESUMING ? \
+	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_V1_RESUMING : 1)
 
 #define VFIO_DEVICE_STATE_IS_ERROR(state) \
-	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
-					      VFIO_DEVICE_STATE_RESUMING))
+	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_V1_SAVING | \
+					      VFIO_DEVICE_STATE_V1_RESUMING))
 
 #define VFIO_DEVICE_STATE_SET_ERROR(state) \
-	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
-					     VFIO_DEVICE_STATE_RESUMING)
+	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_V1_SAVING | \
+					     VFIO_DEVICE_STATE_V1_RESUMING)
 
 	__u32 reserved;
 	__u64 pending_bytes;
@@ -1002,6 +806,186 @@ struct vfio_device_feature {
  */
 #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
 
+/*
+ * Indicates the device can support the migration API through
+ * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If this GET succeeds, the RUNNING and
+ * ERROR states are always supported. Support for additional states is
+ * indicated via the flags field; at least VFIO_MIGRATION_STOP_COPY must be
+ * set.
+ *
+ * VFIO_MIGRATION_STOP_COPY means that STOP, STOP_COPY and
+ * RESUMING are supported.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
+ * is supported in addition to the STOP_COPY states.
+ *
+ * Other combinations of flags have behavior to be defined in the future.
+ */
+struct vfio_device_feature_migration {
+	__aligned_u64 flags;
+#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+#define VFIO_MIGRATION_P2P		(1 << 1)
+};
+#define VFIO_DEVICE_FEATURE_MIGRATION 1
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, execute a migration state change on the VFIO
+ * device. The new state is supplied in device_state, see enum
+ * vfio_device_mig_state for details
+ *
+ * The kernel migration driver must fully transition the device to the new state
+ * value before the operation returns to the user.
+ *
+ * The kernel migration driver must not generate asynchronous device state
+ * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET
+ * ioctl as described above.
+ *
+ * If this function fails then current device_state may be the original
+ * operating state or some other state along the combination transition path.
+ * The user can then decide if it should execute a VFIO_DEVICE_RESET, attempt
+ * to return to the original state, or attempt to return to some other state
+ * such as RUNNING or STOP.
+ *
+ * If the new_state starts a new data transfer session then the FD associated
+ * with that session is returned in data_fd. The user is responsible to close
+ * this FD when it is finished. The user must consider the migration data stream
+ * carried over the FD to be opaque and must preserve the byte order of the
+ * stream. The user is not required to preserve buffer segmentation when writing
+ * the data stream during the RESUMING operation.
+ *
+ * Upon VFIO_DEVICE_FEATURE_GET, get the current migration state of the VFIO
+ * device, data_fd will be -1.
+ */
+struct vfio_device_feature_mig_state {
+	__u32 device_state; /* From enum vfio_device_mig_state */
+	__s32 data_fd;
+};
+#define VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE 2
+
+/*
+ * The device migration Finite State Machine is described by the enum
+ * vfio_device_mig_state. Some of the FSM arcs will create a migration data
+ * transfer session by returning a FD, in this case the migration data will
+ * flow over the FD using read() and write() as discussed below.
+ *
+ * There are 5 states to support VFIO_MIGRATION_STOP_COPY:
+ *  RUNNING - The device is running normally
+ *  STOP - The device does not change the internal or external state
+ *  STOP_COPY - The device internal state can be read out
+ *  RESUMING - The device is stopped and is loading a new internal state
+ *  ERROR - The device has failed and must be reset
+ *
+ * And 1 optional state to support VFIO_MIGRATION_P2P:
+ *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ *
+ * The FSM takes actions on the arcs between FSM states. The driver implements
+ * the following behavior for the FSM arcs:
+ *
+ * RUNNING_P2P -> STOP
+ * STOP_COPY -> STOP
+ *   While in STOP the device must stop the operation of the device. The device
+ *   must not generate interrupts, DMA, or any other change to external state.
+ *   It must not change its internal state. When stopped the device and kernel
+ *   migration driver must accept and respond to interaction to support external
+ *   subsystems in the STOP state, for example PCI MSI-X and PCI config space.
+ *   Failure by the user to restrict device access while in STOP must not result
+ *   in error conditions outside the user context (ex. host system faults).
+ *
+ *   The STOP_COPY arc will terminate a data transfer session.
+ *
+ * RESUMING -> STOP
+ *   Leaving RESUMING terminates a data transfer session and indicates the
+ *   device should complete processing of the data delivered by write(). The
+ *   kernel migration driver should complete the incorporation of data written
+ *   to the data transfer FD into the device internal state and perform
+ *   final validity and consistency checking of the new device state. If the
+ *   user provided data is found to be incomplete, inconsistent, or otherwise
+ *   invalid, the migration driver must fail the SET_STATE ioctl and
+ *   optionally go to the ERROR state as described below.
+ *
+ *   While in STOP the device has the same behavior as other STOP states
+ *   described above.
+ *
+ *   To abort a RESUMING session the device must be reset.
+ *
+ * RUNNING_P2P -> RUNNING
+ *   While in RUNNING the device is fully operational, the device may generate
+ *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
+ *   and the device may advance its internal state.
+ *
+ * RUNNING -> RUNNING_P2P
+ * STOP -> RUNNING_P2P
+ *   While in RUNNING_P2P the device is partially running in the P2P quiescent
+ *   state defined below.
+ *
+ * STOP -> STOP_COPY
+ *   This arc begin the process of saving the device state and will return a
+ *   new data_fd.
+ *
+ *   While in the STOP_COPY state the device has the same behavior as STOP
+ *   with the addition that the data transfers session continues to stream the
+ *   migration state. End of stream on the FD indicates the entire device
+ *   state has been transferred.
+ *
+ *   The user should take steps to restrict access to vfio device regions while
+ *   the device is in STOP_COPY or risk corruption of the device migration data
+ *   stream.
+ *
+ * STOP -> RESUMING
+ *   Entering the RESUMING state starts a process of restoring the device state
+ *   and will return a new data_fd. The data stream fed into the data_fd should
+ *   be taken from the data transfer output of a single FD during saving from
+ *   a compatible device. The migration driver may alter/reset the internal
+ *   device state for this arc if required to prepare the device to receive the
+ *   migration data.
+ *
+ * any -> ERROR
+ *   ERROR cannot be specified as a device state, however any transition request
+ *   can be failed with an errno return and may then move the device_state into
+ *   ERROR. In this case the device was unable to execute the requested arc and
+ *   was also unable to restore the device to any valid device_state.
+ *   To recover from ERROR VFIO_DEVICE_RESET must be used to return the
+ *   device_state back to RUNNING.
+ *
+ * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
+ * state for the device for the purposes of managing multiple devices within a
+ * user context where peer-to-peer DMA between devices may be active. The
+ * RUNNING_P2P states must prevent the device from initiating
+ * any new P2P DMA transactions. If the device can identify P2P transactions
+ * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
+ * driver must complete any such outstanding operations prior to completing the
+ * FSM arc into a P2P state. For the purpose of specification the states
+ * behave as though the device was fully running if not supported. Like while in
+ * STOP or STOP_COPY the user must not touch the device, otherwise the state
+ * can be exited.
+ *
+ * The remaining possible transitions are interpreted as combinations of the
+ * above FSM arcs. As there are multiple paths through the FSM arcs the path
+ * should be selected based on the following rules:
+ *   - Select the shortest path.
+ * Refer to vfio_mig_get_next_state() for the result of the algorithm.
+ *
+ * The automatic transit through the FSM arcs that make up the combination
+ * transition is invisible to the user. When working with combination arcs the
+ * user may see any step along the path in the device_state if SET_STATE
+ * fails. When handling these types of errors users should anticipate future
+ * revisions of this protocol using new states and those states becoming
+ * visible in this case.
+ *
+ * The optional states cannot be used with SET_STATE if the device does not
+ * support them. The user can discover if these states are supported by using
+ * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
+ * avoid knowing about these optional states if the kernel driver supports them.
+ */
+enum vfio_device_mig_state {
+	VFIO_DEVICE_STATE_ERROR = 0,
+	VFIO_DEVICE_STATE_STOP = 1,
+	VFIO_DEVICE_STATE_RUNNING = 2,
+	VFIO_DEVICE_STATE_STOP_COPY = 3,
+	VFIO_DEVICE_STATE_RESUMING = 4,
+	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
diff --git a/linux-headers/linux/vhost.h b/linux-headers/linux/vhost.h
index c998860d7b..5d99e7c242 100644
--- a/linux-headers/linux/vhost.h
+++ b/linux-headers/linux/vhost.h
@@ -150,4 +150,11 @@
 /* Get the valid iova range */
 #define VHOST_VDPA_GET_IOVA_RANGE	_IOR(VHOST_VIRTIO, 0x78, \
 					     struct vhost_vdpa_iova_range)
+
+/* Get the config size */
+#define VHOST_VDPA_GET_CONFIG_SIZE	_IOR(VHOST_VIRTIO, 0x79, __u32)
+
+/* Get the count of all virtqueues */
+#define VHOST_VDPA_GET_VQS_COUNT	_IOR(VHOST_VIRTIO, 0x80, __u32)
+
 #endif
-- 
2.21.3




* [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
  2022-05-12 15:43 ` [PATCH 1/9] linux-headers: Update headers to v5.18-rc6 Avihai Horon
@ 2022-05-12 15:43 ` Avihai Horon
  2022-05-12 17:57   ` Alex Williamson
  2022-05-12 15:43 ` [PATCH 3/9] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

VFIO migration protocol v1 was deprecated, and as part of that some of
its uAPI definitions were renamed. This caused compilation errors in
QEMU. Fix them.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c    |  6 +++---
 hw/vfio/migration.c | 29 ++++++++++++++++-------------
 2 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 159f910421..29982c7af8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -355,7 +355,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
             }
 
             if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
-                && (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                && (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
                 return false;
             }
         }
@@ -381,8 +381,8 @@ static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
                 return false;
             }
 
-            if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
-                (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+            if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) &&
+                (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ff6b45de6b..835608cd23 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -432,7 +432,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
     }
 
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
-                                   VFIO_DEVICE_STATE_SAVING);
+                                   VFIO_DEVICE_STATE_V1_SAVING);
     if (ret) {
         error_report("%s: Failed to set state SAVING", vbasedev->name);
         return ret;
@@ -531,8 +531,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     uint64_t data_size;
     int ret;
 
-    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
-                                   VFIO_DEVICE_STATE_SAVING);
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
+                                   VFIO_DEVICE_STATE_V1_SAVING);
     if (ret) {
         error_report("%s: Failed to set state STOP and SAVING",
                      vbasedev->name);
@@ -569,7 +569,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
         return ret;
     }
 
-    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING, 0);
     if (ret) {
         error_report("%s: Failed to set state STOPPED", vbasedev->name);
         return ret;
@@ -609,7 +609,7 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
     }
 
     ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
-                                   VFIO_DEVICE_STATE_RESUMING);
+                                   VFIO_DEVICE_STATE_V1_RESUMING);
     if (ret) {
         error_report("%s: Failed to set state RESUMING", vbasedev->name);
         if (migration->region.mmaps) {
@@ -717,20 +717,20 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
          * In both the above cases, set _RUNNING bit.
          */
         mask = ~VFIO_DEVICE_STATE_MASK;
-        value = VFIO_DEVICE_STATE_RUNNING;
+        value = VFIO_DEVICE_STATE_V1_RUNNING;
     } else {
         /*
          * Here device state could be either _RUNNING or _SAVING|_RUNNING. Reset
          * _RUNNING bit
          */
-        mask = ~VFIO_DEVICE_STATE_RUNNING;
+        mask = ~VFIO_DEVICE_STATE_V1_RUNNING;
 
         /*
          * When VM state transition to stop for savevm command, device should
          * start saving data.
          */
         if (state == RUN_STATE_SAVE_VM) {
-            value = VFIO_DEVICE_STATE_SAVING;
+            value = VFIO_DEVICE_STATE_V1_SAVING;
         } else {
             value = 0;
         }
@@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_FAILED:
         bytes_transferred = 0;
-        ret = vfio_migration_set_state(vbasedev,
-                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
-                      VFIO_DEVICE_STATE_RUNNING);
+        ret = vfio_migration_set_state(
+            vbasedev,
+            ~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
+            VFIO_DEVICE_STATE_V1_RUNNING);
         if (ret) {
             error_report("%s: Failed to set state RUNNING", vbasedev->name);
         }
@@ -864,8 +865,10 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
         goto add_blocker;
     }
 
-    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
-                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    ret = vfio_get_dev_region_info(vbasedev,
+                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
+                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
+                                   &info);
     if (ret) {
         goto add_blocker;
     }
-- 
2.21.3




* [PATCH 3/9] vfio/migration: Fix NULL pointer dereference bug
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
  2022-05-12 15:43 ` [PATCH 1/9] linux-headers: Update headers to v5.18-rc6 Avihai Horon
  2022-05-12 15:43 ` [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation Avihai Horon
@ 2022-05-12 15:43 ` Avihai Horon
  2022-05-16 11:15   ` Juan Quintela
  2022-05-12 15:43 ` [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported Avihai Horon
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

As part of its error flow, vfio_vmstate_change() accesses
MigrationState->to_dst_file without any checks. This can cause a NULL
pointer dereference if the error flow is taken and
MigrationState->to_dst_file is not set.

For example, this can happen if the VM is started or stopped outside of
migration and the vfio_vmstate_change() error flow is taken, as
MigrationState->to_dst_file is not set at that time.

Fix it by checking that MigrationState->to_dst_file is set before using
it.

Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 835608cd23..21e8f9d4d4 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -744,7 +744,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
          */
         error_report("%s: Failed to set device state 0x%x", vbasedev->name,
                      (migration->device_state & mask) | value);
-        qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+        if (migrate_get_current()->to_dst_file) {
+            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+        }
     }
     vbasedev->migration->vm_running = running;
     trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
-- 
2.21.3




* [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (2 preceding siblings ...)
  2022-05-12 15:43 ` [PATCH 3/9] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
@ 2022-05-12 15:43 ` Avihai Horon
  2022-05-16 11:22   ` Juan Quintela
  2022-05-20 10:58   ` Joao Martins
  2022-05-12 15:43 ` [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd() Avihai Horon
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

Currently, if the IOMMU of a VFIO container doesn't support dirty page
tracking, migration is blocked completely. This is because a DMA-capable
VFIO device can dirty RAM pages without QEMU knowing about it, thus
breaking the migration.

However, this doesn't mean that migration can't be done at all. If the
migration pre-copy phase is skipped, the VFIO device doesn't get a
chance to dirty RAM pages that have already been migrated, which
eliminates the problem described above.

Hence, in such a case, allow migration but skip the pre-copy phase.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c   | 9 ++++++++-
 migration/migration.c | 5 +++++
 migration/migration.h | 3 +++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 21e8f9d4d4..d4b6653026 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
     struct vfio_region_info *info = NULL;
     int ret = -ENOTSUP;
 
-    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
+    if (!vbasedev->enable_migration) {
         goto add_blocker;
     }
 
+    if (!container->dirty_pages_supported) {
+        warn_report(
+            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
+            vbasedev->name);
+        migrate_get_current()->skip_precopy = true;
+    }
+
     ret = vfio_get_dev_region_info(vbasedev,
                                    VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
                                    VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
diff --git a/migration/migration.c b/migration/migration.c
index 5a31b23bd6..668343508d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3593,6 +3593,11 @@ static MigIterateState migration_iteration_run(MigrationState *s)
     uint64_t pending_size, pend_pre, pend_compat, pend_post;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
 
+    if (s->skip_precopy) {
+        migration_completion(s);
+        return MIG_ITERATE_BREAK;
+    }
+
     qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
                               &pend_compat, &pend_post);
     pending_size = pend_pre + pend_compat + pend_post;
diff --git a/migration/migration.h b/migration/migration.h
index a863032b71..876713e7e1 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -332,6 +332,9 @@ struct MigrationState {
      * This save hostname when out-going migration starts
      */
     char *hostname;
+
+    /* Whether to skip pre-copy phase of migration or not */
+    bool skip_precopy;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
-- 
2.21.3




* [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd()
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (3 preceding siblings ...)
  2022-05-12 15:43 ` [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported Avihai Horon
@ 2022-05-12 15:43 ` Avihai Horon
  2022-05-16 11:31   ` Juan Quintela
  2022-05-12 15:43 ` [PATCH 6/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

Add a new function, qemu_file_get_to_fd(), that reads data from a
QEMUFile and writes it straight into a given fd.

This will be used later in VFIO migration code.
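
A hypothetical caller (our own sketch, not code from this series;
assumes QEMU's "migration/qemu-file.h") could use it to stream a
size-prefixed chunk of device state from the migration stream straight
into an fd such as a VFIO v2 data_fd:

    /*
     * Sketch: read a chunk size from the stream, then funnel that many
     * bytes from the QEMUFile directly into data_fd without bouncing
     * through an intermediate userspace buffer.
     */
    static int load_state_chunk(QEMUFile *f, int data_fd)
    {
        uint64_t data_size = qemu_get_be64(f);

        if (!data_size) {
            return 0; /* end-of-state marker */
        }

        return qemu_file_get_to_fd(f, data_fd, data_size);
    }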

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/qemu-file.c | 34 ++++++++++++++++++++++++++++++++++
 migration/qemu-file.h |  1 +
 2 files changed, 35 insertions(+)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 1479cddad9..cad3d32eb3 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -867,3 +867,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
 {
     return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
 }
+
+/*
+ * Read size bytes from QEMUFile f and write them to fd.
+ */
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
+{
+    while (size) {
+        size_t pending = f->buf_size - f->buf_index;
+        ssize_t rc;
+
+        if (!pending) {
+            rc = qemu_fill_buffer(f);
+            if (rc < 0) {
+                return rc;
+            }
+            if (rc == 0) {
+                return -1;
+            }
+            continue;
+        }
+
+        rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
+        if (rc < 0) {
+            return rc;
+        }
+        if (rc == 0) {
+            return -1;
+        }
+        f->buf_index += rc;
+        size -= rc;
+    }
+
+    return 0;
+}
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 3f36d4dc8c..dd26037450 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -162,6 +162,7 @@ int qemu_file_shutdown(QEMUFile *f);
 QEMUFile *qemu_file_get_return_path(QEMUFile *f);
 void qemu_fflush(QEMUFile *f);
 void qemu_file_set_blocking(QEMUFile *f, bool block);
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
 
 void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 43+ messages in thread
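
As a usage sketch, the v2 load path added in the next patch consumes the
device data stream by funneling each size-prefixed block straight into the
device's migration data_fd. Roughly, with error handling trimmed (data_fd
and the VFIO_MIG_FLAG_* markers are introduced in patch 6):

    static int load_device_sections_sketch(QEMUFile *f, int data_fd)
    {
        for (;;) {
            uint64_t flag = qemu_get_be64(f);

            if (flag == VFIO_MIG_FLAG_END_OF_STATE) {
                return qemu_file_get_error(f);
            }
            if (flag == VFIO_MIG_FLAG_DEV_DATA_STATE) {
                uint64_t size = qemu_get_be64(f);

                /* No bounce buffer: data goes from the stream buffer
                 * straight into the device fd. */
                int ret = qemu_file_get_to_fd(f, data_fd, size);

                if (ret) {
                    return ret;
                }
            }
        }
    }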

* [PATCH 6/9] vfio/migration: Implement VFIO migration protocol v2
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (4 preceding siblings ...)
  2022-05-12 15:43 ` [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd() Avihai Horon
@ 2022-05-12 15:43 ` Avihai Horon
  2022-05-23 18:14   ` Joao Martins
  2022-05-12 15:43 ` [PATCH 7/9] vfio/migration: Reset device if setting recover state fails Avihai Horon
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

Replace the current VFIO migration protocol v1 implementation with a new
implementation corresponding to VFIO migration protocol v2.

The main changes are:
- VFIO device state is now represented as a finite state machine instead
  of a bitmap.

- Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
  ioctl and normal read() and write() instead of the migration region.

- As VFIO migration protocol v2 currently doesn't support the pre-copy
  phase of migration, .save_live_pending and .save_live_iterate handlers
  plus pre-copy relevant code are removed.

Detailed information about VFIO migration protocol v2 and its differences
from v1 can be found here [1].

[1]
https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c              |  21 +-
 hw/vfio/migration.c           | 628 +++++++---------------------------
 hw/vfio/trace-events          |   9 +-
 include/hw/vfio/vfio-common.h |   8 +-
 4 files changed, 153 insertions(+), 513 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 29982c7af8..4c6baa5a79 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "migration/misc.h"
 #include "sysemu/tpm.h"
 
 VFIOGroupList vfio_group_list =
@@ -354,8 +355,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
                 return false;
             }
 
-            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
-                && (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
+            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+                 (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
                 return false;
             }
         }
@@ -363,13 +365,16 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
     return true;
 }
 
-static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+/*
+ * Check if all VFIO devices are running and migration is active, which is
+ * essentially equivalent to the migration being in the pre-copy phase.
+ */
+static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 {
     VFIOGroup *group;
     VFIODevice *vbasedev;
-    MigrationState *ms = migrate_get_current();
 
-    if (!migration_is_setup_or_active(ms->state)) {
+    if (!migration_is_active(migrate_get_current())) {
         return false;
     }
 
@@ -381,8 +386,8 @@ static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
                 return false;
             }
 
-            if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) &&
-                (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
+            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
                 continue;
             } else {
                 return false;
@@ -461,7 +466,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
     };
 
     if (iotlb && container->dirty_pages_supported &&
-        vfio_devices_all_running_and_saving(container)) {
+        vfio_devices_all_running_and_mig_active(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
 
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index d4b6653026..8943ccbace 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -44,309 +44,96 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
 
+#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)
+
 static int64_t bytes_transferred;
 
-static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
-                                  off_t off, bool iswrite)
+static const char *mig_state_to_str(enum vfio_device_mig_state state)
 {
-    int ret;
-
-    ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
-                    pread(vbasedev->fd, val, count, off);
-    if (ret < count) {
-        error_report("vfio_mig_%s %d byte %s: failed at offset 0x%"
-                     HWADDR_PRIx", err: %s", iswrite ? "write" : "read", count,
-                     vbasedev->name, off, strerror(errno));
-        return (ret < 0) ? ret : -EINVAL;
+    switch (state) {
+    case VFIO_DEVICE_STATE_ERROR:
+        return "ERROR";
+    case VFIO_DEVICE_STATE_STOP:
+        return "STOP";
+    case VFIO_DEVICE_STATE_RUNNING:
+        return "RUNNING";
+    case VFIO_DEVICE_STATE_STOP_COPY:
+        return "STOP_COPY";
+    case VFIO_DEVICE_STATE_RESUMING:
+        return "RESUMING";
+    case VFIO_DEVICE_STATE_RUNNING_P2P:
+        return "RUNNING_P2P";
+    default:
+        return "UNKNOWN STATE";
     }
-    return 0;
 }
 
-static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
-                       off_t off, bool iswrite)
-{
-    int ret, done = 0;
-    __u8 *tbuf = buf;
-
-    while (count) {
-        int bytes = 0;
-
-        if (count >= 8 && !(off % 8)) {
-            bytes = 8;
-        } else if (count >= 4 && !(off % 4)) {
-            bytes = 4;
-        } else if (count >= 2 && !(off % 2)) {
-            bytes = 2;
-        } else {
-            bytes = 1;
-        }
-
-        ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
-        if (ret) {
-            return ret;
-        }
-
-        count -= bytes;
-        done += bytes;
-        off += bytes;
-        tbuf += bytes;
-    }
-    return done;
-}
-
-#define vfio_mig_read(f, v, c, o)       vfio_mig_rw(f, (__u8 *)v, c, o, false)
-#define vfio_mig_write(f, v, c, o)      vfio_mig_rw(f, (__u8 *)v, c, o, true)
-
-#define VFIO_MIG_STRUCT_OFFSET(f)       \
-                                 offsetof(struct vfio_device_migration_info, f)
-/*
- * Change the device_state register for device @vbasedev. Bits set in @mask
- * are preserved, bits set in @value are set, and bits not set in either @mask
- * or @value are cleared in device_state. If the register cannot be accessed,
- * the resulting state would be invalid, or the device enters an error state,
- * an error is returned.
- */
-
-static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
-                                    uint32_t value)
+static int vfio_migration_set_state(VFIODevice *vbasedev,
+                                    enum vfio_device_mig_state new_state,
+                                    enum vfio_device_mig_state recover_state)
 {
     VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
-    off_t dev_state_off = region->fd_offset +
-                          VFIO_MIG_STRUCT_OFFSET(device_state);
-    uint32_t device_state;
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                              sizeof(struct vfio_device_feature_mig_state),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (void *)buf;
+    struct vfio_device_feature_mig_state *mig_state = (void *)feature->data;
     int ret;
 
-    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
-                        dev_state_off);
-    if (ret < 0) {
-        return ret;
-    }
-
-    device_state = (device_state & mask) | value;
-
-    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
-        return -EINVAL;
-    }
-
-    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
-                         dev_state_off);
-    if (ret < 0) {
-        int rret;
-
-        rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
-                             dev_state_off);
-
-        if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
-            hw_error("%s: Device in error state 0x%x", vbasedev->name,
-                     device_state);
-            return rret ? rret : -EIO;
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
+    mig_state->device_state = new_state;
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
+    if (ret) {
+        /* Try to put the device in some good state */
+        mig_state->device_state = recover_state;
+        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+            hw_error("%s: Device in error state, can't recover",
+                     vbasedev->name);
         }
-        return ret;
-    }
-
-    migration->device_state = device_state;
-    trace_vfio_migration_set_state(vbasedev->name, device_state);
-    return 0;
-}
 
-static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
-                                   uint64_t data_size, uint64_t *size)
-{
-    void *ptr = NULL;
-    uint64_t limit = 0;
-    int i;
+        error_report("%s: Failed changing device state to %s", vbasedev->name,
+                     mig_state_to_str(new_state));
+        migration->device_state = recover_state;
 
-    if (!region->mmaps) {
-        if (size) {
-            *size = MIN(data_size, region->size - data_offset);
-        }
-        return ptr;
+        return -1;
     }
 
-    for (i = 0; i < region->nr_mmaps; i++) {
-        VFIOMmap *map = region->mmaps + i;
-
-        if ((data_offset >= map->offset) &&
-            (data_offset < map->offset + map->size)) {
-
-            /* check if data_offset is within sparse mmap areas */
-            ptr = map->mmap + data_offset - map->offset;
-            if (size) {
-                *size = MIN(data_size, map->offset + map->size - data_offset);
-            }
-            break;
-        } else if ((data_offset < map->offset) &&
-                   (!limit || limit > map->offset)) {
+    if (mig_state->data_fd != -1) {
+        if (migration->data_fd != -1) {
             /*
-             * data_offset is not within sparse mmap areas, find size of
-             * non-mapped area. Check through all list since region->mmaps list
-             * is not sorted.
+             * This can happen if the device is asynchronously reset and
+             * terminates a data transfer.
              */
-            limit = map->offset;
-        }
-    }
-
-    if (!ptr && size) {
-        *size = limit ? MIN(data_size, limit - data_offset) : data_size;
-    }
-    return ptr;
-}
-
-static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
-{
-    VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
-    uint64_t data_offset = 0, data_size = 0, sz;
-    int ret;
-
-    ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
-                      region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
-    if (ret < 0) {
-        return ret;
-    }
-
-    ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
-                        region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
-    if (ret < 0) {
-        return ret;
-    }
+            error_report("%s: data_fd out of sync", vbasedev->name);
+            close(mig_state->data_fd);
 
-    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
-                           migration->pending_bytes);
-
-    qemu_put_be64(f, data_size);
-    sz = data_size;
-
-    while (sz) {
-        void *buf;
-        uint64_t sec_size;
-        bool buf_allocated = false;
-
-        buf = get_data_section_size(region, data_offset, sz, &sec_size);
-
-        if (!buf) {
-            buf = g_try_malloc(sec_size);
-            if (!buf) {
-                error_report("%s: Error allocating buffer ", __func__);
-                return -ENOMEM;
-            }
-            buf_allocated = true;
-
-            ret = vfio_mig_read(vbasedev, buf, sec_size,
-                                region->fd_offset + data_offset);
-            if (ret < 0) {
-                g_free(buf);
-                return ret;
-            }
+            return -1;
         }
 
-        qemu_put_buffer(f, buf, sec_size);
-
-        if (buf_allocated) {
-            g_free(buf);
-        }
-        sz -= sec_size;
-        data_offset += sec_size;
+        migration->data_fd = mig_state->data_fd;
     }
+    migration->device_state = new_state;
 
-    ret = qemu_file_get_error(f);
+    trace_vfio_migration_set_state(vbasedev->name, new_state);
 
-    if (!ret && size) {
-        *size = data_size;
-    }
-
-    bytes_transferred += data_size;
-    return ret;
+    return 0;
 }
 
 static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
                             uint64_t data_size)
-{
-    VFIORegion *region = &vbasedev->migration->region;
-    uint64_t data_offset = 0, size, report_size;
-    int ret;
-
-    do {
-        ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
-                      region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
-        if (ret < 0) {
-            return ret;
-        }
-
-        if (data_offset + data_size > region->size) {
-            /*
-             * If data_size is greater than the data section of migration region
-             * then iterate the write buffer operation. This case can occur if
-             * size of migration region at destination is smaller than size of
-             * migration region at source.
-             */
-            report_size = size = region->size - data_offset;
-            data_size -= size;
-        } else {
-            report_size = size = data_size;
-            data_size = 0;
-        }
-
-        trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
-
-        while (size) {
-            void *buf;
-            uint64_t sec_size;
-            bool buf_alloc = false;
-
-            buf = get_data_section_size(region, data_offset, size, &sec_size);
-
-            if (!buf) {
-                buf = g_try_malloc(sec_size);
-                if (!buf) {
-                    error_report("%s: Error allocating buffer ", __func__);
-                    return -ENOMEM;
-                }
-                buf_alloc = true;
-            }
-
-            qemu_get_buffer(f, buf, sec_size);
-
-            if (buf_alloc) {
-                ret = vfio_mig_write(vbasedev, buf, sec_size,
-                        region->fd_offset + data_offset);
-                g_free(buf);
-
-                if (ret < 0) {
-                    return ret;
-                }
-            }
-            size -= sec_size;
-            data_offset += sec_size;
-        }
-
-        ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
-                        region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
-        if (ret < 0) {
-            return ret;
-        }
-    } while (data_size);
-
-    return 0;
-}
-
-static int vfio_update_pending(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
-    uint64_t pending_bytes = 0;
     int ret;
 
-    ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
-                    region->fd_offset + VFIO_MIG_STRUCT_OFFSET(pending_bytes));
-    if (ret < 0) {
-        migration->pending_bytes = 0;
+    ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
+    if (ret) {
         return ret;
     }
 
-    migration->pending_bytes = pending_bytes;
-    trace_vfio_update_pending(vbasedev->name, pending_bytes);
+    trace_vfio_load_state_device_data(vbasedev->name, data_size);
+
     return 0;
 }
 
@@ -398,9 +185,8 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
 
-    if (migration->region.mmaps) {
-        vfio_region_unmap(&migration->region);
-    }
+    close(migration->data_fd);
+    migration->data_fd = -1;
 }
 
 /* ---------------------------------------------------------------------- */
@@ -408,44 +194,13 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
 static int vfio_save_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    int ret;
 
     trace_vfio_save_setup(vbasedev->name);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
-
-    if (migration->region.mmaps) {
-        /*
-         * Calling vfio_region_mmap() from migration thread. Memory API called
-         * from this function require locking the iothread when called from
-         * outside the main loop thread.
-         */
-        qemu_mutex_lock_iothread();
-        ret = vfio_region_mmap(&migration->region);
-        qemu_mutex_unlock_iothread();
-        if (ret) {
-            error_report("%s: Failed to mmap VFIO migration region: %s",
-                         vbasedev->name, strerror(-ret));
-            error_report("%s: Falling back to slow path", vbasedev->name);
-        }
-    }
-
-    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
-                                   VFIO_DEVICE_STATE_V1_SAVING);
-    if (ret) {
-        error_report("%s: Failed to set state SAVING", vbasedev->name);
-        return ret;
-    }
-
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
-    ret = qemu_file_get_error(f);
-    if (ret) {
-        return ret;
-    }
-
-    return 0;
+    return qemu_file_get_error(f);
 }
 
 static void vfio_save_cleanup(void *opaque)
@@ -456,127 +211,67 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
-static void vfio_save_pending(QEMUFile *f, void *opaque,
-                              uint64_t threshold_size,
-                              uint64_t *res_precopy_only,
-                              uint64_t *res_compatible,
-                              uint64_t *res_postcopy_only)
-{
-    VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    int ret;
-
-    ret = vfio_update_pending(vbasedev);
-    if (ret) {
-        return;
-    }
-
-    *res_precopy_only += migration->pending_bytes;
-
-    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
-                            *res_postcopy_only, *res_compatible);
-}
-
-static int vfio_save_iterate(QEMUFile *f, void *opaque)
+/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
+static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
-    VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    uint64_t data_size;
-    int ret;
-
-    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
-
-    if (migration->pending_bytes == 0) {
-        ret = vfio_update_pending(vbasedev);
-        if (ret) {
-            return ret;
-        }
+    ssize_t data_size;
 
-        if (migration->pending_bytes == 0) {
-            qemu_put_be64(f, 0);
-            qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
-            /* indicates data finished, goto complete phase */
-            return 1;
-        }
+    data_size = read(migration->data_fd, migration->data_buffer,
+                     migration->data_buffer_size);
+    if (data_size < 0) {
+        return -1;
     }
-
-    ret = vfio_save_buffer(f, vbasedev, &data_size);
-    if (ret) {
-        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
-                     strerror(errno));
-        return ret;
+    if (data_size == 0) {
+        return 1;
     }
 
-    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+    qemu_put_be64(f, data_size);
+    qemu_put_buffer_async(f, migration->data_buffer, data_size, false);
+    qemu_fflush(f);
+    bytes_transferred += data_size;
 
-    ret = qemu_file_get_error(f);
-    if (ret) {
-        return ret;
-    }
+    trace_vfio_save_block(migration->vbasedev->name, data_size);
 
-    /*
-     * Reset pending_bytes as .save_live_pending is not called during savevm or
-     * snapshot case, in such case vfio_update_pending() at the start of this
-     * function updates pending_bytes.
-     */
-    migration->pending_bytes = 0;
-    trace_vfio_save_iterate(vbasedev->name, data_size);
-    return 0;
+    return qemu_file_get_error(f);
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    uint64_t data_size;
+    enum vfio_device_mig_state recover_state;
     int ret;
 
-    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
-                                   VFIO_DEVICE_STATE_V1_SAVING);
+    /* We reach here with device state STOP or STOP_COPY only */
+    recover_state = VFIO_DEVICE_STATE_STOP;
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+                                   recover_state);
     if (ret) {
-        error_report("%s: Failed to set state STOP and SAVING",
-                     vbasedev->name);
         return ret;
     }
 
-    ret = vfio_update_pending(vbasedev);
-    if (ret) {
-        return ret;
-    }
-
-    while (migration->pending_bytes > 0) {
-        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
-        ret = vfio_save_buffer(f, vbasedev, &data_size);
+    do {
+        ret = vfio_save_block(f, vbasedev->migration);
         if (ret < 0) {
-            error_report("%s: Failed to save buffer", vbasedev->name);
-            return ret;
-        }
-
-        if (data_size == 0) {
-            break;
-        }
-
-        ret = vfio_update_pending(vbasedev);
-        if (ret) {
             return ret;
         }
-    }
+    } while (!ret);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
-
     ret = qemu_file_get_error(f);
     if (ret) {
         return ret;
     }
 
-    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING, 0);
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
+                                   recover_state);
     if (ret) {
-        error_report("%s: Failed to set state STOPPED", vbasedev->name);
         return ret;
     }
 
     trace_vfio_save_complete_precopy(vbasedev->name);
-    return ret;
+
+    return 0;
 }
 
 static void vfio_save_state(QEMUFile *f, void *opaque)
@@ -595,28 +290,9 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
 static int vfio_load_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    int ret = 0;
 
-    if (migration->region.mmaps) {
-        ret = vfio_region_mmap(&migration->region);
-        if (ret) {
-            error_report("%s: Failed to mmap VFIO migration region %d: %s",
-                         vbasedev->name, migration->region.nr,
-                         strerror(-ret));
-            error_report("%s: Falling back to slow path", vbasedev->name);
-        }
-    }
-
-    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
-                                   VFIO_DEVICE_STATE_V1_RESUMING);
-    if (ret) {
-        error_report("%s: Failed to set state RESUMING", vbasedev->name);
-        if (migration->region.mmaps) {
-            vfio_region_unmap(&migration->region);
-        }
-    }
-    return ret;
+    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+                                   vbasedev->migration->device_state);
 }
 
 static int vfio_load_cleanup(void *opaque)
@@ -685,8 +361,6 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
-    .save_live_pending = vfio_save_pending,
-    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -699,58 +373,28 @@ static SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    uint32_t value, mask;
+    enum vfio_device_mig_state new_state;
     int ret;
 
-    if (vbasedev->migration->vm_running == running) {
-        return;
-    }
-
     if (running) {
-        /*
-         * Here device state can have one of _SAVING, _RESUMING or _STOP bit.
-         * Transition from _SAVING to _RUNNING can happen if there is migration
-         * failure, in that case clear _SAVING bit.
-         * Transition from _RESUMING to _RUNNING occurs during resuming
-         * phase, in that case clear _RESUMING bit.
-         * In both the above cases, set _RUNNING bit.
-         */
-        mask = ~VFIO_DEVICE_STATE_MASK;
-        value = VFIO_DEVICE_STATE_V1_RUNNING;
+        new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        /*
-         * Here device state could be either _RUNNING or _SAVING|_RUNNING. Reset
-         * _RUNNING bit
-         */
-        mask = ~VFIO_DEVICE_STATE_V1_RUNNING;
-
-        /*
-         * When VM state transition to stop for savevm command, device should
-         * start saving data.
-         */
-        if (state == RUN_STATE_SAVE_VM) {
-            value = VFIO_DEVICE_STATE_V1_SAVING;
-        } else {
-            value = 0;
-        }
+        new_state = VFIO_DEVICE_STATE_STOP;
     }
 
-    ret = vfio_migration_set_state(vbasedev, mask, value);
+    ret = vfio_migration_set_state(vbasedev, new_state,
+                                   VFIO_DEVICE_STATE_ERROR);
     if (ret) {
         /*
          * Migration should be aborted in this case, but vm_state_notify()
          * currently does not support reporting failures.
          */
-        error_report("%s: Failed to set device state 0x%x", vbasedev->name,
-                     (migration->device_state & mask) | value);
         if (migrate_get_current()->to_dst_file) {
             qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
         }
     }
-    vbasedev->migration->vm_running = running;
     trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
-            (migration->device_state & mask) | value);
+                              new_state);
 }
 
 static void vfio_migration_state_notifier(Notifier *notifier, void *data)
@@ -759,7 +403,6 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
     VFIOMigration *migration = container_of(notifier, VFIOMigration,
                                             migration_state);
     VFIODevice *vbasedev = migration->vbasedev;
-    int ret;
 
     trace_vfio_migration_state_notifier(vbasedev->name,
                                         MigrationStatus_str(s->state));
@@ -769,34 +412,45 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_FAILED:
         bytes_transferred = 0;
-        ret = vfio_migration_set_state(
-            vbasedev,
-            ~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
-            VFIO_DEVICE_STATE_V1_RUNNING);
-        if (ret) {
-            error_report("%s: Failed to set state RUNNING", vbasedev->name);
-        }
+        vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
+                                 VFIO_DEVICE_STATE_ERROR);
     }
 }
 
 static void vfio_migration_exit(VFIODevice *vbasedev)
 {
-    VFIOMigration *migration = vbasedev->migration;
-
-    vfio_region_exit(&migration->region);
-    vfio_region_finalize(&migration->region);
+    g_free(vbasedev->migration->data_buffer);
     g_free(vbasedev->migration);
     vbasedev->migration = NULL;
 }
 
-static int vfio_migration_init(VFIODevice *vbasedev,
-                               struct vfio_region_info *info)
+static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                                  sizeof(struct vfio_device_feature_migration),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (void *)buf;
+    struct vfio_device_feature_migration *mig = (void *)feature->data;
+
+    feature->argsz = sizeof(buf);
+    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        return -EOPNOTSUPP;
+    }
+
+    *mig_flags = mig->flags;
+
+    return 0;
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev)
 {
-    int ret;
     Object *obj;
     VFIOMigration *migration;
     char id[256] = "";
     g_autofree char *path = NULL, *oid = NULL;
+    uint64_t mig_flags;
+    int ret;
 
     if (!vbasedev->ops->vfio_get_object) {
         return -EINVAL;
@@ -807,25 +461,23 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return -EINVAL;
     }
 
-    vbasedev->migration = g_new0(VFIOMigration, 1);
-
-    ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
-                            info->index, "migration");
+    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
     if (ret) {
-        error_report("%s: Failed to setup VFIO migration region %d: %s",
-                     vbasedev->name, info->index, strerror(-ret));
-        goto err;
+        return ret;
     }
 
-    if (!vbasedev->migration->region.size) {
-        error_report("%s: Invalid zero-sized VFIO migration region %d",
-                     vbasedev->name, info->index);
-        ret = -EINVAL;
-        goto err;
+    /* Basic migration functionality must be supported */
+    if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
+        return -EOPNOTSUPP;
     }
 
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+    vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
+    vbasedev->migration->data_buffer =
+        g_malloc0(vbasedev->migration->data_buffer_size);
     migration = vbasedev->migration;
     migration->vbasedev = vbasedev;
+    migration->data_fd = -1;
 
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
@@ -837,17 +489,13 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 
     register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
                          vbasedev);
-
     migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
                                                            vfio_vmstate_change,
                                                            vbasedev);
     migration->migration_state.notify = vfio_migration_state_notifier;
     add_migration_state_change_notifier(&migration->migration_state);
-    return 0;
 
-err:
-    vfio_migration_exit(vbasedev);
-    return ret;
+    return 0;
 }
 
 /* ---------------------------------------------------------------------- */
@@ -860,7 +508,6 @@ int64_t vfio_mig_bytes_transferred(void)
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
     VFIOContainer *container = vbasedev->group->container;
-    struct vfio_region_info *info = NULL;
     int ret = -ENOTSUP;
 
     if (!vbasedev->enable_migration) {
@@ -874,27 +521,18 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
         migrate_get_current()->skip_precopy = true;
     }
 
-    ret = vfio_get_dev_region_info(vbasedev,
-                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
-                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
-                                   &info);
+    ret = vfio_migration_init(vbasedev);
     if (ret) {
         goto add_blocker;
     }
 
-    ret = vfio_migration_init(vbasedev, info);
-    if (ret) {
-        goto add_blocker;
-    }
+    trace_vfio_migration_probe(vbasedev->name);
 
-    trace_vfio_migration_probe(vbasedev->name, info->index);
-    g_free(info);
     return 0;
 
 add_blocker:
     error_setg(&vbasedev->migration_blocker,
                "VFIO device doesn't support migration");
-    g_free(info);
 
     ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
     if (ret < 0) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 582882db91..a24ea7d8b0 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,21 +148,18 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
 
 # migration.c
-vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
 vfio_save_setup(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
-vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
-vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
 vfio_save_device_config_state(const char *name) " (%s)"
-vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
-vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
-vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
 vfio_load_cleanup(const char *name) " (%s)"
 vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
+vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e573f5a9f1..09446a9082 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -61,11 +61,11 @@ typedef struct VFIORegion {
 typedef struct VFIOMigration {
     struct VFIODevice *vbasedev;
     VMChangeStateEntry *vm_state;
-    VFIORegion region;
-    uint32_t device_state;
-    int vm_running;
+    enum vfio_device_mig_state device_state;
+    int data_fd;
     Notifier migration_state;
-    uint64_t pending_bytes;
+    void *data_buffer;
+    size_t data_buffer_size;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 43+ messages in thread
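
The VFIO_DEVICE_FEATURE call at the heart of vfio_migration_set_state()
packs the feature header and the mig_state payload into a single 8-byte
aligned buffer. A standalone sketch of just that call, with the
recover-state and data_fd bookkeeping stripped out (device_fd is assumed
to be an open VFIO device fd):

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Returns the new data stream fd (-1 if none), or -1 on error. */
    static int set_device_mig_state(int device_fd,
                                    enum vfio_device_mig_state state)
    {
        /* Feature header immediately followed by the payload. */
        uint64_t buf[(sizeof(struct vfio_device_feature) +
                      sizeof(struct vfio_device_feature_mig_state) + 7) / 8];
        struct vfio_device_feature *feature = (void *)buf;
        struct vfio_device_feature_mig_state *mig_state =
            (void *)feature->data;

        memset(buf, 0, sizeof(buf));
        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_SET |
                         VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
        mig_state->device_state = state;

        if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
            return -1;
        }

        /* Transitions through STOP_COPY or RESUMING hand back an fd for
         * the device data stream in mig_state->data_fd. */
        return mig_state->data_fd;
    }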

* [PATCH 7/9] vfio/migration: Reset device if setting recover state fails
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (5 preceding siblings ...)
  2022-05-12 15:43 ` [PATCH 6/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
@ 2022-05-12 15:43 ` Avihai Horon
  2022-05-12 15:43 ` [PATCH 8/9] vfio: Alphabetize migration section of VFIO trace-events file Avihai Horon
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

If vfio_migration_set_state() fails to set the device in the requested
state it tries to put it in a recover state. If setting the device in
the recover state fails as well, hw_error is triggered and the VM is
aborted.

To improve user experience and avoid VM data loss, reset the device with
VFIO_DEVICE_RESET instead of aborting the VM.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 8943ccbace..aec8afcd6c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -89,8 +89,16 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
         /* Try to put the device in some good state */
         mig_state->device_state = recover_state;
         if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
-            hw_error("%s: Device in error state, can't recover",
-                     vbasedev->name);
+            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
+                hw_error("%s: Device in error state, can't recover",
+                         vbasedev->name);
+            }
+
+            error_report(
+                "%s: Device was reset due to failure in changing device state to recover state %s",
+                vbasedev->name, mig_state_to_str(recover_state));
+
+            return -1;
         }
 
         error_report("%s: Failed changing device state to %s", vbasedev->name,
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 43+ messages in thread
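
Spelled out, the error path above is a three-tier fallback: try the
requested state, then the recover state, then a full device reset, and
only abort the VM if even the reset fails. A sketch under the same
assumptions as the state-set sketch after patch 6 (per the v2 uAPI a
successful VFIO_DEVICE_RESET leaves the device in the RUNNING state, so
the migration still fails but the VM keeps running):

    static int set_state_with_fallback(int device_fd,
                                       struct vfio_device_feature *feature,
                                       struct vfio_device_feature_mig_state *ms,
                                       enum vfio_device_mig_state new_state,
                                       enum vfio_device_mig_state recover)
    {
        ms->device_state = new_state;
        if (!ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
            return 0;           /* reached the requested state */
        }

        ms->device_state = recover;
        if (!ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
            return -1;          /* recovered; migration fails cleanly */
        }

        if (ioctl(device_fd, VFIO_DEVICE_RESET)) {
            abort();            /* stand-in for hw_error(): VM is lost */
        }
        return -1;              /* device reset to RUNNING; VM survives */
    }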

* [PATCH 8/9] vfio: Alphabetize migration section of VFIO trace-events file
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (6 preceding siblings ...)
  2022-05-12 15:43 ` [PATCH 7/9] vfio/migration: Reset device if setting recover state fails Avihai Horon
@ 2022-05-12 15:43 ` Avihai Horon
  2022-05-12 15:43 ` [PATCH 9/9] docs/devel: Align vfio-migration docs to VFIO migration v2 Avihai Horon
  2022-05-12 18:02 ` [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Alex Williamson
  9 siblings, 0 replies; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

Sort the migration section of the VFIO trace-events file alphabetically
and move two misplaced traces to the common.c section.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/trace-events | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a24ea7d8b0..d3cba59bfd 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -119,6 +119,8 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
+vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
@@ -148,18 +150,16 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
 
 # migration.c
+vfio_load_cleanup(const char *name) " (%s)"
+vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
 vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
-vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
-vfio_save_setup(const char *name) " (%s)"
+vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
-vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name) " (%s)"
-vfio_load_device_config_state(const char *name) " (%s)"
-vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
-vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
-vfio_load_cleanup(const char *name) " (%s)"
-vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
-vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
-vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
+vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_setup(const char *name) " (%s)"
+vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 9/9] docs/devel: Align vfio-migration docs to VFIO migration v2
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (7 preceding siblings ...)
  2022-05-12 15:43 ` [PATCH 8/9] vfio: Alphabetize migration section of VFIO trace-events file Avihai Horon
@ 2022-05-12 15:43 ` Avihai Horon
  2022-05-12 18:02 ` [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Alex Williamson
  9 siblings, 0 replies; 43+ messages in thread
From: Avihai Horon @ 2022-05-12 15:43 UTC (permalink / raw)
  To: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Juan Quintela, Dr . David Alan Gilbert
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

Align the vfio-migration documentation to VFIO migration protocol v2.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst | 77 +++++++++++++++--------------------
 1 file changed, 33 insertions(+), 44 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 9ff6163c88..09744af5a6 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,46 +7,35 @@ the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices consists of two phases: the optional pre-copy phase,
-and the stop-and-copy phase. The pre-copy phase is iterative and allows to
-accommodate VFIO devices that have a large amount of data that needs to be
-transferred. The iterative pre-copy phase of migration allows for the guest to
-continue whilst the VFIO device state is transferred to the destination, this
-helps to reduce the total downtime of the VM. VFIO devices can choose to skip
-the pre-copy phase of migration by returning pending_bytes as zero during the
-pre-copy phase.
+Migration of VFIO devices currently consists of a single stop-and-copy phase.
+During the stop-and-copy phase the guest is stopped and the entire VFIO device
+data is transferred to the destination.
+
+The pre-copy phase of migration is currently not supported for VFIO devices,
+so no VFIO device data is transferred during the pre-copy phase.
 
 A detailed description of the UAPI for VFIO device migration can be found in
-the comment for the ``vfio_device_migration_info`` structure in the header
-file linux-headers/linux/vfio.h.
+the comment for the ``vfio_device_mig_state`` structure in the header file
+linux-headers/linux/vfio.h.
 
 VFIO implements the device hooks for the iterative approach as follows:
 
-* A ``save_setup`` function that sets up the migration region and sets _SAVING
-  flag in the VFIO device state.
-
-* A ``load_setup`` function that sets up the migration region on the
-  destination and sets _RESUMING flag in the VFIO device state.
-
-* A ``save_live_pending`` function that reads pending_bytes from the vendor
-  driver, which indicates the amount of data that the vendor driver has yet to
-  save for the VFIO device.
+* A ``save_setup`` function that sets up migration on the source.
 
-* A ``save_live_iterate`` function that reads the VFIO device's data from the
-  vendor driver through the migration region during iterative phase.
+* A ``load_setup`` function that puts the VFIO device on the destination
+  into the _RESUMING state.
 
 * A ``save_state`` function to save the device config space if it is present.
 
-* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
-  VFIO device state and iteratively copies the remaining data for the VFIO
-  device until the vendor driver indicates that no data remains (pending bytes
-  is zero).
+* A ``save_live_complete_precopy`` function that sets the VFIO device in
+  _STOP_COPY state and iteratively copies the data for the VFIO device until
+  the vendor driver indicates that no data remains.
 
 * A ``load_state`` function that loads the config section and the data
-  sections that are generated by the save functions above
+  sections that are generated by the save functions above.
 
 * ``cleanup`` functions for both save and load that perform any migration
-  related cleanup, including unmapping the migration region
+  related cleanup.
 
 
 The VFIO migration code uses a VM state change handler to change the VFIO
@@ -71,13 +60,13 @@ tracking can identify dirtied pages, but any page pinned by the vendor driver
 can also be written by the device. There is currently no device or IOMMU
 support for dirty page tracking in hardware.
 
-By default, dirty pages are tracked when the device is in pre-copy as well as
-stop-and-copy phase. So, a page pinned by the vendor driver will be copied to
-the destination in both phases. Copying dirty pages in pre-copy phase helps
-QEMU to predict if it can achieve its downtime tolerances. If QEMU during
-pre-copy phase keeps finding dirty pages continuously, then it understands
-that even in stop-and-copy phase, it is likely to find dirty pages and can
-predict the downtime accordingly.
+By default, dirty pages are tracked during pre-copy as well as stop-and-copy
+phase. So, a page pinned by the vendor driver will be copied to the destination
+in both phases. Copying dirty pages in pre-copy phase helps QEMU to predict if
+it can achieve its downtime tolerances. If QEMU during pre-copy phase keeps
+finding dirty pages continuously, then it understands that even in stop-and-copy
+phase, it is likely to find dirty pages and can predict the downtime
+accordingly.
 
 QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
 which disables querying the dirty bitmap during pre-copy phase. If it is set to
@@ -111,23 +100,23 @@ Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                    (RUNNING, _SETUP, _RUNNING|_SAVING)
+                       (RUNNING, _SETUP, _RUNNING)
                                   |
-                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
-             If device is active, get pending_bytes by .save_live_pending()
-          If total pending_bytes >= threshold_size, call .save_live_iterate()
-                  Data of VFIO device for pre-copy phase is copied
+                      (RUNNING, _ACTIVE, _RUNNING)
+         Migration thread calls each .save_live_pending() handler
+  If total pending_bytes >= threshold_size, call each .save_live_iterate() handler
+          Data of this iteration for pre-copy phase is copied
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
-   each active device. The VFIO device is then transitioned into _SAVING state
-                   (FINISH_MIGRATE, _DEVICE, _SAVING)
+  each active device. The VFIO device is then transitioned into _STOP_COPY state
+                  (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
                                   |
      For the VFIO device, iterate in .save_live_complete_precopy until
                          pending data is 0
-                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
+                   (FINISH_MIGRATE, _DEVICE, _STOP)
                                   |
-                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+                 (FINISH_MIGRATE, _COMPLETED, _STOP)
             Migration thread schedules cleanup bottom half and exits
 
 Live migration resume path
@@ -136,7 +125,7 @@ Live migration resume path
 ::
 
               Incoming migration calls .load_setup for each device
-                       (RESTORE_VM, _ACTIVE, _STOPPED)
+                       (RESTORE_VM, _ACTIVE, _STOP)
                                  |
        For each device, .load_state is called for that device section data
                        (RESTORE_VM, _ACTIVE, _RESUMING)
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 43+ messages in thread
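
Condensed from the two flows above, the device_state sequence each side of
the v2 protocol walks is short (a summary of the patch, not part of it):

    Source:       RUNNING --> STOP --> STOP_COPY --> STOP
    Destination:  STOP --> RESUMING --> RUNNING

The source drops to STOP when the vCPUs stop, enters STOP_COPY to drain the
device data in .save_live_complete_precopy, and returns to STOP once no data
remains; the destination sits in RESUMING while device sections are loaded
and moves to RUNNING when the VM is started.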

* Re: [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation
  2022-05-12 15:43 ` [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation Avihai Horon
@ 2022-05-12 17:57   ` Alex Williamson
  2022-05-12 18:25     ` Jason Gunthorpe
  2022-05-13  7:08     ` Cornelia Huck
  0 siblings, 2 replies; 43+ messages in thread
From: Alex Williamson @ 2022-05-12 17:57 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Juan Quintela, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Thu, 12 May 2022 18:43:13 +0300
Avihai Horon <avihaih@nvidia.com> wrote:

> VFIO migration protocol v1 was deprecated and as part of it some of the
> uAPI definitions were renamed. This caused compilation errors.
> Fix them.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/common.c    |  6 +++---
>  hw/vfio/migration.c | 29 ++++++++++++++++-------------
>  2 files changed, 19 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 159f910421..29982c7af8 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -355,7 +355,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>              }
>  
>              if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
> -                && (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +                && (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
>                  return false;
>              }
>          }
> @@ -381,8 +381,8 @@ static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
>                  return false;
>              }
>  
> -            if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
> -                (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +            if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) &&
> +                (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
>                  continue;
>              } else {
>                  return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ff6b45de6b..835608cd23 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -432,7 +432,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>      }
>  
>      ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> -                                   VFIO_DEVICE_STATE_SAVING);
> +                                   VFIO_DEVICE_STATE_V1_SAVING);
>      if (ret) {
>          error_report("%s: Failed to set state SAVING", vbasedev->name);
>          return ret;
> @@ -531,8 +531,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      uint64_t data_size;
>      int ret;
>  
> -    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
> -                                   VFIO_DEVICE_STATE_SAVING);
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
> +                                   VFIO_DEVICE_STATE_V1_SAVING);
>      if (ret) {
>          error_report("%s: Failed to set state STOP and SAVING",
>                       vbasedev->name);
> @@ -569,7 +569,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> -    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING, 0);
>      if (ret) {
>          error_report("%s: Failed to set state STOPPED", vbasedev->name);
>          return ret;
> @@ -609,7 +609,7 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
>      }
>  
>      ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
> -                                   VFIO_DEVICE_STATE_RESUMING);
> +                                   VFIO_DEVICE_STATE_V1_RESUMING);
>      if (ret) {
>          error_report("%s: Failed to set state RESUMING", vbasedev->name);
>          if (migration->region.mmaps) {
> @@ -717,20 +717,20 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>           * In both the above cases, set _RUNNING bit.
>           */
>          mask = ~VFIO_DEVICE_STATE_MASK;
> -        value = VFIO_DEVICE_STATE_RUNNING;
> +        value = VFIO_DEVICE_STATE_V1_RUNNING;
>      } else {
>          /*
>           * Here device state could be either _RUNNING or _SAVING|_RUNNING. Reset
>           * _RUNNING bit
>           */
> -        mask = ~VFIO_DEVICE_STATE_RUNNING;
> +        mask = ~VFIO_DEVICE_STATE_V1_RUNNING;
>  
>          /*
>           * When VM state transition to stop for savevm command, device should
>           * start saving data.
>           */
>          if (state == RUN_STATE_SAVE_VM) {
> -            value = VFIO_DEVICE_STATE_SAVING;
> +            value = VFIO_DEVICE_STATE_V1_SAVING;
>          } else {
>              value = 0;
>          }
> @@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>      case MIGRATION_STATUS_CANCELLED:
>      case MIGRATION_STATUS_FAILED:
>          bytes_transferred = 0;
> -        ret = vfio_migration_set_state(vbasedev,
> -                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> -                      VFIO_DEVICE_STATE_RUNNING);
> +        ret = vfio_migration_set_state(
> +            vbasedev,
> +            ~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
> +            VFIO_DEVICE_STATE_V1_RUNNING);

Yikes!  Please follow the line wrapping used elsewhere.  There's no need
to put the first arg on a new line and subsequent wrapped lines should
be indented to match the previous line, or at least to avoid wrapping
itself.  Here we can use something like:

        ret = vfio_migration_set_state(vbasedev,
                                       ~(VFIO_DEVICE_STATE_V1_SAVING |
                                         VFIO_DEVICE_STATE_V1_RESUMING),
                                       VFIO_DEVICE_STATE_V1_RUNNING);

Thanks,
Alex

>          if (ret) {
>              error_report("%s: Failed to set state RUNNING", vbasedev->name);
>          }
> @@ -864,8 +865,10 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>          goto add_blocker;
>      }
>  
> -    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> -                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    ret = vfio_get_dev_region_info(vbasedev,
> +                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> +                                   &info);
>      if (ret) {
>          goto add_blocker;
>      }



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2
  2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (8 preceding siblings ...)
  2022-05-12 15:43 ` [PATCH 9/9] docs/devel: Align vfio-migration docs to VFIO migration v2 Avihai Horon
@ 2022-05-12 18:02 ` Alex Williamson
  2022-05-13  7:04   ` Cornelia Huck
  9 siblings, 1 reply; 43+ messages in thread
From: Alex Williamson @ 2022-05-12 18:02 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Juan Quintela, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Thomas Huth, Richard Henderson, Matthew Rosato

On Thu, 12 May 2022 18:43:11 +0300
Avihai Horon <avihaih@nvidia.com> wrote:

> Hello,
> 
> Following VFIO migration protocol v2 acceptance in kernel, this series
> implements VFIO migration according to the new v2 protocol and replaces
> the now deprecated v1 implementation.

Let's not make others who are waiting on a linux header file update
also wait on incorporating v2 support.  In the short term we just need
the first two patches here.

Are there any objections to folding those patches together for the sake
of bisection?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation
  2022-05-12 17:57   ` Alex Williamson
@ 2022-05-12 18:25     ` Jason Gunthorpe
  2022-05-12 21:11       ` Alex Williamson
  2022-05-13  7:08     ` Cornelia Huck
  1 sibling, 1 reply; 43+ messages in thread
From: Jason Gunthorpe @ 2022-05-12 18:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Michael S . Tsirkin, Cornelia Huck,
	Paolo Bonzini, Juan Quintela, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Thu, May 12, 2022 at 11:57:10AM -0600, Alex Williamson wrote:
> > @@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> >      case MIGRATION_STATUS_CANCELLED:
> >      case MIGRATION_STATUS_FAILED:
> >          bytes_transferred = 0;
> > -        ret = vfio_migration_set_state(vbasedev,
> > -                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> > -                      VFIO_DEVICE_STATE_RUNNING);
> > +        ret = vfio_migration_set_state(
> > +            vbasedev,
> > +            ~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
> > +            VFIO_DEVICE_STATE_V1_RUNNING);
> 
> Yikes!  Please follow the line wrapping used elsewhere.  There's no need
> to put the first arg on a new line and subsequent wrapped lines should
> be indented to match the previous line, or at least to avoid wrapping
> itself.  Here we can use something like:

This is generated by clang-format with one of the qemu styles; it
follows the documented guide:

 In case of function, there are several variants:

 - 4 spaces indent from the beginning
 - align the secondary lines just after the opening parenthesis of the
   first

clang-format selected the first option due to its optimization
algorithm.

Knowing nothing about qemu, I am confused??

Jason


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation
  2022-05-12 18:25     ` Jason Gunthorpe
@ 2022-05-12 21:11       ` Alex Williamson
  2022-05-12 23:32         ` Jason Gunthorpe
  0 siblings, 1 reply; 43+ messages in thread
From: Alex Williamson @ 2022-05-12 21:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Michael S . Tsirkin, Cornelia Huck,
	Paolo Bonzini, Juan Quintela, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Thu, 12 May 2022 15:25:32 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, May 12, 2022 at 11:57:10AM -0600, Alex Williamson wrote:
> > > @@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> > >      case MIGRATION_STATUS_CANCELLED:
> > >      case MIGRATION_STATUS_FAILED:
> > >          bytes_transferred = 0;
> > > -        ret = vfio_migration_set_state(vbasedev,
> > > -                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> > > -                      VFIO_DEVICE_STATE_RUNNING);
> > > +        ret = vfio_migration_set_state(
> > > +            vbasedev,
> > > +            ~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
> > > +            VFIO_DEVICE_STATE_V1_RUNNING);  
> > 
> > Yikes!  Please follow the line wrapping used elsewhere.  There's no need
> > to put the first arg on a new line and subsequent wrapped lines should
> > be indented to match the previous line, or at least to avoid wrapping
> > itself.  Here we can use something like:  
> 
> This is generated by clang-format with one of the qemu styles; it
> follows the documented guide:
> 
>  In case of function, there are several variants:
> 
>  - 4 spaces indent from the beginning
>  - align the secondary lines just after the opening parenthesis of the
>    first
> 
> clang-format selected the first option due to its optimization
> algorithm.
> 
> Knowing nothing about qemu, I am confused??

Maybe someone needs to throw more AI models at clang-format so that it
considers the more readable option?  QEMU does a lot wrong with style
imo, and maybe it's technically compliant as written, but I think what
I proposed is also compliant, as well as more readable and more
consistent with the existing file.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation
  2022-05-12 21:11       ` Alex Williamson
@ 2022-05-12 23:32         ` Jason Gunthorpe
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Gunthorpe @ 2022-05-12 23:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Michael S . Tsirkin, Cornelia Huck,
	Paolo Bonzini, Juan Quintela, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Thu, May 12, 2022 at 03:11:40PM -0600, Alex Williamson wrote:
> On Thu, 12 May 2022 15:25:32 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, May 12, 2022 at 11:57:10AM -0600, Alex Williamson wrote:
> > > > @@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> > > >      case MIGRATION_STATUS_CANCELLED:
> > > >      case MIGRATION_STATUS_FAILED:
> > > >          bytes_transferred = 0;
> > > > -        ret = vfio_migration_set_state(vbasedev,
> > > > -                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> > > > -                      VFIO_DEVICE_STATE_RUNNING);
> > > > +        ret = vfio_migration_set_state(
> > > > +            vbasedev,
> > > > +            ~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
> > > > +            VFIO_DEVICE_STATE_V1_RUNNING);  
> > > 
> > > Yikes!  Please follow the line wrapping used elsewhere.  There's no need
> > > to put the first arg on a new line and subsequent wrapped lines should
> > > be indented to match the previous line, or at least to avoid wrapping
> > > itself.  Here we can use something like:  
> > 
> > This is generated by clang-format with one of the qemu styles; it
> > follows the documented guide:
> > 
> >  In case of function, there are several variants:
> > 
> >  - 4 spaces indent from the beginning
> >  - align the secondary lines just after the opening parenthesis of the
> >    first
> > 
> > clang-format selected the first option due to its optimization
> > algorithm.
> > 
> > Knowing nothing about qemu, I am confused??
> 
> Maybe someone needs to throw more AI models at clang-format so that it
> considers the more readable option?  QEMU does a lot wrong with style
> imo, and maybe it's technically compliant as written, but I think what
> I proposed is also compliant, as well as more readable and more
> consistent with the existing file.  Thanks,

Let Avihai know any indenting you don't like and he will fix it.

IIRC clang scores line-breaking an expression as worse than going to
the smaller indent. Personally I would agree with this.
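
FWIW the knob involved is a penalty value.  A minimal .clang-format
sketch (not qemu's actual config; the options are stock clang-format
ones) that would steer the optimizer toward the aligned variant instead
of breaking before the first argument:

    # Sketch only -- approximates the aligned style variant.
    BasedOnStyle: LLVM
    IndentWidth: 4
    ContinuationIndentWidth: 4
    # Align wrapped arguments just after the opening parenthesis.
    AlignAfterOpenBracket: Align
    # Make breaking before the first call parameter expensive, so the
    # optimizer prefers alignment over the 4-space continuation indent.
    PenaltyBreakBeforeFirstCallParameter: 400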

Thanks,
Jason


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2
  2022-05-12 18:02 ` [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Alex Williamson
@ 2022-05-13  7:04   ` Cornelia Huck
  0 siblings, 0 replies; 43+ messages in thread
From: Cornelia Huck @ 2022-05-13  7:04 UTC (permalink / raw)
  To: Alex Williamson, Avihai Horon
  Cc: qemu-devel, Michael S . Tsirkin, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Yishai Hadas, Jason Gunthorpe,
	Mark Bloch, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Thomas Huth, Richard Henderson, Matthew Rosato

On Thu, May 12 2022, Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 12 May 2022 18:43:11 +0300
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> Hello,
>> 
>> Following VFIO migration protocol v2 acceptance in kernel, this series
>> implements VFIO migration according to the new v2 protocol and replaces
>> the now deprecated v1 implementation.
>
> Let's not make others who are waiting on a linux header file update
> also wait on incorporating v2 support.  In the short term we just need
> the first two patches here.
>
> Are there any objections to folding those patches together for the sake
> of bisection?  Thanks,
>
> Alex

I think folding the headers update and the fixup together makes a lot of
sense. And yes, I'd like to see it in QEMU quickly in order to unblock
other series.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation
  2022-05-12 17:57   ` Alex Williamson
  2022-05-12 18:25     ` Jason Gunthorpe
@ 2022-05-13  7:08     ` Cornelia Huck
  1 sibling, 0 replies; 43+ messages in thread
From: Cornelia Huck @ 2022-05-13  7:08 UTC (permalink / raw)
  To: Alex Williamson, Avihai Horon
  Cc: qemu-devel, Michael S . Tsirkin, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Yishai Hadas, Jason Gunthorpe,
	Mark Bloch, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On Thu, May 12 2022, Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 12 May 2022 18:43:13 +0300
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> @@ -767,9 +767,10 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>>      case MIGRATION_STATUS_CANCELLED:
>>      case MIGRATION_STATUS_FAILED:
>>          bytes_transferred = 0;
>> -        ret = vfio_migration_set_state(vbasedev,
>> -                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
>> -                      VFIO_DEVICE_STATE_RUNNING);
>> +        ret = vfio_migration_set_state(
>> +            vbasedev,
>> +            ~(VFIO_DEVICE_STATE_V1_SAVING | VFIO_DEVICE_STATE_V1_RESUMING),
>> +            VFIO_DEVICE_STATE_V1_RUNNING);
>
> Yikes!  Please follow the line wrapping used elsewhere.  There's no need
> to put the first arg on a new line and subsequent wrapped lines should
> be indented to match the previous line, or at least to avoid wrapping
> itself.  Here we can use something like:
>
>         ret = vfio_migration_set_state(vbasedev,
>                                        ~(VFIO_DEVICE_STATE_V1_SAVING |
>                                          VFIO_DEVICE_STATE_V1_RESUMING),
>                                        VFIO_DEVICE_STATE_V1_RUNNING);

FWIW, I'd prefer this variant as well.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/9] vfio/migration: Fix NULL pointer dereference bug
  2022-05-12 15:43 ` [PATCH 3/9] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
@ 2022-05-16 11:15   ` Juan Quintela
  2022-05-17 12:28     ` Avihai Horon
  0 siblings, 1 reply; 43+ messages in thread
From: Juan Quintela @ 2022-05-16 11:15 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

Avihai Horon <avihaih@nvidia.com> wrote:
> As part of its error flow, vfio_vmstate_change() accesses
> MigrationState->to_dst_file without any checks. This can cause a NULL
> pointer dereference if the error flow is taken and
> MigrationState->to_dst_file is not set.
>
> For example, this can happen if VM is started or stopped not during
> migration and vfio_vmstate_change() error flow is taken, as
> MigrationState->to_dst_file is not set at that time.
>
> Fix it by checking that MigrationState->to_dst_file is set before using
> it.
>
> Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM")
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/migration.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 835608cd23..21e8f9d4d4 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -744,7 +744,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>           */
>          error_report("%s: Failed to set device state 0x%x", vbasedev->name,
>                       (migration->device_state & mask) | value);
> -        qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
> +        if (migrate_get_current()->to_dst_file) {
> +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
> +        }
>      }

Reviewed-by: Juan Quintela <quintela@redhat.com>

I mean that the change is right.

But I am not so sure about using qemu_file_* operations outside of the
migration_thread.  I *think* that everything else that uses qemu_file_*
functions is operating inside the migration_thread, and this one doesn't
look like it.  Furthermore, from a quick look at the qemu source, I can't
see any place where we set up an error in a vmstate_change.

Later, Juan.

>      vbasedev->migration->vm_running = running;
>      trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-12 15:43 ` [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported Avihai Horon
@ 2022-05-16 11:22   ` Juan Quintela
  2022-05-16 20:22     ` Alex Williamson
  2022-05-20 10:58   ` Joao Martins
  1 sibling, 1 reply; 43+ messages in thread
From: Juan Quintela @ 2022-05-16 11:22 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

Avihai Horon <avihaih@nvidia.com> wrote:
> Currently, if IOMMU of a VFIO container doesn't support dirty page
> tracking, migration is blocked completely. This is because a DMA-able
> VFIO device can dirty RAM pages without updating QEMU about it, thus
> breaking the migration.
>
> However, this doesn't mean that migration can't be done at all. If
> migration pre-copy phase is skipped, the VFIO device doesn't have a
> chance to dirty RAM pages that have been migrated already, thus
> eliminating the problem previously mentioned.
>
> Hence, in such case allow migration but skip pre-copy phase.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

I don't know (TM).
Several issues:
- Patch is ugly as hell (ok, that depends on taste)
- It changes migration_iteration_run() instead of
  migration_thread directly.
- There is already another case where we skip the sending of RAM
  (localhost migration with shared memory)

In migration/ram.c:

static int ram_find_and_save_block(RAMState *rs, bool last_stage)
{
    PageSearchStatus pss;
    int pages = 0;
    bool again, found;

    /* No dirty page as there is zero RAM */
    if (!ram_bytes_total()) {
        return pages;
    }

This is the other place where we _don't_ send any RAM at all.

I don't have a great idea about how to make things clear at a higher
level, I have to think about this.

Later, Juan.

> ---
>  hw/vfio/migration.c   | 9 ++++++++-
>  migration/migration.c | 5 +++++
>  migration/migration.h | 3 +++
>  3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 21e8f9d4d4..d4b6653026 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>      struct vfio_region_info *info = NULL;
>      int ret = -ENOTSUP;
>  
> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration) {
>          goto add_blocker;
>      }
>  
> +    if (!container->dirty_pages_supported) {
> +        warn_report(
> +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
> +            vbasedev->name);
> +        migrate_get_current()->skip_precopy = true;
> +    }
> +
>      ret = vfio_get_dev_region_info(vbasedev,
>                                     VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>                                     VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> diff --git a/migration/migration.c b/migration/migration.c
> index 5a31b23bd6..668343508d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3593,6 +3593,11 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>      uint64_t pending_size, pend_pre, pend_compat, pend_post;
>      bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
>  
> +    if (s->skip_precopy) {
> +        migration_completion(s);
> +        return MIG_ITERATE_BREAK;
> +    }
> +
>      qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
>                                &pend_compat, &pend_post);
>      pending_size = pend_pre + pend_compat + pend_post;
> diff --git a/migration/migration.h b/migration/migration.h
> index a863032b71..876713e7e1 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -332,6 +332,9 @@ struct MigrationState {
>       * This save hostname when out-going migration starts
>       */
>      char *hostname;
> +
> +    /* Whether to skip pre-copy phase of migration or not */
> +    bool skip_precopy;
>  };
>  
>  void migrate_set_state(int *state, int old_state, int new_state);



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd()
  2022-05-12 15:43 ` [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd() Avihai Horon
@ 2022-05-16 11:31   ` Juan Quintela
  2022-05-17 12:36     ` Avihai Horon
  0 siblings, 1 reply; 43+ messages in thread
From: Juan Quintela @ 2022-05-16 11:31 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

Avihai Horon <avihaih@nvidia.com> wrote:
> Add new function qemu_file_get_to_fd() that allows reading data from
> QEMUFile and writing it straight into a given fd.
>
> This will be used later in VFIO migration code.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  migration/qemu-file.c | 34 ++++++++++++++++++++++++++++++++++
>  migration/qemu-file.h |  1 +
>  2 files changed, 35 insertions(+)
>
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 1479cddad9..cad3d32eb3 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -867,3 +867,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
>  {
>      return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
>  }
> +
> +/*
> + * Read size bytes from QEMUFile f and write them to fd.
> + */
> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
> +{
> +    while (size) {
> +        size_t pending = f->buf_size - f->buf_index;
> +        ssize_t rc;
> +
> +        if (!pending) {
> +            rc = qemu_fill_buffer(f);
> +            if (rc < 0) {
> +                return rc;
> +            }
> +            if (rc == 0) {
> +                return -1;
> +            }
> +            continue;
> +        }
> +
> +        rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
> +        if (rc < 0) {
> +            return rc;
> +        }
> +        if (rc == 0) {
> +            return -1;
> +        }
> +        f->buf_index += rc;
> +        size -= rc;
> +    }
> +
> +    return 0;
> +}

Is there really a performance difference compared to just using:

uint8_t buffer[size];

qemu_get_buffer(f, buffer, size);
write(fd, buffer, size);

Or to put it another way, what sizes are we talking about here?

Thanks, Juan.


> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> index 3f36d4dc8c..dd26037450 100644
> --- a/migration/qemu-file.h
> +++ b/migration/qemu-file.h
> @@ -162,6 +162,7 @@ int qemu_file_shutdown(QEMUFile *f);
>  QEMUFile *qemu_file_get_return_path(QEMUFile *f);
>  void qemu_fflush(QEMUFile *f);
>  void qemu_file_set_blocking(QEMUFile *f, bool block);
> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
>  
>  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
>  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-16 11:22   ` Juan Quintela
@ 2022-05-16 20:22     ` Alex Williamson
  2022-05-16 23:08       ` Jason Gunthorpe
  0 siblings, 1 reply; 43+ messages in thread
From: Alex Williamson @ 2022-05-16 20:22 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Avihai Horon, qemu-devel, Michael S . Tsirkin, Cornelia Huck,
	Paolo Bonzini, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Mon, 16 May 2022 13:22:14 +0200
Juan Quintela <quintela@redhat.com> wrote:

> Avihai Horon <avihaih@nvidia.com> wrote:
> > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > tracking, migration is blocked completely. This is because a DMA-able
> > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > breaking the migration.
> >
> > However, this doesn't mean that migration can't be done at all. If
> > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > chance to dirty RAM pages that have been migrated already, thus
> > eliminating the problem previously mentioned.
> >
> > Hence, in such case allow migration but skip pre-copy phase.
> >
> > Signed-off-by: Avihai Horon <avihaih@nvidia.com>  
> 
> I don't know (TM).
> Several issues:
> - Patch is ugly as hell (ok, that depends on taste)
> > - It changes migration_iteration_run() instead of
> >   migration_thread directly.
> - There is already another case where we skip the sending of RAM
>   (localhost migration with shared memory)
> 
> In migration/ram.c:
> 
> static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> {
>     PageSearchStatus pss;
>     int pages = 0;
>     bool again, found;
> 
>     /* No dirty page as there is zero RAM */
>     if (!ram_bytes_total()) {
>         return pages;
>     }
> 
> This is the other place where we _don't_ send any RAM at all.
> 
> I don't have a great idea about how to make things clear at a higher
> level, I have to think about this.

It seems like if we have devices dictating what type of migrations can
be performed then there probably needs to be a switch to restrict use of
such devices just as we have the -only-migratable switch now to prevent
attaching devices that don't support migration.  I'd guess that we need
the switch to opt-in to allowing such devices to maintain
compatibility.  There's probably a whole pile of qapi things missing to
expose this to management tools as well.
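
For comparison, the existing switch is just a command-line flag today
(the device address below is made up):

    # Refuses to run if any attached device would add a migration blocker:
    qemu-system-x86_64 -only-migratable \
        -device vfio-pci,host=0000:3b:00.0 ...

Thanks,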

Alex

> > ---
> >  hw/vfio/migration.c   | 9 ++++++++-
> >  migration/migration.c | 5 +++++
> >  migration/migration.h | 3 +++
> >  3 files changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 21e8f9d4d4..d4b6653026 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> >      struct vfio_region_info *info = NULL;
> >      int ret = -ENOTSUP;
> >  
> > -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> > +    if (!vbasedev->enable_migration) {
> >          goto add_blocker;
> >      }
> >  
> > +    if (!container->dirty_pages_supported) {
> > +        warn_report(
> > +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
> > +            vbasedev->name);
> > +        migrate_get_current()->skip_precopy = true;
> > +    }
> > +
> >      ret = vfio_get_dev_region_info(vbasedev,
> >                                     VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> >                                     VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5a31b23bd6..668343508d 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3593,6 +3593,11 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> >      uint64_t pending_size, pend_pre, pend_compat, pend_post;
> >      bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> >  
> > +    if (s->skip_precopy) {
> > +        migration_completion(s);
> > +        return MIG_ITERATE_BREAK;
> > +    }
> > +
> >      qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
> >                                &pend_compat, &pend_post);
> >      pending_size = pend_pre + pend_compat + pend_post;
> > diff --git a/migration/migration.h b/migration/migration.h
> > index a863032b71..876713e7e1 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -332,6 +332,9 @@ struct MigrationState {
> >       * This save hostname when out-going migration starts
> >       */
> >      char *hostname;
> > +
> > +    /* Whether to skip pre-copy phase of migration or not */
> > +    bool skip_precopy;
> >  };
> >  
> >  void migrate_set_state(int *state, int old_state, int new_state);  
> 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-16 20:22     ` Alex Williamson
@ 2022-05-16 23:08       ` Jason Gunthorpe
  2022-05-17 16:00         ` Alex Williamson
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Gunthorpe @ 2022-05-16 23:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Juan Quintela, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Mon, May 16, 2022 at 02:22:00PM -0600, Alex Williamson wrote:
> On Mon, 16 May 2022 13:22:14 +0200
> Juan Quintela <quintela@redhat.com> wrote:
> 
> > Avihai Horon <avihaih@nvidia.com> wrote:
> > > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > > tracking, migration is blocked completely. This is because a DMA-able
> > > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > > breaking the migration.
> > >
> > > However, this doesn't mean that migration can't be done at all. If
> > > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > > chance to dirty RAM pages that have been migrated already, thus
> > > eliminating the problem previously mentioned.
> > >
> > > Hence, in such case allow migration but skip pre-copy phase.
> > >
> > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>  
> > 
> > I don't know (TM).
> > Several issues:
> > - Patch is ugly as hell (ok, that depends on taste)
> > - It changes migration_iteration_run() instead of
> >   migration_thread directly.
> > - There is already another case where we skip the sending of RAM
> >   (localhost migration with shared memory)
> > 
> > In migration/ram.c:
> > 
> > static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > {
> >     PageSearchStatus pss;
> >     int pages = 0;
> >     bool again, found;
> > 
> >     /* No dirty page as there is zero RAM */
> >     if (!ram_bytes_total()) {
> >         return pages;
> >     }
> > 
> > This is the other place where we _don't_ send any RAM at all.
> > 
> > I don't have a great idea about how to make things clear at a higher
> > level, I have to think about this.
> 
> It seems like if we have devices dictating what type of migrations can
> be performed then there probably needs to be a switch to restrict use of
> such devices just as we have the -only-migratable switch now to prevent
> attaching devices that don't support migration.  I'd guess that we need
> the switch to opt-in to allowing such devices to maintain
> compatibility.  There's probably a whole pile of qapi things missing to
> expose this to management tools as well.  Thanks,

This is really intended to be a NOP from where things are now, as if
you use mlx5 live migration without a patch like this then it causes a
botched pre-copy since everything just ends up permanently dirty.

If it makes more sense we could abort the pre-copy too - in the end
there will be dirty tracking so I don't know if I'd invest in a big
adventure to fully define non-dirty tracking migration.

Jason


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/9] vfio/migration: Fix NULL pointer dereference bug
  2022-05-16 11:15   ` Juan Quintela
@ 2022-05-17 12:28     ` Avihai Horon
  2022-05-18 11:36       ` Juan Quintela
  0 siblings, 1 reply; 43+ messages in thread
From: Avihai Horon @ 2022-05-17 12:28 UTC (permalink / raw)
  To: quintela
  Cc: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta


On 5/16/2022 2:15 PM, Juan Quintela wrote:
>
> Avihai Horon <avihaih@nvidia.com> wrote:
>> As part of its error flow, vfio_vmstate_change() accesses
>> MigrationState->to_dst_file without any checks. This can cause a NULL
>> pointer dereference if the error flow is taken and
>> MigrationState->to_dst_file is not set.
>>
>> For example, this can happen if VM is started or stopped not during
>> migration and vfio_vmstate_change() error flow is taken, as
>> MigrationState->to_dst_file is not set at that time.
>>
>> Fix it by checking that MigrationState->to_dst_file is set before using
>> it.
>>
>> Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM")
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/migration.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 835608cd23..21e8f9d4d4 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -744,7 +744,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>            */
>>           error_report("%s: Failed to set device state 0x%x", vbasedev->name,
>>                        (migration->device_state & mask) | value);
>> -        qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
>> +        if (migrate_get_current()->to_dst_file) {
>> +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
>> +        }
>>       }
> Reviewed-by: Juan Quintela <quintela@redhat.com>
>
> I mean that the change is right.
>
> But I am not so sure about using qemu_file_* operations outside of the
> migration_thread.  I *think* that everything else that uses qemu_file_*
> functions is operating inside the migration_thread, and this one doesn't
> look like it.  Furthermore, from a quick look at the qemu source, I can't
> see any place where we set up an error in a vmstate_change.

vfio_vmstate_change() will operate inside migration_thread if 
migration_thread is the one that called vm start/stop.

In cases where vm start/stop was not called by migration_thread, it will 
operate outside of migration_thread. But I think in such cases 
to_dst_file should not be set.

Ideally we should have returned error, but vm_state_notify() doesn't 
report errors.

Maybe later we can change vm_state_notify() to support error reporting, 
or instead of using to_dst_file to indicate an error, update some flag 
in VFIOMigration.
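
Something like this, roughly (a sketch only -- the field name is made
up, not an actual patch):

    typedef struct VFIOMigration {
        /* ... existing fields ... */
        int device_state_error; /* last failed state transition, 0 if none */
    } VFIOMigration;

    /* In vfio_vmstate_change()'s error path, instead of poking
     * to_dst_file directly: */
    migration->device_state_error = ret;

    /* and migration code checks it at a safe point, e.g. in
     * vfio_save_setup():
     *
     *     if (migration->device_state_error) {
     *         return migration->device_state_error;
     *     }
     */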


Thanks.

>
> Later, Juan.
>
>>       vbasedev->migration->vm_running = running;
>>       trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd()
  2022-05-16 11:31   ` Juan Quintela
@ 2022-05-17 12:36     ` Avihai Horon
  2022-05-18 11:54       ` Juan Quintela
  0 siblings, 1 reply; 43+ messages in thread
From: Avihai Horon @ 2022-05-17 12:36 UTC (permalink / raw)
  To: quintela
  Cc: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta


On 5/16/2022 2:31 PM, Juan Quintela wrote:
>
> Avihai Horon <avihaih@nvidia.com> wrote:
>> Add new function qemu_file_get_to_fd() that allows reading data from
>> QEMUFile and writing it straight into a given fd.
>>
>> This will be used later in VFIO migration code.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   migration/qemu-file.c | 34 ++++++++++++++++++++++++++++++++++
>>   migration/qemu-file.h |  1 +
>>   2 files changed, 35 insertions(+)
>>
>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>> index 1479cddad9..cad3d32eb3 100644
>> --- a/migration/qemu-file.c
>> +++ b/migration/qemu-file.c
>> @@ -867,3 +867,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
>>   {
>>       return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
>>   }
>> +
>> +/*
>> + * Read size bytes from QEMUFile f and write them to fd.
>> + */
>> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
>> +{
>> +    while (size) {
>> +        size_t pending = f->buf_size - f->buf_index;
>> +        ssize_t rc;
>> +
>> +        if (!pending) {
>> +            rc = qemu_fill_buffer(f);
>> +            if (rc < 0) {
>> +                return rc;
>> +            }
>> +            if (rc == 0) {
>> +                return -1;
>> +            }
>> +            continue;
>> +        }
>> +
>> +        rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
>> +        if (rc < 0) {
>> +            return rc;
>> +        }
>> +        if (rc == 0) {
>> +            return -1;
>> +        }
>> +        f->buf_index += rc;
>> +        size -= rc;
>> +    }
>> +
>> +    return 0;
>> +}
> Is there really a performance difference compared to just using:
>
> uint8_t buffer[size];
>
> qemu_get_buffer(f, buffer, size);
> write(fd, buffer, size);
>
> Or to put it another way, what sizes are we talking about here?

It depends on the device, but it can range from a few MBs to several GBs 
AFAIK.
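
Note that with sizes in that range, the stack VLA in the quoted snippet
would be a problem on its own.  A heap-based chunked equivalent (a
sketch only; chunk size arbitrary, short reads/writes not retried)
would look something like:

    const size_t chunk = 1024 * 1024;
    uint8_t *buffer = g_malloc(chunk);

    while (size) {
        size_t len = MIN(size, chunk);

        qemu_get_buffer(f, buffer, len);     /* copy #1: f->buf -> buffer */
        if (write(fd, buffer, len) != len) { /* copy #2: buffer -> fd */
            g_free(buffer);
            return -1;
        }
        size -= len;
    }
    g_free(buffer);
    return 0;

so the extra copy is what qemu_file_get_to_fd() avoids by writing
straight out of the QEMUFile buffer.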

Thanks.

>
> Thanks, Juan.
>
>
>> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
>> index 3f36d4dc8c..dd26037450 100644
>> --- a/migration/qemu-file.h
>> +++ b/migration/qemu-file.h
>> @@ -162,6 +162,7 @@ int qemu_file_shutdown(QEMUFile *f);
>>   QEMUFile *qemu_file_get_return_path(QEMUFile *f);
>>   void qemu_fflush(QEMUFile *f);
>>   void qemu_file_set_blocking(QEMUFile *f, bool block);
>> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
>>
>>   void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
>>   void ram_control_after_iterate(QEMUFile *f, uint64_t flags);


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-16 23:08       ` Jason Gunthorpe
@ 2022-05-17 16:00         ` Alex Williamson
  2022-05-17 16:08           ` Jason Gunthorpe
  0 siblings, 1 reply; 43+ messages in thread
From: Alex Williamson @ 2022-05-17 16:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Juan Quintela, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Mon, 16 May 2022 20:08:32 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 16, 2022 at 02:22:00PM -0600, Alex Williamson wrote:
> > On Mon, 16 May 2022 13:22:14 +0200
> > Juan Quintela <quintela@redhat.com> wrote:
> >   
> > > Avihai Horon <avihaih@nvidia.com> wrote:  
> > > > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > > > tracking, migration is blocked completely. This is because a DMA-able
> > > > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > > > breaking the migration.
> > > >
> > > > However, this doesn't mean that migration can't be done at all. If
> > > > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > > > chance to dirty RAM pages that have been migrated already, thus
> > > > eliminating the problem previously mentioned.
> > > >
> > > > Hence, in such case allow migration but skip pre-copy phase.
> > > >
> > > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>    
> > > 
> > > I don't know (TM).
> > > Several issues:
> > > - Patch is ugly as hell (ok, that depends on taste)
> > > - It changes migration_iteration_run() instead of
> > >   migration_thread directly.
> > > - There is already another case where we skip the sending of RAM
> > >   (localhost migration with shared memory)
> > > 
> > > In migration/ram.c:
> > > 
> > > static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > > {
> > >     PageSearchStatus pss;
> > >     int pages = 0;
> > >     bool again, found;
> > > 
> > >     /* No dirty page as there is zero RAM */
> > >     if (!ram_bytes_total()) {
> > >         return pages;
> > >     }
> > > 
> > > This is the other place where we _don't_ send any RAM at all.
> > > 
> > > I don't have a great idea about how to make things clear at a higher
> > > level, I have to think about this.  
> > 
> > It seems like if we have devices dictating what type of migrations can
> > be performed then there probably needs to be a switch to restrict use of
> > such devices just as we have the -only-migratable switch now to prevent
> > attaching devices that don't support migration.  I'd guess that we need
> > the switch to opt-in to allowing such devices to maintain
> > compatibility.  There's probably a whole pile of qapi things missing to
> > expose this to management tools as well.  Thanks,  
> 
> This is really intended to be a NOP from where things are now, as if
> you use mlx5 live migration without a patch like this then it causes a
> botched pre-copy since everything just ends up permanently dirty.
> 
> If it makes more sense we could abort the pre-copy too - in the end
> there will be dirty tracking so I don't know if I'd invest in a big
> adventure to fully define non-dirty tracking migration.

How is pre-copy currently "botched" without a patch like this?  If it's
simply that the pre-copy doesn't converge and the downtime constraints
don't allow the VM to enter stop-and-copy, that's the expected behavior
AIUI, and supports backwards compatibility with existing SLAs.

I'm assuming that by setting this new skip_precopy flag that we're
forcing the VM to move to stop-and-copy, regardless of any other SLA
constraints placed on the migration.  It seems like a better solution
would be to expose to management tools that the VM contains a device
that does not support the pre-copy phase so that downtime expectations
can be adjusted.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-17 16:00         ` Alex Williamson
@ 2022-05-17 16:08           ` Jason Gunthorpe
  2022-05-17 17:22             ` Alex Williamson
  2022-05-18 11:39             ` Juan Quintela
  0 siblings, 2 replies; 43+ messages in thread
From: Jason Gunthorpe @ 2022-05-17 16:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Juan Quintela, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:

> > This is really intended to be a NOP from where things are now, as if
> > you use mlx5 live migration without a patch like this then it causes a
> > botched pre-copy since everything just ends up permanently dirty.
> > 
> > If it makes more sense we could abort the pre-copy too - in the end
> > there will be dirty tracking so I don't know if I'd invest in a big
> > adventure to fully define non-dirty tracking migration.
> 
> How is pre-copy currently "botched" without a patch like this?  If it's
> simply that the pre-copy doesn't converge and the downtime constraints
> don't allow the VM to enter stop-and-copy, that's the expected behavior
> AIUI, and supports backwards compatibility with existing SLAs.

It means it always fails - that certainly isn't working live
migration. There is no point in trying to converge something that we
already know will never converge.

> I'm assuming that by setting this new skip_precopy flag that we're
> forcing the VM to move to stop-and-copy, regardless of any other SLA
> constraints placed on the migration.  

That does seem like a defect in this patch; any SLA constraints should
still all be checked under the assumption that all RAM is dirty.

> It seems like a better solution would be to expose to management
> tools that the VM contains a device that does not support the
> pre-copy phase so that downtime expectations can be adjusted.

I don't expect this to be a real use case though..

Remember, you asked for this patch when you wanted qemu to have good
behavior when kernel support for legacy dirty tracking is removed
before we merge v2 support.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-17 16:08           ` Jason Gunthorpe
@ 2022-05-17 17:22             ` Alex Williamson
  2022-05-17 17:39               ` Jason Gunthorpe
  2022-05-18 11:39             ` Juan Quintela
  1 sibling, 1 reply; 43+ messages in thread
From: Alex Williamson @ 2022-05-17 17:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Juan Quintela, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Tue, 17 May 2022 13:08:44 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:
> 
> > > This is really intended to be a NOP from where things are now, as if
> > > you use mlx5 live migration without a patch like this then it causes a
> > > botched pre-copy since everything just ends up permanently dirty.
> > > 
> > > If it makes more sense we could abort the pre-copy too - in the end
> > > there will be dirty tracking so I don't know if I'd invest in a big
> > > adventure to fully define non-dirty tracking migration.  
> > 
> > How is pre-copy currently "botched" without a patch like this?  If it's
> > simply that the pre-copy doesn't converge and the downtime constraints
> > don't allow the VM to enter stop-and-copy, that's the expected behavior
> > AIUI, and supports backwards compatibility with existing SLAs.  
> 
> It means it always fails - that certainly isn't working live
> migration. There is no point in trying to converge something that we
> already know will never converge.

If we eliminate the pre-copy phase then it's not so much live migration
anyway.  Trying to converge is indeed useless work, but afaik it's that
useless work that generates the data that management tools can use to
determine that SLAs cannot be achieved in a compatible way.
 
> > I'm assuming that by setting this new skip_precopy flag that we're
> > forcing the VM to move to stop-and-copy, regardless of any other SLA
> > constraints placed on the migration.    
> 
> That does seem like a defect in this patch; any SLA constraints should
> still all be checked under the assumption that all RAM is dirty.

The migration iteration function certainly tries to compare remaining
bytes to a threshold based on bandwidth and downtime.  The exit path
added here is the same as it would take if we had achieved our
threshold limit.  It's not clear to me that we're checking the downtime
limit elsewhere or have the data to do it if we don't transfer anything
to estimate bandwidth.
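
If we were to keep the check, something like this (a sketch, untested)
could replace the quoted hunk, running the usual threshold comparison
with all of RAM assumed dirty:

    if (s->skip_precopy) {
        /* Assume every RAM page is dirty for the SLA check. */
        uint64_t pending_size = ram_bytes_total();

        if (pending_size > s->threshold_size) {
            /* The downtime limit cannot be met; fail fast instead of
             * silently blowing the SLA. */
            migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
                              MIGRATION_STATUS_FAILED);
            return MIG_ITERATE_BREAK;
        }
        migration_completion(s);
        return MIG_ITERATE_BREAK;
    }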

> > It seems like a better solution would be to expose to management
> > tools that the VM contains a device that does not support the
> > pre-copy phase so that downtime expectations can be adjusted.  
> 
> I don't expect this to be a real use case though..
> 
> Remember, you asked for this patch when you wanted qemu to have good
> behavior when kernel support for legacy dirty tracking is removed
> before we merge v2 support.

Is wanting good behavior a controversial point?  Did we define this as
the desired good behavior?  Ref?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-17 17:22             ` Alex Williamson
@ 2022-05-17 17:39               ` Jason Gunthorpe
  2022-05-18  3:46                 ` Alex Williamson
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Gunthorpe @ 2022-05-17 17:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Juan Quintela, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Tue, May 17, 2022 at 11:22:32AM -0600, Alex Williamson wrote:

> > > It seems like a better solution would be to expose to management
> > > tools that the VM contains a device that does not support the
> > > pre-copy phase so that downtime expectations can be adjusted.  
> > 
> > I don't expect this to be a real use case though..
> > 
> > Remember, you asked for this patch when you wanted qemu to have good
> > behavior when kernel support for legacy dirty tracking is removed
> > before we merge v2 support.
> 
> Is wanting good behavior a controversial point?  Did we define this as
> the desired good behavior?  Ref?  

As I said, this was intended as a NOP, which is what I thought we
agreed to. Missing the SLA checking that existed before seems like
something to fix in this patch. This is the discussion thread:

https://lore.kernel.org/kvm/20220324231159.GA11336@nvidia.com/

 "I guess I was assuming that enabling v2 migration in QEMU was dependent
  on the existing type1 dirty tracking because it's the only means we
  have to tell QEMU that all memory is perpetually dirty when we have a
  DMA device.  Is that not correct?"

The only point was to prepare qemu for kernels that don't support the
legacy dirty tracking API but do have a v2 migration interface. IIRC
something undesired happens if you do that right now.

We could also just dirty all memory in qemu and keep it exactly the
same so every SLA detail works. Or completely block pre-copy based
flows.

It is not intended to be a useful configuration; this is just covering
off backwards compat issues - so I'm not keen to do a bunch of
management work to support it.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-17 17:39               ` Jason Gunthorpe
@ 2022-05-18  3:46                 ` Alex Williamson
  2022-05-18 17:01                   ` Jason Gunthorpe
  0 siblings, 1 reply; 43+ messages in thread
From: Alex Williamson @ 2022-05-18  3:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Juan Quintela, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Tue, 17 May 2022 14:39:37 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 17, 2022 at 11:22:32AM -0600, Alex Williamson wrote:
> 
> > > > It seems like a better solution would be to expose to management
> > > > tools that the VM contains a device that does not support the
> > > > pre-copy phase so that downtime expectations can be adjusted.    
> > > 
> > > I don't expect this to be a real use case though..
> > > 
> > > Remember, you asked for this patch when you wanted qemu to have good
> > > behavior when kernel support for legacy dirty tracking is removed
> > > before we merge v2 support.  
> > 
> > Is wanting good behavior a controversial point?  Did we define this as
> > the desired good behavior?  Ref?    
> 
> As I said, this was intended as a NOP, which is what I thought we
> agreed to. Missing the SLA checking that existed before seems like
> something to fix in this patch.

But even if we have the SLA check, how does a management tool know that
pre-copy will be skipped and that the downtime needs to account for
that?  The current solution is obviously non-optimal; it was mainly
meant for backwards compatibility, but this seems like a fail-faster
solution, with less useless work, but also providing less indication of
how to configure the migration to succeed.

> This is the discussion thread:
> 
> https://lore.kernel.org/kvm/20220324231159.GA11336@nvidia.com/
> 
>  "I guess I was assuming that enabling v2 migration in QEMU was dependent
>   on the existing type1 dirty tracking because it's the only means we
>   have to tell QEMU that all memory is perpetually dirty when we have a
>   DMA device.  Is that not correct?"

That doesn't sound like me asking for this patch so much as trying to
figure out how to support migration without DMA dirty tracking, and
after the above comment the concern was whether we lock ourselves into a
dirty tracking requirement in the iommufd type1 compatibility interface
if we rely on this type1 feature.  You had spit-balled that QEMU could skip
pre-copy, but this all needs to be vetted with migration folks and
management tools.

> The only point was to prepare qemu for kernels that don't support the
> legacy dirty tracking API but do have a v2 migration interface. IIRC
> something undesired happens if you do that right now.

Per this patch, the container requires dirty tracking or we add a
blocker and can't migrate the device.
 
> We could also just dirty all memory in qemu and keep it exactly the
> same so every SLA detail works. Or completely block pre-copy based
> flows.
> 
> It is not intended to be a useful configuration; this is just covering
> off backwards compat issues - so I'm not keen to do a bunch of
> management work to support it.

Are we then deemphasizing the vfio compatibility support in iommufd?
If we really don't want to put effort towards backwards compatibility,
should migration support only be enabled with native iommufd support?
Thanks,

Alex



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/9] vfio/migration: Fix NULL pointer dereference bug
  2022-05-17 12:28     ` Avihai Horon
@ 2022-05-18 11:36       ` Juan Quintela
  0 siblings, 0 replies; 43+ messages in thread
From: Juan Quintela @ 2022-05-18 11:36 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

Avihai Horon <avihaih@nvidia.com> wrote:
> On 5/16/2022 2:15 PM, Juan Quintela wrote:
>>
>> Avihai Horon <avihaih@nvidia.com> wrote:
>>> As part of its error flow, vfio_vmstate_change() accesses
>>> MigrationState->to_dst_file without any checks. This can cause a NULL
>>> pointer dereference if the error flow is taken and
>>> MigrationState->to_dst_file is not set.
>>>
>>> For example, this can happen if VM is started or stopped not during
>>> migration and vfio_vmstate_change() error flow is taken, as
>>> MigrationState->to_dst_file is not set at that time.
>>>
>>> Fix it by checking that MigrationState->to_dst_file is set before using
>>> it.
>>>
>>> Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM")
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>   hw/vfio/migration.c | 4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 835608cd23..21e8f9d4d4 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -744,7 +744,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>>            */
>>>           error_report("%s: Failed to set device state 0x%x", vbasedev->name,
>>>                        (migration->device_state & mask) | value);
>>> -        qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
>>> +        if (migrate_get_current()->to_dst_file) {
>>> +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
>>> +        }
>>>       }
>> Reviewed-by: Juan Quintela <quintela@redhat.com>
>>
>> I mean that the change is right.
>>
>> But I am not so sure about using qemu_file_* operations outside of the
>> migration_thread.  I *think* that everything else that uses qemu_file_*
>> functions is operating inside the migration_thread, and this one doesn't
>> look like it.  Furthermore, from a quick look at the qemu source, I can't
>> see any place where we set up an error in a vmstate_change.
>
> vfio_vmstate_change() will operate inside migration_thread if
> migration_thread is the one that called vm start/stop.
>
> In cases where vm start/stop was not called by migration_thread, it
> will operate outside of migration_thread. But I think in such cases 
> to_dst_file should not be set.
>
> Ideally we should have returned an error, but vm_state_notify() doesn't
> report errors.
>
> Maybe later we can change vm_state_notify() to support error
> reporting, or instead of using to_dst_file to indicate an error,
> update some flag in VFIOMigration.


I think that sounds like a better option.

There are only two callers of vm_state_notify():

do_vm_stop()

and

vm_prepare_start()

Both already support returning an error.
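
A rough shape of that change (a sketch only; the *WithRet names are
made up, and the real notifier walk is priority-ordered and depends on
the running direction, which is glossed over here):

    /* Handlers gain a return value... */
    typedef int VMChangeStateHandlerWithRet(void *opaque, bool running,
                                            RunState state);

    /* ...and vm_state_notify() propagates the first failure, which
     * do_vm_stop()/vm_prepare_start() can then return to their callers. */
    int vm_state_notify(bool running, RunState state)
    {
        VMChangeStateEntry *e, *next;
        int ret = 0;

        QTAILQ_FOREACH_SAFE(e, &vm_change_state_head, entries, next) {
            int r = e->cb_with_ret(e->opaque, running, state);
            if (r && !ret) {
                ret = r; /* remember the first error */
            }
        }
        return ret;
    }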

Thanks, Juan.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-17 16:08           ` Jason Gunthorpe
  2022-05-17 17:22             ` Alex Williamson
@ 2022-05-18 11:39             ` Juan Quintela
  2022-05-18 15:50               ` Jason Gunthorpe
  1 sibling, 1 reply; 43+ messages in thread
From: Juan Quintela @ 2022-05-18 11:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:
>
>> > This is really intended to be a NOP from where things are now: if
>> > you use mlx5 live migration without a patch like this, it causes a
>> > botched pre-copy since everything just ends up permanently dirty.
>> > 
>> > If it makes more sense we could abort the pre-copy too - in the end
>> > there will be dirty tracking so I don't know if I'd invest in a big
>> > adventure to fully define non-dirty tracking migration.
>> 
>> How is pre-copy currently "botched" without a patch like this?  If it's
>> simply that the pre-copy doesn't converge and the downtime constraints
>> don't allow the VM to enter stop-and-copy, that's the expected behavior
>> AIUI, and supports backwards compatibility with existing SLAs.
>
> It means it always fails - that certainly isn't working live
> migration. There is no point in trying to converge something that we
> already know will never converge.

I fully agree with you here.

But not with how this is being done.  I think we need a way to convince
the migration code that it shouldn't even try to migrate RAM.  That would
fix the current use case, and your use case.

>> I'm assuming that by setting this new skip_precopy flag we're
>> forcing the VM to move to stop-and-copy, regardless of any other SLA
>> constraints placed on the migration.  
>
> That does seem like a defect in this patch; any SLA constraints should
> still all be checked under the assumption all ram is dirty.

And how are we going to:
- detect the network link speed
- be sure that we are inside the downtime limit

I think that it is not possible, so basically we are skipping the precopy
stage and praying that the other bits are going to be ok.

>> It seems like a better solution would be to expose to management
>> tools that the VM contains a device that does not support the
>> pre-copy phase so that downtime expectations can be adjusted.
>
> I don't expect this to be a real use case though.
>
> Remember, you asked for this patch when you wanted qemu to have good
> behavior when kernel support for legacy dirty tracking is removed
> before we merge v2 support.

I am ignorant on the subject.  Can I ask how dirty memory is
tracked in this v2?

Thanks, Juan.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd()
  2022-05-17 12:36     ` Avihai Horon
@ 2022-05-18 11:54       ` Juan Quintela
  2022-05-18 15:42         ` Jason Gunthorpe
  0 siblings, 1 reply; 43+ messages in thread
From: Juan Quintela @ 2022-05-18 11:54 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S . Tsirkin, Cornelia Huck, Paolo Bonzini,
	Alex Williamson, Dr . David Alan Gilbert, Yishai Hadas,
	Jason Gunthorpe, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

Avihai Horon <avihaih@nvidia.com> wrote:
> On 5/16/2022 2:31 PM, Juan Quintela wrote:
>> Avihai Horon <avihaih@nvidia.com> wrote:
>>> Add new function qemu_file_get_to_fd() that allows reading data from
>>> QEMUFile and writing it straight into a given fd.
>>>
>>> This will be used later in VFIO migration code.
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>   migration/qemu-file.c | 34 ++++++++++++++++++++++++++++++++++
>>>   migration/qemu-file.h |  1 +
>>>   2 files changed, 35 insertions(+)
>>>
>>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>>> index 1479cddad9..cad3d32eb3 100644
>>> --- a/migration/qemu-file.c
>>> +++ b/migration/qemu-file.c
>>> @@ -867,3 +867,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
>>>   {
>>>       return file->has_ioc ? QIO_CHANNEL(file->opaque) : NULL;
>>>   }
>>> +
>>> +/*
>>> + * Read size bytes from QEMUFile f and write them to fd.
>>> + */
>>> +int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
>>> +{
>>> +    while (size) {
>>> +        size_t pending = f->buf_size - f->buf_index;
>>> +        ssize_t rc;
>>> +
>>> +        if (!pending) {
>>> +            rc = qemu_fill_buffer(f);
>>> +            if (rc < 0) {
>>> +                return rc;
>>> +            }
>>> +            if (rc == 0) {
>>> +                return -1;
>>> +            }
>>> +            continue;
>>> +        }
>>> +
>>> +        rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
>>> +        if (rc < 0) {
>>> +            return rc;
>>> +        }
>>> +        if (rc == 0) {
>>> +            return -1;
>>> +        }
>>> +        f->buf_index += rc;
>>> +        size -= rc;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>> Is there really a performance difference to just using:
>>
>> uint8_t buffer[size];
>>
>> qemu_get_buffer(f, buffer, size);
>> write(fd, buffer, size);
>>
>> Or to put it another way, what sizes are we talking about here?
>
> It depends on the device, but it can range from a few MBs to several
> GBs AFAIK.

a few MB is ok.

Several GB on the main migration channel without a single
header/whatever?

This sounds like a recipe for disaster IMHO.
Remember that on the source side, we have a migration thread.  This patch
will just block that migration thread.

If we are using 10 Gigabit networking -- 1GB/second for friends, to make
the math easy -- each GB will take 1 second of downtime.

During that downtime, migration control is basically handicapped.
And you are sending this data with qemu_put_buffer_async().  This
function just adds its buffer to an iovec, and its only other user adds
4KB (or whatever your page size is) at a time to the iovec.  You are
talking about adding gigabytes here.  I don't know how well this is
going to work, but my guess is that migrate_cancel is not going to be
happy.

On the destination side, this is even worse, because we receive this
multigigabyte chunk in a coroutine, and I am not sure that it won't
completely block all of qemu while this is happening (again, multi-second
times).

I have to think more about this problem, but I can't see how this is
going to be able to go through the main migration channel.

Later, Juan.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd()
  2022-05-18 11:54       ` Juan Quintela
@ 2022-05-18 15:42         ` Jason Gunthorpe
  2022-05-18 16:00           ` Daniel P. Berrangé
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Gunthorpe @ 2022-05-18 15:42 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Avihai Horon, qemu-devel, Michael S . Tsirkin, Cornelia Huck,
	Paolo Bonzini, Alex Williamson, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Wed, May 18, 2022 at 01:54:34PM +0200, Juan Quintela wrote:

> >> Is there really a performance difference to just using:
> >>
> >> uint8_t buffer[size];
> >>
> >> qemu_get_buffer(f, buffer, size);
> >> write(fd, buffer, size);
> >>
> >> Or to put it another way, what sizes are we talking about here?
> >
> > It depends on the device, but it can range from a few MBs to several
> > GBs AFAIK.
> 
> a few MB is ok.
> 
> Several GB on the main migration channel without a single
> header/whatever?

IIRC it iterates in multi-megabyte chunks, each of which gets a header.

The chunking size is set by the size of the buffer mmap

The overall point is that memcpying GB's is going to be taxing so we
want to eliminate copies on this path, especially copies that result
in more system calls.

We are expecting to look into further optimization down the road here
because even this is still too slow.

Jason


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-18 11:39             ` Juan Quintela
@ 2022-05-18 15:50               ` Jason Gunthorpe
  2022-05-24 15:11                 ` Avihai Horon
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Gunthorpe @ 2022-05-18 15:50 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Alex Williamson, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Wed, May 18, 2022 at 01:39:31PM +0200, Juan Quintela wrote:

> > That does seem like a defect in this patch; any SLA constraints should
> > still all be checked under the assumption all ram is dirty.
> 
> And how are we going to:
> - detect the network link speed
> - be sure that we are inside the downtime limit
> 
> I think that it is not possible, so basically we are skipping the precopy
> stage and praying that the other bits are going to be ok.

Like I keep saying, this is not a real use case; we expect dirty
tracking to be available in any real configuration. This is just
trying to make qemu work in some reasonable way if dirty tracking is
not available but a VFIO migration device is plugged in.

Just pick something simple that makes sense.  Like: if any SLA is set,
then just refuse to even start; if no SLA, then go directly to
STOP_COPY.
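
In vfio_migration_probe() terms that could look roughly like the
following -- an illustrative sketch only; the SLA test and the exact
error handling are assumptions, not code from this series:

    if (!container->dirty_pages_supported) {
        MigrationState *ms = migrate_get_current();

        /* A downtime SLA is configured: refuse to even start. */
        if (ms->parameters.downtime_limit) {
            error_setg(errp,
                       "%s: no dirty page tracking, refusing migration "
                       "while a downtime limit is set", vbasedev->name);
            goto add_blocker;
        }
        /* No SLA: skip pre-copy and go directly to STOP_COPY. */
    }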

> >> It seems like a better solution would be to expose to management
> >> tools that the VM contains a device that does not support the
> >> pre-copy phase so that downtime expectations can be adjusted.
> >
> > I don't expect this to be a real use case though.
> >
> > Remember, you asked for this patch when you wanted qemu to have good
> > behavior when kernel support for legacy dirty tracking is removed
> > before we merge v2 support.
> 
> I am ignorant on the subject.  Can I ask how dirty memory is
> tracked in this v2?

These two RFCs are the current proposal to enable it for the system
iommu:

https://lore.kernel.org/kvm/20220428210933.3583-1-joao.m.martins@oracle.com

And for device internal trackers:

https://lore.kernel.org/kvm/20220501123301.127279-1-yishaih@nvidia.com/ 

Regards,
Jason


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd()
  2022-05-18 15:42         ` Jason Gunthorpe
@ 2022-05-18 16:00           ` Daniel P. Berrangé
  2022-05-18 16:16             ` Jason Gunthorpe
  0 siblings, 1 reply; 43+ messages in thread
From: Daniel P. Berrangé @ 2022-05-18 16:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Juan Quintela, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Alex Williamson,
	Dr . David Alan Gilbert, Yishai Hadas, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

On Wed, May 18, 2022 at 12:42:37PM -0300, Jason Gunthorpe wrote:
> On Wed, May 18, 2022 at 01:54:34PM +0200, Juan Quintela wrote:
> 
> > >> Is there really a performance difference to just using:
> > >>
> > >> uint8_t buffer[size];
> > >>
> > >> qemu_get_buffer(f, buffer, size);
> > >> write(fd, buffer, size);
> > >>
> > >> Or to put it another way, what sizes are we talking about here?
> > >
> > > It depends on the device, but it can range from a few MBs to several
> > > GBs AFAIK.
> > 
> > a few MB is ok.
> > 
> > Several GB on the main migration channel without a single
> > header/whatever?
> 
> IIRC it iterates in multi-megabyte chunks, each of which gets a header.
> 
> The chunking size is set by the size of the buffer mmap
> 
> The overall point is that memcpying GB's is going to be taxing so we
> want to eliminate copies on this path, especially copies that result
> in more system calls.
> 
> We are expecting to look into further optimization down the road here
> because even this is still too slow.

Considering the possibility of future optimization, IMHO the QEMUFile
level is too high for adding this kind of API. We'd be better off
pushing the impl down into the QIOChannel API level.

   int64_t qio_channel_copy_range(QIOChannel *srcioc,
                                  QIOChannel *tgtioc,
                                  size_t len);

The QIOChannel impl can do pretty much what you showed in the general
case, but in special cases it could have the option to offload to the
kernel copy_file_range() syscall to avoid the context switches.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd()
  2022-05-18 16:00           ` Daniel P. Berrangé
@ 2022-05-18 16:16             ` Jason Gunthorpe
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Gunthorpe @ 2022-05-18 16:16 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Juan Quintela, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Alex Williamson,
	Dr . David Alan Gilbert, Yishai Hadas, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

On Wed, May 18, 2022 at 05:00:26PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 18, 2022 at 12:42:37PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 18, 2022 at 01:54:34PM +0200, Juan Quintela wrote:
> > 
> > > >> Is there really a performance difference to just using:
> > > >>
> > > >> uint8_t buffer[size];
> > > >>
> > > >> qemu_get_buffer(f, buffer, size);
> > > >> write(fd, buffer, size);
> > > >>
> > > >> Or to put it another way, what sizes are we talking about here?
> > > >
> > > > It depends on the device, but it can range from a few MBs to several
> > > > GBs AFAIK.
> > > 
> > > a few MB is ok.
> > > 
> > > Several GB on the main migration channel without a single
> > > header/whatever?
> > 
> > IIRC it iterates in multi-megabyte chunks, each of which gets a header.
> > 
> > The chunking size is set by the size of the buffer mmap
> > 
> > The overall point is that memcpying GB's is going to be taxing so we
> > want to eliminate copies on this path, especially copies that result
> > in more system calls.
> > 
> > We are expecting to look into further optimization down the road here
> > because even this is still too slow.
> 
> Considering the possibility of future optimization, IMHO the QEMUFile
> level is too high for adding this kind of API. We'd be better off
> pushing the impl down into the QIOChannel API level.
> 
>    int64_t qio_channel_copy_range(QIOChannel *srcioc,
>                                   QIOChannel *tgtioc,
>                                   size_t len);
> 
> The QIOChannel impl can do pretty much what you showed in the general
> case, but in special cases it could have the option to offload to the
> kernel copy_file_range() syscall to avoid the context switches.

This is probably something to do down the road when we figure out
exactly what is best.

Currently we don't have kernel support for optimized copy_file_range()
(i.e. fops splice_read) inside the VFIO drivers, so copy_file_range()
will just fail.

I didn't look closely again, but IIRC the challenge is that the
QIOChannel doesn't have a large temporary buffer ready to use for a
non-splice fallback path.
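
For the general case, a buffered fallback along these lines would
presumably do -- a sketch only, since qio_channel_copy_range() does not
exist today and the chunk size here is arbitrary (only
qio_channel_read() and qio_channel_write_all() are existing calls):

    static int64_t qio_channel_copy_fallback(QIOChannel *srcioc,
                                             QIOChannel *tgtioc,
                                             size_t len, Error **errp)
    {
        char buf[64 * 1024];   /* arbitrary chunk size */
        int64_t total = 0;

        while (len) {
            ssize_t nread = qio_channel_read(srcioc, buf,
                                             MIN(len, sizeof(buf)), errp);
            if (nread < 0) {
                return -1;
            }
            if (nread == 0) {
                /* EOF before len bytes were copied */
                break;
            }
            if (qio_channel_write_all(tgtioc, buf, nread, errp) < 0) {
                return -1;
            }
            total += nread;
            len -= nread;
        }
        return total;
    }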

Jason


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-18  3:46                 ` Alex Williamson
@ 2022-05-18 17:01                   ` Jason Gunthorpe
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Gunthorpe @ 2022-05-18 17:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Juan Quintela, Avihai Horon, qemu-devel, Michael S . Tsirkin,
	Cornelia Huck, Paolo Bonzini, Dr . David Alan Gilbert,
	Yishai Hadas, Mark Bloch, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Tue, May 17, 2022 at 09:46:56PM -0600, Alex Williamson wrote:

> The current solution is obviously non-optimal; it was mainly
> meant for backwards compatibility.  This seems like a fail-faster
> solution, with less useless work, but it also provides less indication
> of how to configure the migration to succeed.

I don't think this is a production configuration. We should not be
expecting that the user is going to carefully fine-tune some SLA
parameter. It may be that the "SLA check" we are missing is simply
whether an SLA exists or not.

> > It is not intended to be a useful configuration, this is just covering
> > off backwards compat issues - so I'm not keen to do a bunch of
> > management work to support it.
> 
> Are we then deemphasizing the vfio compatibility support in iommufd?

I'm viewing iommufd compatibility with VFIO type 1 as only extending
to implemented features. We are deleting NESTED, for instance. As
there is no in-kernel implementation of the type 1 tracker, I would
expect to eventually delete it as well once we have consensus that this
is what we plan to do.

We discussed this in Joao's series and that was the consensus.

> If we really don't want to put effort towards backwards compatibility,
> should migration support only be enabled with native iommufd
> support?

I'm expecting that the dirty tracking will require native iommufd only
for the system iommu tracker, but the mlx5 on-device tracker will be
fine with either iommu back end.

It is still useful currently for testing the VFIO device part as it
will be some time until the other parts are ready, so I'd rather not
block it completely in qemu.

The goal is for qemu to do something sensible, so that when we delete
kernel support for type 1 dirty tracking there is no back-compat concern
for existing non-experimental qemu.

Jason


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-12 15:43 ` [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported Avihai Horon
  2022-05-16 11:22   ` Juan Quintela
@ 2022-05-20 10:58   ` Joao Martins
  2022-05-23  6:11     ` Avihai Horon
  1 sibling, 1 reply; 43+ messages in thread
From: Joao Martins @ 2022-05-20 10:58 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Yishai Hadas, qemu-devel, Jason Gunthorpe, Juan Quintela,
	Mark Bloch, Maor Gottlieb, Dr . David Alan Gilbert,
	Alex Williamson, Cornelia Huck, Kirti Wankhede, Tarun Gupta,
	Michael S . Tsirkin, Paolo Bonzini

On 5/12/22 16:43, Avihai Horon wrote:
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 21e8f9d4d4..d4b6653026 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>      struct vfio_region_info *info = NULL;
>      int ret = -ENOTSUP;
>  
> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration) {
>          goto add_blocker;
>      }
>  
> +    if (!container->dirty_pages_supported) {
> +        warn_report(
> +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
> +            vbasedev->name);

Maybe warn_report_once(...) given that the following N devices would observe the
same thing in theory.

> +        migrate_get_current()->skip_precopy = true;
> +    }
> +

Perhaps it might be easier to reuse the existing knob to disable pre-copy
per device that Kirti added some time ago, rather than changing the migration core just
yet (e.g. you might want to bail out of the migration preemptively because the CPU is
dirtying too many pages):

if (!container->dirty_pages_supported) {
    warn_report_once(...)
    pre_copy_dirty_page_tracking = ON_OFF_AUTO_OFF;
}

You might need to tackle the errors you will get when the dirty_pages start/stop ioctls
fail. The errors are just because log_start() and log_stop() simply fail when the
ioctl doesn't exist, but everything else is fine -- at least that's what I observed.
It should be noted that this is a problem with the existing
`-device vfio-pci host=XX:YY.ZZ,x-pre-copy-dirty-page-tracking=true` regardless of this patch:

void vfio_listener_log_global_start()
{
    if (vfio_devices_all_dirty_tracking(container)) {
        vfio_set_dirty_page_tracking(container, true);
    }
}

... And same for vfio_listener_log_global_stop() -- maybe a worthwhile predecessor patch.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-20 10:58   ` Joao Martins
@ 2022-05-23  6:11     ` Avihai Horon
  2022-05-23  9:45       ` Joao Martins
  0 siblings, 1 reply; 43+ messages in thread
From: Avihai Horon @ 2022-05-23  6:11 UTC (permalink / raw)
  To: Joao Martins
  Cc: Yishai Hadas, qemu-devel, Jason Gunthorpe, Juan Quintela,
	Mark Bloch, Maor Gottlieb, Dr . David Alan Gilbert,
	Alex Williamson, Cornelia Huck, Kirti Wankhede, Tarun Gupta,
	Michael S . Tsirkin, Paolo Bonzini


On 5/20/2022 1:58 PM, Joao Martins wrote:
> On 5/12/22 16:43, Avihai Horon wrote:
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 21e8f9d4d4..d4b6653026 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -863,10 +863,17 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>>       struct vfio_region_info *info = NULL;
>>       int ret = -ENOTSUP;
>>
>> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
>> +    if (!vbasedev->enable_migration) {
>>           goto add_blocker;
>>       }
>>
>> +    if (!container->dirty_pages_supported) {
>> +        warn_report(
>> +            "%s: IOMMU of the device's VFIO container doesn't support dirty page tracking, migration pre-copy phase will be skipped",
>> +            vbasedev->name);
> Maybe warn_report_once(...) given that the following N devices would observe the
> same thing in theory.

Yes, you are right. Will change.

>> +        migrate_get_current()->skip_precopy = true;
>> +    }
>> +
> Perhaps it might be easier to reuse the existing knob to disable pre-copy
> per device that Kirti added some time ago, rather than changing the migration core just
> yet (e.g. you might want to bail out of the migration preemptively because the CPU is
> dirtying too many pages):
>
> if (!container->dirty_pages_supported) {
>      warn_report_once(...)
>      pre_copy_dirty_page_tracking = ON_OFF_AUTO_OFF;
> }

But this doesn't prevent the VFIO device from dirtying RAM pages during
the pre-copy phase.
The VFIO device can dirty RAM pages during pre-copy, and it won't have a
way to report the dirtied pages to QEMU, so migration will be faulty.

Thanks.

>
> You might need to tackle the errors you will get when the dirty_pages start/stop ioctls
> fail. The errors are just because log_start() and log_stop() simply fail when the
> ioctl doesn't exist, but everything else is fine -- at least that's what I observed.
> It should be noted that this is a problem with the existing
> `-device vfio-pci host=XX:YY.ZZ,x-pre-copy-dirty-page-tracking=true` regardless of this patch:
>
> void vfio_listener_log_global_start()
> {
>     if (vfio_devices_all_dirty_tracking(container)) {
>         vfio_set_dirty_page_tracking(container, true);
>     }
> }
>
> ... And same for vfio_listener_log_global_stop() -- maybe a worthwhile predecessor patch.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-23  6:11     ` Avihai Horon
@ 2022-05-23  9:45       ` Joao Martins
  0 siblings, 0 replies; 43+ messages in thread
From: Joao Martins @ 2022-05-23  9:45 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Yishai Hadas, qemu-devel, Jason Gunthorpe, Juan Quintela,
	Mark Bloch, Maor Gottlieb, Dr . David Alan Gilbert,
	Alex Williamson, Cornelia Huck, Kirti Wankhede, Tarun Gupta,
	Michael S . Tsirkin, Paolo Bonzini

On 5/23/22 07:11, Avihai Horon wrote:
> On 5/20/2022 1:58 PM, Joao Martins wrote:
>>> +        migrate_get_current()->skip_precopy = true;
>>> +    }
>>> +
>> Perhaps it might be easier to reuse the existing knob to disable pre-copy
>> per device that Kirti added some time ago, rather than changing migration core just
>> yet (e.g. you might wanna bail of the migration preemptively because the CPU is dirtying
>> too many pages for example?):
>>
>> if (!container->dirty_pages_supported) {
>>      warn_report_once(...)
>>      pre_copy_dirty_page_tracking = ON_OFF_AUTO_OFF;
>> }
> 
> But this doesn't prevent the VFIO device from dirtying RAM pages during
> the pre-copy phase.
> The VFIO device can dirty RAM pages during pre-copy, and it won't have a
> way to report the dirtied pages to QEMU, so migration will be faulty.
> 

You're quite right, sorry for the noise. I was a little too obsessed with
reusing the existing field and forgot that letting the iterate stage run
means the PCI device could also be driven into dirtying RAM.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/9] vfio/migration: Implement VFIO migration protocol v2
  2022-05-12 15:43 ` [PATCH 6/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
@ 2022-05-23 18:14   ` Joao Martins
  2022-05-24 15:35     ` Avihai Horon
  0 siblings, 1 reply; 43+ messages in thread
From: Joao Martins @ 2022-05-23 18:14 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Cornelia Huck, Michael S . Tsirkin,
	Alex Williamson, Paolo Bonzini, Dr . David Alan Gilbert,
	Juan Quintela

On 5/12/22 16:43, Avihai Horon wrote:
> Replace the current VFIO migration protocol v1 implementation with a new
> implementation corresponding to VFIO migration protocol v2.
> 
> The main changes are:
> - VFIO device state is now represented as a finite state machine instead
>   of a bitmap.
> 
> - Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
>   ioctl and normal read() and write() instead of the migration region.
> 
> - As VFIO migration protocol v2 currently doesn't support the pre-copy
>   phase of migration, .save_live_pending and .save_live_iterate handlers
>   plus pre-copy relevant code are removed.
> 
> Detailed information about VFIO migration protocol v2 and difference
> compared to v1 can be found here [1].
> 
> [1]
> https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/common.c              |  21 +-
>  hw/vfio/migration.c           | 628 +++++++---------------------------
>  hw/vfio/trace-events          |   9 +-
>  include/hw/vfio/vfio-common.h |   8 +-
>  4 files changed, 153 insertions(+), 513 deletions(-)
> 
This looks like a fairly big patch, though more than 70% of it is removing
code. Perhaps you could split it into adding v2 and removing v1 afterwards, rather
than a single replacement patch? It's just a suggestion anyhow, to hopefully ease
analysis of the new additions. The removal muddles things a tiny bit.

I don't want to throw you into potentially unnecessary work should the maintainers
disagree, so here's an attempt:

	https://github.com/jpemartins/qemu/commits/for-avihai

If you apply your series on top of the same base commit (78ac2eebbab9) you should be able
to compare both branches.

I haven't yet found any particular flaws in your new logic (but will reply if I
find any).


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
  2022-05-18 15:50               ` Jason Gunthorpe
@ 2022-05-24 15:11                 ` Avihai Horon
  0 siblings, 0 replies; 43+ messages in thread
From: Avihai Horon @ 2022-05-24 15:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Juan Quintela
  Cc: Alex Williamson, qemu-devel, Michael S . Tsirkin, Cornelia Huck,
	Paolo Bonzini, Dr . David Alan Gilbert, Yishai Hadas, Mark Bloch,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta


On 5/18/2022 6:50 PM, Jason Gunthorpe wrote:
> On Wed, May 18, 2022 at 01:39:31PM +0200, Juan Quintela wrote:
>
>>> That does seem like a defect in this patch; any SLA constraints should
>>> still all be checked under the assumption all ram is dirty.
>> And how are we going to:
>> - detect the network link speed
>> - be sure that we are inside the downtime limit
>>
>> I think that it is not possible, so basically we are skipping the precopy
>> stage and praying that the other bits are going to be ok.
> Like I keep saying, this is not a real use case; we expect dirty
> tracking to be available in any real configuration. This is just
> trying to make qemu work in some reasonable way if dirty tracking is
> not available but a VFIO migration device is plugged in.
>
> Just pick something simple that makes sense.  Like: if any SLA is set,
> then just refuse to even start; if no SLA, then go directly to
> STOP_COPY.

I tried to follow Jason's suggestion: block migration if any SLA is
set, and allow migration only if there is no SLA at all.

It doesn't seem like there is a way to say "no SLA at all".

Migration param downtime_limit takes values between 0 and 2000 seconds.

What does downtime_limit=0 mean?

If it's 0, then MigrationState->threshold_size = 0, so ram_save_pending()
will never call migration_bitmap_sync_precopy(). Only upon entering the
stop-copy phase will migration_bitmap_sync_precopy() be called, and it
will then collect all the pages that were dirtied during pre-copy, which
could be a lot and would impact downtime.
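
In sketch form, the check in question is roughly the following (a
paraphrase, not verbatim QEMU code -- the real
migration_bitmap_sync_precopy() takes a RAMState argument):

    /* threshold_size is derived from bandwidth * downtime_limit, so
     * downtime_limit == 0 gives threshold_size == 0 and the sync below
     * can never run before stop-and-copy. */
    if (remaining_size < threshold_size) {
        migration_bitmap_sync_precopy(rs);
        remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
    }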

In libvirt it seems that downtime_limit = 0 is not valid and can't be set.

Am I missing something here? Is there a way to allow unlimited downtime?


Thanks.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/9] vfio/migration: Implement VFIO migration protocol v2
  2022-05-23 18:14   ` Joao Martins
@ 2022-05-24 15:35     ` Avihai Horon
  0 siblings, 0 replies; 43+ messages in thread
From: Avihai Horon @ 2022-05-24 15:35 UTC (permalink / raw)
  To: Joao Martins, qemu-devel
  Cc: Yishai Hadas, Jason Gunthorpe, Mark Bloch, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Cornelia Huck, Michael S . Tsirkin,
	Alex Williamson, Paolo Bonzini, Dr . David Alan Gilbert,
	Juan Quintela


On 5/23/2022 9:14 PM, Joao Martins wrote:
> On 5/12/22 16:43, Avihai Horon wrote:
>> Replace the current VFIO migration protocol v1 implementation with a new
>> implementation corresponding to VFIO migration protocol v2.
>>
>> The main changes are:
>> - VFIO device state is now represented as a finite state machine instead
>>    of a bitmap.
>>
>> - Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
>>    ioctl and normal read() and write() instead of the migration region.
>>
>> - As VFIO migration protocol v2 currently doesn't support the pre-copy
>>    phase of migration, .save_live_pending and .save_live_iterate handlers
>>    plus pre-copy relevant code are removed.
>>
>> Detailed information about VFIO migration protocol v2 and difference
>> compared to v1 can be found here [1].
>>
>> [1]
>> https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/common.c              |  21 +-
>>   hw/vfio/migration.c           | 628 +++++++---------------------------
>>   hw/vfio/trace-events          |   9 +-
>>   include/hw/vfio/vfio-common.h |   8 +-
>>   4 files changed, 153 insertions(+), 513 deletions(-)
>>
> This looks like a fairly big patch, though more than 70% of it is removing
> code. Perhaps you could split it into adding v2 and removing v1 afterwards, rather
> than a single replacement patch? It's just a suggestion anyhow, to hopefully ease
> analysis of the new additions. The removal muddles things a tiny bit.
>
> I don't want to throw you into potentially unnecessary work should the maintainers
> disagree, so here's an attempt:
>
>          https://github.com/jpemartins/qemu/commits/for-avihai
>
Ah, nice idea, thanks! I was afraid this patch would be too cluttered.

Alex, do you want me to split this patch in a similar manner to what Joao
suggested? And maybe add v1 prefixes to the existing functions instead of
v2 prefixes to the new functions?

> If you apply your series on top of the same base commit (78ac2eebbab9) you should be able
> to compare both branches.
>
> I haven't yet found any particular flaws in your new logic (but will reply if I
> find any).

Sure, thanks!



^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2022-05-24 16:17 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-12 15:43 [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2022-05-12 15:43 ` [PATCH 1/9] linux-headers: Update headers to v5.18-rc6 Avihai Horon
2022-05-12 15:43 ` [PATCH 2/9] vfio: Fix compilation errors caused by VFIO migration v1 deprecation Avihai Horon
2022-05-12 17:57   ` Alex Williamson
2022-05-12 18:25     ` Jason Gunthorpe
2022-05-12 21:11       ` Alex Williamson
2022-05-12 23:32         ` Jason Gunthorpe
2022-05-13  7:08     ` Cornelia Huck
2022-05-12 15:43 ` [PATCH 3/9] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
2022-05-16 11:15   ` Juan Quintela
2022-05-17 12:28     ` Avihai Horon
2022-05-18 11:36       ` Juan Quintela
2022-05-12 15:43 ` [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported Avihai Horon
2022-05-16 11:22   ` Juan Quintela
2022-05-16 20:22     ` Alex Williamson
2022-05-16 23:08       ` Jason Gunthorpe
2022-05-17 16:00         ` Alex Williamson
2022-05-17 16:08           ` Jason Gunthorpe
2022-05-17 17:22             ` Alex Williamson
2022-05-17 17:39               ` Jason Gunthorpe
2022-05-18  3:46                 ` Alex Williamson
2022-05-18 17:01                   ` Jason Gunthorpe
2022-05-18 11:39             ` Juan Quintela
2022-05-18 15:50               ` Jason Gunthorpe
2022-05-24 15:11                 ` Avihai Horon
2022-05-20 10:58   ` Joao Martins
2022-05-23  6:11     ` Avihai Horon
2022-05-23  9:45       ` Joao Martins
2022-05-12 15:43 ` [PATCH 5/9] migration/qemu-file: Add qemu_file_get_to_fd() Avihai Horon
2022-05-16 11:31   ` Juan Quintela
2022-05-17 12:36     ` Avihai Horon
2022-05-18 11:54       ` Juan Quintela
2022-05-18 15:42         ` Jason Gunthorpe
2022-05-18 16:00           ` Daniel P. Berrangé
2022-05-18 16:16             ` Jason Gunthorpe
2022-05-12 15:43 ` [PATCH 6/9] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2022-05-23 18:14   ` Joao Martins
2022-05-24 15:35     ` Avihai Horon
2022-05-12 15:43 ` [PATCH 7/9] vfio/migration: Reset device if setting recover state fails Avihai Horon
2022-05-12 15:43 ` [PATCH 8/9] vfio: Alphabetize migration section of VFIO trace-events file Avihai Horon
2022-05-12 15:43 ` [PATCH 9/9] docs/devel: Align vfio-migration docs to VFIO migration v2 Avihai Horon
2022-05-12 18:02 ` [PATCH 0/9] vfio/migration: Implement VFIO migration protocol v2 Alex Williamson
2022-05-13  7:04   ` Cornelia Huck
