[RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-11-24 13:35 ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patchset is to propose a solution of adding live migration
support for SRIOV NIC.

During migration, Qemu needs to let VF driver in the VM to know
migration start and end. Qemu adds faked PCI migration capability
to help to sync status between two sides during migration.

Qemu triggers VF's mailbox irq via sending MSIX msg when migration
status is changed. VF driver tells Qemu its mailbox vector index
via the new PCI capability. In some cases(NIC is suspended or closed),
VF mailbox irq is freed and VF driver can disable irq injecting via
new capability.   

VF driver will put down nic before migration and put up again on
the target machine.

Lan Tianyu (10):
  Qemu/VFIO: Create head file pci.h to share data struct.
  Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  Qemu/VFIO: Rework vfio_std_cap_max_size() function
  Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space
    regs
  Qemu/VFIO: Expose PCI config space read/write and msix functions
  Qemu/PCI: Add macros for faked PCI migration capability
  Qemu: Add post_load_state() to run after restoring CPU state
  Qemu: Add save_before_stop callback to run just before stopping VCPU
    during migration
  Qemu/VFIO: Add SRIOV VF migration support
  Qemu/VFIO: Misc change for enable migration with VFIO

 hw/vfio/Makefile.objs       |   2 +-
 hw/vfio/pci.c               | 196 +++++++++-----------------------------------
 hw/vfio/pci.h               | 168 +++++++++++++++++++++++++++++++++++++
 hw/vfio/sriov.c             | 178 ++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci_regs.h   |  19 +++++
 include/migration/vmstate.h |   5 ++
 include/sysemu/sysemu.h     |   1 +
 linux-headers/linux/vfio.h  |  16 ++++
 migration/migration.c       |   3 +-
 migration/savevm.c          |  28 +++++++
 10 files changed, 459 insertions(+), 157 deletions(-)
 create mode 100644 hw/vfio/pci.h
 create mode 100644 hw/vfio/sriov.c

-- 
1.9.3


^ permalink raw reply	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-11-24 13:35 ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patchset is to propose a solution of adding live migration
support for SRIOV NIC.

During migration, Qemu needs to let VF driver in the VM to know
migration start and end. Qemu adds faked PCI migration capability
to help to sync status between two sides during migration.

Qemu triggers VF's mailbox irq via sending MSIX msg when migration
status is changed. VF driver tells Qemu its mailbox vector index
via the new PCI capability. In some cases(NIC is suspended or closed),
VF mailbox irq is freed and VF driver can disable irq injecting via
new capability.   

VF driver will put down nic before migration and put up again on
the target machine.

Lan Tianyu (10):
  Qemu/VFIO: Create head file pci.h to share data struct.
  Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  Qemu/VFIO: Rework vfio_std_cap_max_size() function
  Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space
    regs
  Qemu/VFIO: Expose PCI config space read/write and msix functions
  Qemu/PCI: Add macros for faked PCI migration capability
  Qemu: Add post_load_state() to run after restoring CPU state
  Qemu: Add save_before_stop callback to run just before stopping VCPU
    during migration
  Qemu/VFIO: Add SRIOV VF migration support
  Qemu/VFIO: Misc change for enable migration with VFIO

 hw/vfio/Makefile.objs       |   2 +-
 hw/vfio/pci.c               | 196 +++++++++-----------------------------------
 hw/vfio/pci.h               | 168 +++++++++++++++++++++++++++++++++++++
 hw/vfio/sriov.c             | 178 ++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci_regs.h   |  19 +++++
 include/migration/vmstate.h |   5 ++
 include/sysemu/sysemu.h     |   1 +
 linux-headers/linux/vfio.h  |  16 ++++
 migration/migration.c       |   3 +-
 migration/savevm.c          |  28 +++++++
 10 files changed, 459 insertions(+), 157 deletions(-)
 create mode 100644 hw/vfio/pci.h
 create mode 100644 hw/vfio/sriov.c

-- 
1.9.3

^ permalink raw reply	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 01/10] Qemu/VFIO: Create head file pci.h to share data struct.
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 137 +-------------------------------------------------
 hw/vfio/pci.h | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 159 insertions(+), 136 deletions(-)
 create mode 100644 hw/vfio/pci.h

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e0e339a..5c3f8a7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,138 +42,7 @@
 #include "trace.h"
 #include "hw/vfio/vfio.h"
 #include "hw/vfio/vfio-common.h"
-
-struct VFIOPCIDevice;
-
-typedef struct VFIOQuirk {
-    MemoryRegion mem;
-    struct VFIOPCIDevice *vdev;
-    QLIST_ENTRY(VFIOQuirk) next;
-    struct {
-        uint32_t base_offset:TARGET_PAGE_BITS;
-        uint32_t address_offset:TARGET_PAGE_BITS;
-        uint32_t address_size:3;
-        uint32_t bar:3;
-
-        uint32_t address_match;
-        uint32_t address_mask;
-
-        uint32_t address_val:TARGET_PAGE_BITS;
-        uint32_t data_offset:TARGET_PAGE_BITS;
-        uint32_t data_size:3;
-
-        uint8_t flags;
-        uint8_t read_flags;
-        uint8_t write_flags;
-    } data;
-} VFIOQuirk;
-
-typedef struct VFIOBAR {
-    VFIORegion region;
-    bool ioport;
-    bool mem64;
-    QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOBAR;
-
-typedef struct VFIOVGARegion {
-    MemoryRegion mem;
-    off_t offset;
-    int nr;
-    QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOVGARegion;
-
-typedef struct VFIOVGA {
-    off_t fd_offset;
-    int fd;
-    VFIOVGARegion region[QEMU_PCI_VGA_NUM_REGIONS];
-} VFIOVGA;
-
-typedef struct VFIOINTx {
-    bool pending; /* interrupt pending */
-    bool kvm_accel; /* set when QEMU bypass through KVM enabled */
-    uint8_t pin; /* which pin to pull for qemu_set_irq */
-    EventNotifier interrupt; /* eventfd triggered on interrupt */
-    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
-    PCIINTxRoute route; /* routing info for QEMU bypass */
-    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
-    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
-} VFIOINTx;
-
-typedef struct VFIOMSIVector {
-    /*
-     * Two interrupt paths are configured per vector.  The first, is only used
-     * for interrupts injected via QEMU.  This is typically the non-accel path,
-     * but may also be used when we want QEMU to handle masking and pending
-     * bits.  The KVM path bypasses QEMU and is therefore higher performance,
-     * but requires masking at the device.  virq is used to track the MSI route
-     * through KVM, thus kvm_interrupt is only available when virq is set to a
-     * valid (>= 0) value.
-     */
-    EventNotifier interrupt;
-    EventNotifier kvm_interrupt;
-    struct VFIOPCIDevice *vdev; /* back pointer to device */
-    int virq;
-    bool use;
-} VFIOMSIVector;
-
-enum {
-    VFIO_INT_NONE = 0,
-    VFIO_INT_INTx = 1,
-    VFIO_INT_MSI  = 2,
-    VFIO_INT_MSIX = 3,
-};
-
-/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
-typedef struct VFIOMSIXInfo {
-    uint8_t table_bar;
-    uint8_t pba_bar;
-    uint16_t entries;
-    uint32_t table_offset;
-    uint32_t pba_offset;
-    MemoryRegion mmap_mem;
-    void *mmap;
-} VFIOMSIXInfo;
-
-typedef struct VFIOPCIDevice {
-    PCIDevice pdev;
-    VFIODevice vbasedev;
-    VFIOINTx intx;
-    unsigned int config_size;
-    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
-    off_t config_offset; /* Offset of config space region within device fd */
-    unsigned int rom_size;
-    off_t rom_offset; /* Offset of ROM region within device fd */
-    void *rom;
-    int msi_cap_size;
-    VFIOMSIVector *msi_vectors;
-    VFIOMSIXInfo *msix;
-    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
-    int interrupt; /* Current interrupt type */
-    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
-    VFIOVGA vga; /* 0xa0000, 0x3b0, 0x3c0 */
-    PCIHostDeviceAddress host;
-    EventNotifier err_notifier;
-    EventNotifier req_notifier;
-    int (*resetfn)(struct VFIOPCIDevice *);
-    uint32_t features;
-#define VFIO_FEATURE_ENABLE_VGA_BIT 0
-#define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
-#define VFIO_FEATURE_ENABLE_REQ_BIT 1
-#define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
-    int32_t bootindex;
-    uint8_t pm_cap;
-    bool has_vga;
-    bool pci_aer;
-    bool req_enabled;
-    bool has_flr;
-    bool has_pm_reset;
-    bool rom_read_failed;
-} VFIOPCIDevice;
-
-typedef struct VFIORomBlacklistEntry {
-    uint16_t vendor_id;
-    uint16_t device_id;
-} VFIORomBlacklistEntry;
+#include "hw/vfio/pci.h"
 
 /*
  * List of device ids/vendor ids for which to disable
@@ -193,12 +62,8 @@ static const VFIORomBlacklistEntry romblacklist[] = {
     { 0x14e4, 0x168e }
 };
 
-#define MSIX_CAP_LENGTH 12
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
-static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
-static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
-                                  uint32_t val, int len);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 
 /*
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
new file mode 100644
index 0000000..9f360bf
--- /dev/null
+++ b/hw/vfio/pci.h
@@ -0,0 +1,158 @@
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+#include "trace.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/vfio-common.h"
+
+struct VFIOPCIDevice;
+
+typedef struct VFIOQuirk {
+    MemoryRegion mem;
+    struct VFIOPCIDevice *vdev;
+    QLIST_ENTRY(VFIOQuirk) next;
+    struct {
+        uint32_t base_offset:TARGET_PAGE_BITS;
+        uint32_t address_offset:TARGET_PAGE_BITS;
+        uint32_t address_size:3;
+        uint32_t bar:3;
+
+        uint32_t address_match;
+        uint32_t address_mask;
+
+        uint32_t address_val:TARGET_PAGE_BITS;
+        uint32_t data_offset:TARGET_PAGE_BITS;
+        uint32_t data_size:3;
+
+        uint8_t flags;
+        uint8_t read_flags;
+        uint8_t write_flags;
+    } data;
+} VFIOQuirk;
+
+typedef struct VFIOBAR {
+    VFIORegion region;
+    bool ioport;
+    bool mem64;
+    QLIST_HEAD(, VFIOQuirk) quirks;
+} VFIOBAR;
+
+typedef struct VFIOVGARegion {
+    MemoryRegion mem;
+    off_t offset;
+    int nr;
+    QLIST_HEAD(, VFIOQuirk) quirks;
+} VFIOVGARegion;
+
+typedef struct VFIOVGA {
+    off_t fd_offset;
+    int fd;
+    VFIOVGARegion region[QEMU_PCI_VGA_NUM_REGIONS];
+} VFIOVGA;
+
+typedef struct VFIOINTx {
+    bool pending; /* interrupt pending */
+    bool kvm_accel; /* set when QEMU bypass through KVM enabled */
+    uint8_t pin; /* which pin to pull for qemu_set_irq */
+    EventNotifier interrupt; /* eventfd triggered on interrupt */
+    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
+    PCIINTxRoute route; /* routing info for QEMU bypass */
+    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
+    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
+} VFIOINTx;
+
+typedef struct VFIOMSIVector {
+    /*
+     * Two interrupt paths are configured per vector.  The first, is only used
+     * for interrupts injected via QEMU.  This is typically the non-accel path,
+     * but may also be used when we want QEMU to handle masking and pending
+     * bits.  The KVM path bypasses QEMU and is therefore higher performance,
+     * but requires masking at the device.  virq is used to track the MSI route
+     * through KVM, thus kvm_interrupt is only available when virq is set to a
+     * valid (>= 0) value.
+     */
+    EventNotifier interrupt;
+    EventNotifier kvm_interrupt;
+    struct VFIOPCIDevice *vdev; /* back pointer to device */
+    int virq;
+    bool use;
+} VFIOMSIVector;
+
+enum {
+    VFIO_INT_NONE = 0,
+    VFIO_INT_INTx = 1,
+    VFIO_INT_MSI  = 2,
+    VFIO_INT_MSIX = 3,
+};
+
+/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
+typedef struct VFIOMSIXInfo {
+    uint8_t table_bar;
+    uint8_t pba_bar;
+    uint16_t entries;
+    uint32_t table_offset;
+    uint32_t pba_offset;
+    MemoryRegion mmap_mem;
+    void *mmap;
+} VFIOMSIXInfo;
+
+typedef struct VFIOPCIDevice {
+    PCIDevice pdev;
+    VFIODevice vbasedev;
+    VFIOINTx intx;
+    unsigned int config_size;
+    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
+    off_t config_offset; /* Offset of config space region within device fd */
+    unsigned int rom_size;
+    off_t rom_offset; /* Offset of ROM region within device fd */
+    void *rom;
+    int msi_cap_size;
+    VFIOMSIVector *msi_vectors;
+    VFIOMSIXInfo *msix;
+    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
+    int interrupt; /* Current interrupt type */
+    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+    VFIOVGA vga; /* 0xa0000, 0x3b0, 0x3c0 */
+    PCIHostDeviceAddress host;
+    EventNotifier err_notifier;
+    EventNotifier req_notifier;
+    int (*resetfn)(struct VFIOPCIDevice *);
+    uint32_t features;
+#define VFIO_FEATURE_ENABLE_VGA_BIT 0
+#define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
+#define VFIO_FEATURE_ENABLE_REQ_BIT 1
+#define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
+    int32_t bootindex;
+    uint8_t pm_cap;
+    bool has_vga;
+    bool pci_aer;
+    bool req_enabled;
+    bool has_flr;
+    bool has_pm_reset;
+    bool rom_read_failed;
+} VFIOPCIDevice;
+
+typedef struct VFIORomBlacklistEntry {
+    uint16_t vendor_id;
+    uint16_t device_id;
+} VFIORomBlacklistEntry;
+
+#define MSIX_CAP_LENGTH 12
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 01/10] Qemu/VFIO: Create head file pci.h to share data struct.
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 137 +-------------------------------------------------
 hw/vfio/pci.h | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 159 insertions(+), 136 deletions(-)
 create mode 100644 hw/vfio/pci.h

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e0e339a..5c3f8a7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,138 +42,7 @@
 #include "trace.h"
 #include "hw/vfio/vfio.h"
 #include "hw/vfio/vfio-common.h"
-
-struct VFIOPCIDevice;
-
-typedef struct VFIOQuirk {
-    MemoryRegion mem;
-    struct VFIOPCIDevice *vdev;
-    QLIST_ENTRY(VFIOQuirk) next;
-    struct {
-        uint32_t base_offset:TARGET_PAGE_BITS;
-        uint32_t address_offset:TARGET_PAGE_BITS;
-        uint32_t address_size:3;
-        uint32_t bar:3;
-
-        uint32_t address_match;
-        uint32_t address_mask;
-
-        uint32_t address_val:TARGET_PAGE_BITS;
-        uint32_t data_offset:TARGET_PAGE_BITS;
-        uint32_t data_size:3;
-
-        uint8_t flags;
-        uint8_t read_flags;
-        uint8_t write_flags;
-    } data;
-} VFIOQuirk;
-
-typedef struct VFIOBAR {
-    VFIORegion region;
-    bool ioport;
-    bool mem64;
-    QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOBAR;
-
-typedef struct VFIOVGARegion {
-    MemoryRegion mem;
-    off_t offset;
-    int nr;
-    QLIST_HEAD(, VFIOQuirk) quirks;
-} VFIOVGARegion;
-
-typedef struct VFIOVGA {
-    off_t fd_offset;
-    int fd;
-    VFIOVGARegion region[QEMU_PCI_VGA_NUM_REGIONS];
-} VFIOVGA;
-
-typedef struct VFIOINTx {
-    bool pending; /* interrupt pending */
-    bool kvm_accel; /* set when QEMU bypass through KVM enabled */
-    uint8_t pin; /* which pin to pull for qemu_set_irq */
-    EventNotifier interrupt; /* eventfd triggered on interrupt */
-    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
-    PCIINTxRoute route; /* routing info for QEMU bypass */
-    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
-    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
-} VFIOINTx;
-
-typedef struct VFIOMSIVector {
-    /*
-     * Two interrupt paths are configured per vector.  The first, is only used
-     * for interrupts injected via QEMU.  This is typically the non-accel path,
-     * but may also be used when we want QEMU to handle masking and pending
-     * bits.  The KVM path bypasses QEMU and is therefore higher performance,
-     * but requires masking at the device.  virq is used to track the MSI route
-     * through KVM, thus kvm_interrupt is only available when virq is set to a
-     * valid (>= 0) value.
-     */
-    EventNotifier interrupt;
-    EventNotifier kvm_interrupt;
-    struct VFIOPCIDevice *vdev; /* back pointer to device */
-    int virq;
-    bool use;
-} VFIOMSIVector;
-
-enum {
-    VFIO_INT_NONE = 0,
-    VFIO_INT_INTx = 1,
-    VFIO_INT_MSI  = 2,
-    VFIO_INT_MSIX = 3,
-};
-
-/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
-typedef struct VFIOMSIXInfo {
-    uint8_t table_bar;
-    uint8_t pba_bar;
-    uint16_t entries;
-    uint32_t table_offset;
-    uint32_t pba_offset;
-    MemoryRegion mmap_mem;
-    void *mmap;
-} VFIOMSIXInfo;
-
-typedef struct VFIOPCIDevice {
-    PCIDevice pdev;
-    VFIODevice vbasedev;
-    VFIOINTx intx;
-    unsigned int config_size;
-    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
-    off_t config_offset; /* Offset of config space region within device fd */
-    unsigned int rom_size;
-    off_t rom_offset; /* Offset of ROM region within device fd */
-    void *rom;
-    int msi_cap_size;
-    VFIOMSIVector *msi_vectors;
-    VFIOMSIXInfo *msix;
-    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
-    int interrupt; /* Current interrupt type */
-    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
-    VFIOVGA vga; /* 0xa0000, 0x3b0, 0x3c0 */
-    PCIHostDeviceAddress host;
-    EventNotifier err_notifier;
-    EventNotifier req_notifier;
-    int (*resetfn)(struct VFIOPCIDevice *);
-    uint32_t features;
-#define VFIO_FEATURE_ENABLE_VGA_BIT 0
-#define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
-#define VFIO_FEATURE_ENABLE_REQ_BIT 1
-#define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
-    int32_t bootindex;
-    uint8_t pm_cap;
-    bool has_vga;
-    bool pci_aer;
-    bool req_enabled;
-    bool has_flr;
-    bool has_pm_reset;
-    bool rom_read_failed;
-} VFIOPCIDevice;
-
-typedef struct VFIORomBlacklistEntry {
-    uint16_t vendor_id;
-    uint16_t device_id;
-} VFIORomBlacklistEntry;
+#include "hw/vfio/pci.h"
 
 /*
  * List of device ids/vendor ids for which to disable
@@ -193,12 +62,8 @@ static const VFIORomBlacklistEntry romblacklist[] = {
     { 0x14e4, 0x168e }
 };
 
-#define MSIX_CAP_LENGTH 12
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
-static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
-static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
-                                  uint32_t val, int len);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 
 /*
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
new file mode 100644
index 0000000..9f360bf
--- /dev/null
+++ b/hw/vfio/pci.h
@@ -0,0 +1,158 @@
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+#include "trace.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/vfio-common.h"
+
+struct VFIOPCIDevice;
+
+typedef struct VFIOQuirk {
+    MemoryRegion mem;
+    struct VFIOPCIDevice *vdev;
+    QLIST_ENTRY(VFIOQuirk) next;
+    struct {
+        uint32_t base_offset:TARGET_PAGE_BITS;
+        uint32_t address_offset:TARGET_PAGE_BITS;
+        uint32_t address_size:3;
+        uint32_t bar:3;
+
+        uint32_t address_match;
+        uint32_t address_mask;
+
+        uint32_t address_val:TARGET_PAGE_BITS;
+        uint32_t data_offset:TARGET_PAGE_BITS;
+        uint32_t data_size:3;
+
+        uint8_t flags;
+        uint8_t read_flags;
+        uint8_t write_flags;
+    } data;
+} VFIOQuirk;
+
+typedef struct VFIOBAR {
+    VFIORegion region;
+    bool ioport;
+    bool mem64;
+    QLIST_HEAD(, VFIOQuirk) quirks;
+} VFIOBAR;
+
+typedef struct VFIOVGARegion {
+    MemoryRegion mem;
+    off_t offset;
+    int nr;
+    QLIST_HEAD(, VFIOQuirk) quirks;
+} VFIOVGARegion;
+
+typedef struct VFIOVGA {
+    off_t fd_offset;
+    int fd;
+    VFIOVGARegion region[QEMU_PCI_VGA_NUM_REGIONS];
+} VFIOVGA;
+
+typedef struct VFIOINTx {
+    bool pending; /* interrupt pending */
+    bool kvm_accel; /* set when QEMU bypass through KVM enabled */
+    uint8_t pin; /* which pin to pull for qemu_set_irq */
+    EventNotifier interrupt; /* eventfd triggered on interrupt */
+    EventNotifier unmask; /* eventfd for unmask on QEMU bypass */
+    PCIINTxRoute route; /* routing info for QEMU bypass */
+    uint32_t mmap_timeout; /* delay to re-enable mmaps after interrupt */
+    QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
+} VFIOINTx;
+
+typedef struct VFIOMSIVector {
+    /*
+     * Two interrupt paths are configured per vector.  The first, is only used
+     * for interrupts injected via QEMU.  This is typically the non-accel path,
+     * but may also be used when we want QEMU to handle masking and pending
+     * bits.  The KVM path bypasses QEMU and is therefore higher performance,
+     * but requires masking at the device.  virq is used to track the MSI route
+     * through KVM, thus kvm_interrupt is only available when virq is set to a
+     * valid (>= 0) value.
+     */
+    EventNotifier interrupt;
+    EventNotifier kvm_interrupt;
+    struct VFIOPCIDevice *vdev; /* back pointer to device */
+    int virq;
+    bool use;
+} VFIOMSIVector;
+
+enum {
+    VFIO_INT_NONE = 0,
+    VFIO_INT_INTx = 1,
+    VFIO_INT_MSI  = 2,
+    VFIO_INT_MSIX = 3,
+};
+
+/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
+typedef struct VFIOMSIXInfo {
+    uint8_t table_bar;
+    uint8_t pba_bar;
+    uint16_t entries;
+    uint32_t table_offset;
+    uint32_t pba_offset;
+    MemoryRegion mmap_mem;
+    void *mmap;
+} VFIOMSIXInfo;
+
+typedef struct VFIOPCIDevice {
+    PCIDevice pdev;
+    VFIODevice vbasedev;
+    VFIOINTx intx;
+    unsigned int config_size;
+    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
+    off_t config_offset; /* Offset of config space region within device fd */
+    unsigned int rom_size;
+    off_t rom_offset; /* Offset of ROM region within device fd */
+    void *rom;
+    int msi_cap_size;
+    VFIOMSIVector *msi_vectors;
+    VFIOMSIXInfo *msix;
+    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
+    int interrupt; /* Current interrupt type */
+    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+    VFIOVGA vga; /* 0xa0000, 0x3b0, 0x3c0 */
+    PCIHostDeviceAddress host;
+    EventNotifier err_notifier;
+    EventNotifier req_notifier;
+    int (*resetfn)(struct VFIOPCIDevice *);
+    uint32_t features;
+#define VFIO_FEATURE_ENABLE_VGA_BIT 0
+#define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
+#define VFIO_FEATURE_ENABLE_REQ_BIT 1
+#define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
+    int32_t bootindex;
+    uint8_t pm_cap;
+    bool has_vga;
+    bool pci_aer;
+    bool req_enabled;
+    bool has_flr;
+    bool has_pm_reset;
+    bool rom_read_failed;
+} VFIOPCIDevice;
+
+typedef struct VFIORomBlacklistEntry {
+    uint16_t vendor_id;
+    uint16_t device_id;
+} VFIORomBlacklistEntry;
+
+#define MSIX_CAP_LENGTH 12
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 linux-headers/linux/vfio.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 0508d0b..732b0bd 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -495,6 +495,22 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
 
+
+#define VFIO_FIND_FREE_PCI_CONFIG_REG   _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+#define VFIO_GET_PCI_CAP_INFO   _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+struct vfio_pci_cap_info {
+    __u32 argsz;
+    __u32 flags;
+#define VFIO_PCI_CAP_GET_SIZE (1 << 0)
+#define VFIO_PCI_CAP_GET_FREE_REGION (1 << 1)
+    __u32 index;
+    __u32 offset;
+    __u32 size;
+    __u8 cap;
+};
+
 /* ***************************************************************** */
 
 #endif /* VFIO_H */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 linux-headers/linux/vfio.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 0508d0b..732b0bd 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -495,6 +495,22 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
 
+
+#define VFIO_FIND_FREE_PCI_CONFIG_REG   _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+#define VFIO_GET_PCI_CAP_INFO   _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+struct vfio_pci_cap_info {
+    __u32 argsz;
+    __u32 flags;
+#define VFIO_PCI_CAP_GET_SIZE (1 << 0)
+#define VFIO_PCI_CAP_GET_FREE_REGION (1 << 1)
+    __u32 index;
+    __u32 offset;
+    __u32 size;
+    __u8 cap;
+};
+
 /* ***************************************************************** */
 
 #endif /* VFIO_H */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 03/10] Qemu/VFIO: Rework vfio_std_cap_max_size() function
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Use new ioctl cmd VFIO_GET_PCI_CAP_INFO to get PCI cap table size.
This helps to get accurate table size and faciliate to find free
PCI config space regs for faked PCI capability. Current code assigns
PCI config space regs from the start of last PCI capability table to
pos 0xff to the last capability and occupy some free PCI config space
regs.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5c3f8a7..29845e3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2344,18 +2344,20 @@ static void vfio_unmap_bars(VFIOPCIDevice *vdev)
 /*
  * General setup
  */
-static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
+static uint8_t vfio_std_cap_max_size(VFIOPCIDevice *vdev, uint8_t cap)
 {
-    uint8_t tmp, next = 0xff;
+    struct vfio_pci_cap_info reg_info = {
+        .argsz = sizeof(reg_info),
+        .index = VFIO_PCI_CAP_GET_SIZE,
+        .cap = cap
+    };
+    int ret;
 
-    for (tmp = pdev->config[PCI_CAPABILITY_LIST]; tmp;
-         tmp = pdev->config[tmp + 1]) {
-        if (tmp > pos && tmp < next) {
-            next = tmp;
-        }
-    }
+    ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+    if (ret || reg_info.size == 0)
+        error_report("vfio: Failed to find free PCI config reg: %m\n");
 
-    return next - pos;
+    return reg_info.size;
 }
 
 static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
@@ -2521,7 +2523,7 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
      * Since QEMU doesn't actually handle many of the config accesses,
      * exact size doesn't seem worthwhile.
      */
-    size = vfio_std_cap_max_size(pdev, pos);
+    size = vfio_std_cap_max_size(vdev, cap_id);
 
     /*
      * pci_add_capability always inserts the new capability at the head
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 03/10] Qemu/VFIO: Rework vfio_std_cap_max_size() function
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Use new ioctl cmd VFIO_GET_PCI_CAP_INFO to get PCI cap table size.
This helps to get accurate table size and faciliate to find free
PCI config space regs for faked PCI capability. Current code assigns
PCI config space regs from the start of last PCI capability table to
pos 0xff to the last capability and occupy some free PCI config space
regs.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5c3f8a7..29845e3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2344,18 +2344,20 @@ static void vfio_unmap_bars(VFIOPCIDevice *vdev)
 /*
  * General setup
  */
-static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
+static uint8_t vfio_std_cap_max_size(VFIOPCIDevice *vdev, uint8_t cap)
 {
-    uint8_t tmp, next = 0xff;
+    struct vfio_pci_cap_info reg_info = {
+        .argsz = sizeof(reg_info),
+        .index = VFIO_PCI_CAP_GET_SIZE,
+        .cap = cap
+    };
+    int ret;
 
-    for (tmp = pdev->config[PCI_CAPABILITY_LIST]; tmp;
-         tmp = pdev->config[tmp + 1]) {
-        if (tmp > pos && tmp < next) {
-            next = tmp;
-        }
-    }
+    ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+    if (ret || reg_info.size == 0)
+        error_report("vfio: Failed to find free PCI config reg: %m\n");
 
-    return next - pos;
+    return reg_info.size;
 }
 
 static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
@@ -2521,7 +2523,7 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
      * Since QEMU doesn't actually handle many of the config accesses,
      * exact size doesn't seem worthwhile.
      */
-    size = vfio_std_cap_max_size(pdev, pos);
+    size = vfio_std_cap_max_size(vdev, cap_id);
 
     /*
      * pci_add_capability always inserts the new capability at the head
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 04/10] Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space regs
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patch is to add ioctl wrap to find free PCI config sapce regs.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 19 +++++++++++++++++++
 hw/vfio/pci.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 29845e3..d0354a0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2508,6 +2508,25 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
     }
 }
 
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size)
+{
+    struct vfio_pci_cap_info reg_info = {
+        .argsz = sizeof(reg_info),
+        .offset = pos,
+        .index = VFIO_PCI_CAP_GET_FREE_REGION,
+        .size = size,
+    };
+    int ret;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+    if (ret || reg_info.offset == 0) { 
+        error_report("vfio: Failed to find free PCI config reg: %m\n");
+        return -EFAULT;
+    }
+
+    return reg_info.offset; 
+}
+
 static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
 {
     PCIDevice *pdev = &vdev->pdev;
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 9f360bf..6083300 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -156,3 +156,5 @@ typedef struct VFIORomBlacklistEntry {
 } VFIORomBlacklistEntry;
 
 #define MSIX_CAP_LENGTH 12
+
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 04/10] Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space regs
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patch is to add ioctl wrap to find free PCI config sapce regs.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 19 +++++++++++++++++++
 hw/vfio/pci.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 29845e3..d0354a0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2508,6 +2508,25 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
     }
 }
 
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size)
+{
+    struct vfio_pci_cap_info reg_info = {
+        .argsz = sizeof(reg_info),
+        .offset = pos,
+        .index = VFIO_PCI_CAP_GET_FREE_REGION,
+        .size = size,
+    };
+    int ret;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_GET_PCI_CAP_INFO, &reg_info);
+    if (ret || reg_info.offset == 0) { 
+        error_report("vfio: Failed to find free PCI config reg: %m\n");
+        return -EFAULT;
+    }
+
+    return reg_info.offset; 
+}
+
 static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
 {
     PCIDevice *pdev = &vdev->pdev;
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 9f360bf..6083300 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -156,3 +156,5 @@ typedef struct VFIORomBlacklistEntry {
 } VFIORomBlacklistEntry;
 
 #define MSIX_CAP_LENGTH 12
+
+uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 05/10] Qemu/VFIO: Expose PCI config space read/write and msix functions
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 6 +++---
 hw/vfio/pci.h | 4 ++++
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d0354a0..7c43fc1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -613,7 +613,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
     }
 }
 
-static void vfio_enable_msix(VFIOPCIDevice *vdev)
+void vfio_enable_msix(VFIOPCIDevice *vdev)
 {
     vfio_disable_interrupts(vdev);
 
@@ -1931,7 +1931,7 @@ static void vfio_bar_quirk_free(VFIOPCIDevice *vdev, int nr)
 /*
  * PCI config space
  */
-static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
     uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
@@ -1964,7 +1964,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
     return val;
 }
 
-static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                                   uint32_t val, int len)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6083300..6c00575 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -158,3 +158,7 @@ typedef struct VFIORomBlacklistEntry {
 #define MSIX_CAP_LENGTH 12
 
 uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+                           uint32_t val, int len);
+void vfio_enable_msix(VFIOPCIDevice *vdev);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 05/10] Qemu/VFIO: Expose PCI config space read/write and msix functions
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 6 +++---
 hw/vfio/pci.h | 4 ++++
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d0354a0..7c43fc1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -613,7 +613,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
     }
 }
 
-static void vfio_enable_msix(VFIOPCIDevice *vdev)
+void vfio_enable_msix(VFIOPCIDevice *vdev)
 {
     vfio_disable_interrupts(vdev);
 
@@ -1931,7 +1931,7 @@ static void vfio_bar_quirk_free(VFIOPCIDevice *vdev, int nr)
 /*
  * PCI config space
  */
-static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
     uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
@@ -1964,7 +1964,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
     return val;
 }
 
-static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                                   uint32_t val, int len)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6083300..6c00575 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -158,3 +158,7 @@ typedef struct VFIORomBlacklistEntry {
 #define MSIX_CAP_LENGTH 12
 
 uint8_t vfio_find_free_cfg_reg(VFIOPCIDevice *vdev, int pos, uint8_t size);
+uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
+void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+                           uint32_t val, int len);
+void vfio_enable_msix(VFIOPCIDevice *vdev);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patch is to extend PCI CAP id for migration cap and
add reg macros. The CAP ID is trial and we may find better one if the
solution is feasible.

*PCI_VF_MIGRATION_CAP
For VF driver to  control that triggers mailbox irq or not during migration.

*PCI_VF_MIGRATION_VMM_STATUS
Qemu stores migration status in the reg

*PCI_VF_MIGRATION_VF_STATUS
VF driver tells Qemu ready for migration

*PCI_VF_MIGRATION_IRQ
VF driver stores mailbox interrupt vector in the reg for Qemu to trigger during migration.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 include/hw/pci/pci_regs.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index 57e8c80..0dcaf7e 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -213,6 +213,7 @@
 #define  PCI_CAP_ID_MSIX	0x11	/* MSI-X */
 #define  PCI_CAP_ID_SATA	0x12	/* Serial ATA */
 #define  PCI_CAP_ID_AF		0x13	/* PCI Advanced Features */
+#define  PCI_CAP_ID_MIGRATION   0x14 
 #define PCI_CAP_LIST_NEXT	1	/* Next capability in the list */
 #define PCI_CAP_FLAGS		2	/* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF		4
@@ -716,4 +717,22 @@
 #define PCI_ACS_CTRL		0x06	/* ACS Control Register */
 #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress Control Vector */
 
+/* Migration*/
+#define PCI_VF_MIGRATION_CAP        0x04
+#define PCI_VF_MIGRATION_VMM_STATUS	0x05
+#define PCI_VF_MIGRATION_VF_STATUS	0x06
+#define PCI_VF_MIGRATION_IRQ		0x07
+
+#define PCI_VF_MIGRATION_CAP_SIZE   0x08
+
+#define VMM_MIGRATION_END        0x00
+#define VMM_MIGRATION_START      0x01          
+
+#define PCI_VF_WAIT_FOR_MIGRATION   0x00          
+#define PCI_VF_READY_FOR_MIGRATION  0x01        
+
+#define PCI_VF_MIGRATION_DISABLE    0x00
+#define PCI_VF_MIGRATION_ENABLE     0x01
+
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patch is to extend PCI CAP id for migration cap and
add reg macros. The CAP ID is trial and we may find better one if the
solution is feasible.

*PCI_VF_MIGRATION_CAP
For VF driver to  control that triggers mailbox irq or not during migration.

*PCI_VF_MIGRATION_VMM_STATUS
Qemu stores migration status in the reg

*PCI_VF_MIGRATION_VF_STATUS
VF driver tells Qemu ready for migration

*PCI_VF_MIGRATION_IRQ
VF driver stores mailbox interrupt vector in the reg for Qemu to trigger during migration.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 include/hw/pci/pci_regs.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index 57e8c80..0dcaf7e 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -213,6 +213,7 @@
 #define  PCI_CAP_ID_MSIX	0x11	/* MSI-X */
 #define  PCI_CAP_ID_SATA	0x12	/* Serial ATA */
 #define  PCI_CAP_ID_AF		0x13	/* PCI Advanced Features */
+#define  PCI_CAP_ID_MIGRATION   0x14 
 #define PCI_CAP_LIST_NEXT	1	/* Next capability in the list */
 #define PCI_CAP_FLAGS		2	/* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF		4
@@ -716,4 +717,22 @@
 #define PCI_ACS_CTRL		0x06	/* ACS Control Register */
 #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress Control Vector */
 
+/* Migration*/
+#define PCI_VF_MIGRATION_CAP        0x04
+#define PCI_VF_MIGRATION_VMM_STATUS	0x05
+#define PCI_VF_MIGRATION_VF_STATUS	0x06
+#define PCI_VF_MIGRATION_IRQ		0x07
+
+#define PCI_VF_MIGRATION_CAP_SIZE   0x08
+
+#define VMM_MIGRATION_END        0x00
+#define VMM_MIGRATION_START      0x01          
+
+#define PCI_VF_WAIT_FOR_MIGRATION   0x00          
+#define PCI_VF_READY_FOR_MIGRATION  0x01        
+
+#define PCI_VF_MIGRATION_DISABLE    0x00
+#define PCI_VF_MIGRATION_ENABLE     0x01
+
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 07/10] Qemu: Add post_load_state() to run after restoring CPU state
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

After migration, Qemu needs to trigger mailbox irq to notify VF driver
in the guest about status change. The irq delivery restarts to work after
restoring CPU state. This patch is to add new callback to run after
restoring CPU state and provide a way to trigger mailbox irq later.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 include/migration/vmstate.h |  2 ++
 migration/savevm.c          | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 0695d7c..dc681a6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -56,6 +56,8 @@ typedef struct SaveVMHandlers {
     int (*save_live_setup)(QEMUFile *f, void *opaque);
     uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size);
 
+    /* This runs after restoring CPU related state */
+    void (*post_load_state)(void *opaque);
     LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 9e0e286..48b6223 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp)
     return false;
 }
 
+void qemu_savevm_post_load(void)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->post_load_state) {
+            continue;
+        }
+    
+        se->ops->post_load_state(se->opaque);
+    }
+}
+
+
 void qemu_savevm_state_header(QEMUFile *f)
 {
     trace_savevm_state_header();
@@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f)
     }
 
     cpu_synchronize_all_post_init();
+    qemu_savevm_post_load();
 
     ret = 0;
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 07/10] Qemu: Add post_load_state() to run after restoring CPU state
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

After migration, Qemu needs to trigger mailbox irq to notify VF driver
in the guest about status change. The irq delivery restarts to work after
restoring CPU state. This patch is to add new callback to run after
restoring CPU state and provide a way to trigger mailbox irq later.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 include/migration/vmstate.h |  2 ++
 migration/savevm.c          | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 0695d7c..dc681a6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -56,6 +56,8 @@ typedef struct SaveVMHandlers {
     int (*save_live_setup)(QEMUFile *f, void *opaque);
     uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size);
 
+    /* This runs after restoring CPU related state */
+    void (*post_load_state)(void *opaque);
     LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 9e0e286..48b6223 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp)
     return false;
 }
 
+void qemu_savevm_post_load(void)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->post_load_state) {
+            continue;
+        }
+    
+        se->ops->post_load_state(se->opaque);
+    }
+}
+
+
 void qemu_savevm_state_header(QEMUFile *f)
 {
     trace_savevm_state_header();
@@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f)
     }
 
     cpu_synchronize_all_post_init();
+    qemu_savevm_post_load();
 
     ret = 0;
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 08/10] Qemu: Add save_before_stop callback to run just before stopping VCPU during migration
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patch is to add a callback which is called just before stopping VCPU.
It's for VF migration to trigger mailbox irq as later as possible to
decrease service downtime.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 include/migration/vmstate.h |  3 +++
 include/sysemu/sysemu.h     |  1 +
 migration/migration.c       |  3 ++-
 migration/savevm.c          | 13 +++++++++++++
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index dc681a6..093faf1 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -58,6 +58,9 @@ typedef struct SaveVMHandlers {
 
     /* This runs after restoring CPU related state */
     void (*post_load_state)(void *opaque);
+
+    /* This runs before stopping VCPU */
+    void (*save_before_stop)(QEMUFile *f, void *opaque);
     LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index df80951..3d0d72c 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -84,6 +84,7 @@ void qemu_announce_self(void);
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_begin(QEMUFile *f,
                              const MigrationParams *params);
+void qemu_savevm_save_before_stop(QEMUFile *f);
 void qemu_savevm_state_header(QEMUFile *f);
 int qemu_savevm_state_iterate(QEMUFile *f);
 void qemu_savevm_state_complete(QEMUFile *f);
diff --git a/migration/migration.c b/migration/migration.c
index c6ac08a..fccadea 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -759,7 +759,6 @@ int64_t migrate_xbzrle_cache_size(void)
 }
 
 /* migration thread support */
-
 static void *migration_thread(void *opaque)
 {
     MigrationState *s = opaque;
@@ -788,6 +787,8 @@ static void *migration_thread(void *opaque)
             } else {
                 int ret;
 
+                qemu_savevm_save_before_stop(s->file);
+
                 qemu_mutex_lock_iothread();
                 start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
                 qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
diff --git a/migration/savevm.c b/migration/savevm.c
index 48b6223..c2e4802 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -715,6 +715,19 @@ void qemu_savevm_post_load(void)
     }
 }
 
+void qemu_savevm_save_before_stop(QEMUFile *f)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->save_before_stop) {
+            continue;
+        }
+   
+        se->ops->save_before_stop(f, se->opaque);
+    }
+}
+
 
 void qemu_savevm_state_header(QEMUFile *f)
 {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 08/10] Qemu: Add save_before_stop callback to run just before stopping VCPU during migration
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patch is to add a callback which is called just before stopping VCPU.
It's for VF migration to trigger mailbox irq as later as possible to
decrease service downtime.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 include/migration/vmstate.h |  3 +++
 include/sysemu/sysemu.h     |  1 +
 migration/migration.c       |  3 ++-
 migration/savevm.c          | 13 +++++++++++++
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index dc681a6..093faf1 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -58,6 +58,9 @@ typedef struct SaveVMHandlers {
 
     /* This runs after restoring CPU related state */
     void (*post_load_state)(void *opaque);
+
+    /* This runs before stopping VCPU */
+    void (*save_before_stop)(QEMUFile *f, void *opaque);
     LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index df80951..3d0d72c 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -84,6 +84,7 @@ void qemu_announce_self(void);
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_begin(QEMUFile *f,
                              const MigrationParams *params);
+void qemu_savevm_save_before_stop(QEMUFile *f);
 void qemu_savevm_state_header(QEMUFile *f);
 int qemu_savevm_state_iterate(QEMUFile *f);
 void qemu_savevm_state_complete(QEMUFile *f);
diff --git a/migration/migration.c b/migration/migration.c
index c6ac08a..fccadea 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -759,7 +759,6 @@ int64_t migrate_xbzrle_cache_size(void)
 }
 
 /* migration thread support */
-
 static void *migration_thread(void *opaque)
 {
     MigrationState *s = opaque;
@@ -788,6 +787,8 @@ static void *migration_thread(void *opaque)
             } else {
                 int ret;
 
+                qemu_savevm_save_before_stop(s->file);
+
                 qemu_mutex_lock_iothread();
                 start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
                 qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
diff --git a/migration/savevm.c b/migration/savevm.c
index 48b6223..c2e4802 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -715,6 +715,19 @@ void qemu_savevm_post_load(void)
     }
 }
 
+void qemu_savevm_save_before_stop(QEMUFile *f)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->save_before_stop) {
+            continue;
+        }
+   
+        se->ops->save_before_stop(f, se->opaque);
+    }
+}
+
 
 void qemu_savevm_state_header(QEMUFile *f)
 {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patch is to add SRIOV VF migration support.
Create new device type "vfio-sriov" and add faked PCI migration capability
to the type device.

The purpose of the new capability
1) sync migration status with VF driver in the VM
2) Get mailbox irq vector to notify VF driver during migration.
3) Provide a way to control injecting irq or not.

Qemu will migrate PCI configure space regs and MSIX config for VF.
Inject mailbox irq at last stage of migration to notify VF about
migration event and wait VF driver ready for migration. VF driver
writeS PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
to tell Qemu.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/Makefile.objs |   2 +-
 hw/vfio/pci.c         |   6 ++
 hw/vfio/pci.h         |   4 ++
 hw/vfio/sriov.c       | 178 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 189 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/sriov.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d540c9d..9cf0178 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
-obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_PCI) += pci.o sriov.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7c43fc1..e7583b5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
         } else if (was_enabled && !is_enabled) {
             vfio_disable_msix(vdev);
         }
+    } else if (vdev->migration_cap &&
+        ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
+        /* Write everything to QEMU to keep emulated bits correct */
+        pci_default_write_config(pdev, addr, val, len);
+        vfio_migration_cap_handle(pdev, addr, val, len);
     } else {
         /* Write everything to QEMU to keep emulated bits correct */
         pci_default_write_config(pdev, addr, val, len);
@@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn(vdev);
+    vfio_add_migration_capability(vdev);
 
     return 0;
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6c00575..ee6ca5e 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
+    uint16_t    migration_cap;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t features;
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
@@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
 void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                            uint32_t val, int len);
 void vfio_enable_msix(VFIOPCIDevice *vdev);
+void vfio_add_migration_capability(VFIOPCIDevice *vdev);
+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
+                               uint32_t val, int len);
diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
new file mode 100644
index 0000000..3109538
--- /dev/null
+++ b/hw/vfio/sriov.c
@@ -0,0 +1,178 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/io.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <glob.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+
+#include "hw/hw.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/vfio-common.h"
+
+#define TYPE_VFIO_SRIOV "vfio-sriov"
+
+#define SRIOV_LM_SETUP 0x01
+#define SRIOV_LM_COMPLETE 0x02
+
+QemuEvent migration_event;
+
+static void vfio_dev_post_load(void *opaque)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    MSIMessage msg;
+    int vector;
+
+    if (vfio_pci_read_config(pdev,
+            vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
+            != PCI_VF_MIGRATION_ENABLE)
+        return;
+
+    vector = vfio_pci_read_config(pdev,
+        vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
+
+    msg = msix_get_message(pdev, vector);
+    kvm_irqchip_send_msi(kvm_state, msg);
+}
+
+static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    int ret;
+
+    if(qemu_get_byte(f)!= SRIOV_LM_COMPLETE)
+        return 0;
+
+    ret = pci_device_load(pdev, f);
+    if (ret) {
+        error_report("Faild to load PCI config space.\n");
+        return ret;
+    }
+
+    if (msix_enabled(pdev)) {
+        vfio_enable_msix(vdev);
+        msix_load(pdev, f);
+    }
+
+    vfio_pci_write_config(pdev,vdev->migration_cap +
+        PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_END, 1);
+    vfio_pci_write_config(pdev,vdev->migration_cap +
+        PCI_VF_MIGRATION_VF_STATUS, PCI_VF_WAIT_FOR_MIGRATION, 1);
+    return 0;
+}
+
+static int vfio_dev_save_complete(QEMUFile *f, void *opaque)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+
+    qemu_put_byte(f, SRIOV_LM_COMPLETE);
+    pci_device_save(pdev, f);
+
+    if (msix_enabled(pdev)) {
+        msix_save(pdev, f);
+    }
+
+    return 0;
+}
+
+static int vfio_dev_setup(QEMUFile *f, void *opaque)
+{
+    qemu_put_byte(f, SRIOV_LM_SETUP);
+    return 0;
+}
+
+static void vfio_dev_save_before_stop(QEMUFile *f, void *opaque)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    int vector;
+    MSIMessage msg;
+
+    vfio_pci_write_config(pdev, vdev->migration_cap +
+        PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_START, 1);
+
+    if (vfio_pci_read_config(pdev,
+            vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
+            != PCI_VF_MIGRATION_ENABLE)
+        return;
+
+    vector = vfio_pci_read_config(pdev,
+        vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
+
+    qemu_event_reset(&migration_event);
+
+    msg = msix_get_message(pdev, vector);
+    kvm_irqchip_send_msi(kvm_state, msg);
+
+    qemu_event_wait(&migration_event);
+}
+
+static SaveVMHandlers savevm_pt_handlers = {
+    .save_live_setup = vfio_dev_setup,
+    .save_live_complete = vfio_dev_save_complete,
+    .save_before_stop = vfio_dev_save_before_stop,          
+    .load_state = vfio_dev_load,
+    .post_load_state = vfio_dev_post_load,
+};
+
+void vfio_add_migration_capability(VFIOPCIDevice *vdev)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    int free_pos;
+
+    if (strcmp(object_get_typename(OBJECT(vdev)), TYPE_VFIO_SRIOV))
+        return;
+
+    free_pos = vfio_find_free_cfg_reg(vdev,
+                pdev->config[PCI_CAPABILITY_LIST],
+                PCI_VF_MIGRATION_CAP_SIZE);
+    if (free_pos) {
+        vdev->migration_cap = free_pos;
+    	pci_add_capability(pdev, PCI_CAP_ID_MIGRATION,
+                        free_pos, PCI_VF_MIGRATION_CAP_SIZE);
+    	memset(vdev->emulated_config_bits + free_pos, 0xff,
+                        PCI_VF_MIGRATION_CAP_SIZE);
+    	memset(vdev->pdev.wmask + free_pos, 0xff,
+                        PCI_VF_MIGRATION_CAP_SIZE);
+     } else
+        error_report("vfio: Fail to find free PCI config space regs.\n");
+}
+
+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
+                                  uint32_t val, int len)
+{
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+
+    if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
+        && val == PCI_VF_READY_FOR_MIGRATION) {       
+        qemu_event_set(&migration_event);
+    }
+}
+
+static void vfio_sriov_instance_init(Object *obj)
+{
+    PCIDevice *pdev = PCI_DEVICE(obj);
+
+    register_savevm_live(NULL, "vfio-sriov", 1, 1,
+                         &savevm_pt_handlers, pdev);
+
+    qemu_event_init(&migration_event, false);
+
+}
+
+static const TypeInfo vfio_sriov_type_info = {
+    .name = TYPE_VFIO_SRIOV,
+    .parent = "vfio-pci", 
+    .instance_init = vfio_sriov_instance_init,
+};
+
+static void sriov_register_types(void)
+{
+    type_register_static(&vfio_sriov_type_info);
+}
+type_init(sriov_register_types)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

This patch is to add SRIOV VF migration support.
Create new device type "vfio-sriov" and add faked PCI migration capability
to the type device.

The purpose of the new capability
1) sync migration status with VF driver in the VM
2) Get mailbox irq vector to notify VF driver during migration.
3) Provide a way to control injecting irq or not.

Qemu will migrate PCI configure space regs and MSIX config for VF.
Inject mailbox irq at last stage of migration to notify VF about
migration event and wait VF driver ready for migration. VF driver
writeS PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
to tell Qemu.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/Makefile.objs |   2 +-
 hw/vfio/pci.c         |   6 ++
 hw/vfio/pci.h         |   4 ++
 hw/vfio/sriov.c       | 178 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 189 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/sriov.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d540c9d..9cf0178 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
-obj-$(CONFIG_PCI) += pci.o
+obj-$(CONFIG_PCI) += pci.o sriov.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7c43fc1..e7583b5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
         } else if (was_enabled && !is_enabled) {
             vfio_disable_msix(vdev);
         }
+    } else if (vdev->migration_cap &&
+        ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
+        /* Write everything to QEMU to keep emulated bits correct */
+        pci_default_write_config(pdev, addr, val, len);
+        vfio_migration_cap_handle(pdev, addr, val, len);
     } else {
         /* Write everything to QEMU to keep emulated bits correct */
         pci_default_write_config(pdev, addr, val, len);
@@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn(vdev);
+    vfio_add_migration_capability(vdev);
 
     return 0;
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6c00575..ee6ca5e 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
+    uint16_t    migration_cap;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t features;
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
@@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
 void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                            uint32_t val, int len);
 void vfio_enable_msix(VFIOPCIDevice *vdev);
+void vfio_add_migration_capability(VFIOPCIDevice *vdev);
+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
+                               uint32_t val, int len);
diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
new file mode 100644
index 0000000..3109538
--- /dev/null
+++ b/hw/vfio/sriov.c
@@ -0,0 +1,178 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/io.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <glob.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+
+#include "hw/hw.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/vfio-common.h"
+
+#define TYPE_VFIO_SRIOV "vfio-sriov"
+
+#define SRIOV_LM_SETUP 0x01
+#define SRIOV_LM_COMPLETE 0x02
+
+QemuEvent migration_event;
+
+static void vfio_dev_post_load(void *opaque)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    MSIMessage msg;
+    int vector;
+
+    if (vfio_pci_read_config(pdev,
+            vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
+            != PCI_VF_MIGRATION_ENABLE)
+        return;
+
+    vector = vfio_pci_read_config(pdev,
+        vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
+
+    msg = msix_get_message(pdev, vector);
+    kvm_irqchip_send_msi(kvm_state, msg);
+}
+
+static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    int ret;
+
+    if(qemu_get_byte(f)!= SRIOV_LM_COMPLETE)
+        return 0;
+
+    ret = pci_device_load(pdev, f);
+    if (ret) {
+        error_report("Faild to load PCI config space.\n");
+        return ret;
+    }
+
+    if (msix_enabled(pdev)) {
+        vfio_enable_msix(vdev);
+        msix_load(pdev, f);
+    }
+
+    vfio_pci_write_config(pdev,vdev->migration_cap +
+        PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_END, 1);
+    vfio_pci_write_config(pdev,vdev->migration_cap +
+        PCI_VF_MIGRATION_VF_STATUS, PCI_VF_WAIT_FOR_MIGRATION, 1);
+    return 0;
+}
+
+static int vfio_dev_save_complete(QEMUFile *f, void *opaque)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+
+    qemu_put_byte(f, SRIOV_LM_COMPLETE);
+    pci_device_save(pdev, f);
+
+    if (msix_enabled(pdev)) {
+        msix_save(pdev, f);
+    }
+
+    return 0;
+}
+
+static int vfio_dev_setup(QEMUFile *f, void *opaque)
+{
+    qemu_put_byte(f, SRIOV_LM_SETUP);
+    return 0;
+}
+
+static void vfio_dev_save_before_stop(QEMUFile *f, void *opaque)
+{
+    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    int vector;
+    MSIMessage msg;
+
+    vfio_pci_write_config(pdev, vdev->migration_cap +
+        PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_START, 1);
+
+    if (vfio_pci_read_config(pdev,
+            vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
+            != PCI_VF_MIGRATION_ENABLE)
+        return;
+
+    vector = vfio_pci_read_config(pdev,
+        vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
+
+    qemu_event_reset(&migration_event);
+
+    msg = msix_get_message(pdev, vector);
+    kvm_irqchip_send_msi(kvm_state, msg);
+
+    qemu_event_wait(&migration_event);
+}
+
+static SaveVMHandlers savevm_pt_handlers = {
+    .save_live_setup = vfio_dev_setup,
+    .save_live_complete = vfio_dev_save_complete,
+    .save_before_stop = vfio_dev_save_before_stop,          
+    .load_state = vfio_dev_load,
+    .post_load_state = vfio_dev_post_load,
+};
+
+void vfio_add_migration_capability(VFIOPCIDevice *vdev)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    int free_pos;
+
+    if (strcmp(object_get_typename(OBJECT(vdev)), TYPE_VFIO_SRIOV))
+        return;
+
+    free_pos = vfio_find_free_cfg_reg(vdev,
+                pdev->config[PCI_CAPABILITY_LIST],
+                PCI_VF_MIGRATION_CAP_SIZE);
+    if (free_pos) {
+        vdev->migration_cap = free_pos;
+    	pci_add_capability(pdev, PCI_CAP_ID_MIGRATION,
+                        free_pos, PCI_VF_MIGRATION_CAP_SIZE);
+    	memset(vdev->emulated_config_bits + free_pos, 0xff,
+                        PCI_VF_MIGRATION_CAP_SIZE);
+    	memset(vdev->pdev.wmask + free_pos, 0xff,
+                        PCI_VF_MIGRATION_CAP_SIZE);
+     } else
+        error_report("vfio: Fail to find free PCI config space regs.\n");
+}
+
+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
+                                  uint32_t val, int len)
+{
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+
+    if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
+        && val == PCI_VF_READY_FOR_MIGRATION) {       
+        qemu_event_set(&migration_event);
+    }
+}
+
+static void vfio_sriov_instance_init(Object *obj)
+{
+    PCIDevice *pdev = PCI_DEVICE(obj);
+
+    register_savevm_live(NULL, "vfio-sriov", 1, 1,
+                         &savevm_pt_handlers, pdev);
+
+    qemu_event_init(&migration_event, false);
+
+}
+
+static const TypeInfo vfio_sriov_type_info = {
+    .name = TYPE_VFIO_SRIOV,
+    .parent = "vfio-pci", 
+    .instance_init = vfio_sriov_instance_init,
+};
+
+static void sriov_register_types(void)
+{
+    type_register_static(&vfio_sriov_type_info);
+}
+type_init(sriov_register_types)
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [RFC PATCH V2 10/10] Qemu/VFIO: Misc change for enable migration with VFIO
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 13:35   ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e7583b5..404a5cd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3625,11 +3625,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3637,7 +3632,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->init = vfio_initfn;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 142+ messages in thread

* [Qemu-devel] [RFC PATCH V2 10/10] Qemu/VFIO: Misc change for enable migration with VFIO
@ 2015-11-24 13:35   ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-11-24 13:35 UTC (permalink / raw)
  To: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela
  Cc: Lan Tianyu

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 hw/vfio/pci.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e7583b5..404a5cd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3625,11 +3625,6 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static const VMStateDescription vfio_pci_vmstate = {
-    .name = "vfio-pci",
-    .unmigratable = 1,
-};
-
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3637,7 +3632,6 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     dc->props = vfio_pci_dev_properties;
-    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->init = vfio_initfn;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
  2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
@ 2015-11-24 21:03     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-11-24 21:03 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On Tue, Nov 24, 2015 at 09:35:26PM +0800, Lan Tianyu wrote:
> This patch is to add SRIOV VF migration support.
> Create new device type "vfio-sriov" and add faked PCI migration capability
> to the type device.
> 
> The purpose of the new capability
> 1) sync migration status with VF driver in the VM
> 2) Get mailbox irq vector to notify VF driver during migration.
> 3) Provide a way to control injecting irq or not.
> 
> Qemu will migrate PCI configure space regs and MSIX config for VF.
> Inject mailbox irq at last stage of migration to notify VF about
> migration event and wait VF driver ready for migration.

I think this last bit "wait VF driver ready for migration"
is wrong. Not a lot is gained as compared to hotunplug.

To really get a benefit from this feature migration should
succeed even if guest is stuck, then interrupt should
tell guest that it has to reset the driver.


> VF driver
> writeS PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
> to tell Qemu.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  hw/vfio/Makefile.objs |   2 +-
>  hw/vfio/pci.c         |   6 ++
>  hw/vfio/pci.h         |   4 ++
>  hw/vfio/sriov.c       | 178 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 189 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/sriov.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index d540c9d..9cf0178 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,6 +1,6 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
> -obj-$(CONFIG_PCI) += pci.o
> +obj-$(CONFIG_PCI) += pci.o sriov.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  endif
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 7c43fc1..e7583b5 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>          } else if (was_enabled && !is_enabled) {
>              vfio_disable_msix(vdev);
>          }
> +    } else if (vdev->migration_cap &&
> +        ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
> +        /* Write everything to QEMU to keep emulated bits correct */
> +        pci_default_write_config(pdev, addr, val, len);
> +        vfio_migration_cap_handle(pdev, addr, val, len);
>      } else {
>          /* Write everything to QEMU to keep emulated bits correct */
>          pci_default_write_config(pdev, addr, val, len);
> @@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
>      vfio_register_err_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn(vdev);
> +    vfio_add_migration_capability(vdev);
>  
>      return 0;
>  
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 6c00575..ee6ca5e 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
>      PCIHostDeviceAddress host;
>      EventNotifier err_notifier;
>      EventNotifier req_notifier;
> +    uint16_t    migration_cap;
>      int (*resetfn)(struct VFIOPCIDevice *);
>      uint32_t features;
>  #define VFIO_FEATURE_ENABLE_VGA_BIT 0
> @@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>  void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>                             uint32_t val, int len);
>  void vfio_enable_msix(VFIOPCIDevice *vdev);
> +void vfio_add_migration_capability(VFIOPCIDevice *vdev);
> +void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
> +                               uint32_t val, int len);
> diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
> new file mode 100644
> index 0000000..3109538
> --- /dev/null
> +++ b/hw/vfio/sriov.c
> @@ -0,0 +1,178 @@
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/io.h>
> +#include <sys/mman.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <glob.h>
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +
> +#include "hw/hw.h"
> +#include "hw/vfio/pci.h"
> +#include "hw/vfio/vfio.h"
> +#include "hw/vfio/vfio-common.h"
> +
> +#define TYPE_VFIO_SRIOV "vfio-sriov"
> +
> +#define SRIOV_LM_SETUP 0x01
> +#define SRIOV_LM_COMPLETE 0x02
> +
> +QemuEvent migration_event;
> +
> +static void vfio_dev_post_load(void *opaque)
> +{
> +    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +    MSIMessage msg;
> +    int vector;
> +
> +    if (vfio_pci_read_config(pdev,
> +            vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
> +            != PCI_VF_MIGRATION_ENABLE)
> +        return;
> +
> +    vector = vfio_pci_read_config(pdev,
> +        vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
> +
> +    msg = msix_get_message(pdev, vector);
> +    kvm_irqchip_send_msi(kvm_state, msg);
> +}
> +
> +static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +    int ret;
> +
> +    if(qemu_get_byte(f)!= SRIOV_LM_COMPLETE)
> +        return 0;
> +
> +    ret = pci_device_load(pdev, f);
> +    if (ret) {
> +        error_report("Faild to load PCI config space.\n");
> +        return ret;
> +    }
> +
> +    if (msix_enabled(pdev)) {
> +        vfio_enable_msix(vdev);
> +        msix_load(pdev, f);
> +    }
> +
> +    vfio_pci_write_config(pdev,vdev->migration_cap +
> +        PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_END, 1);
> +    vfio_pci_write_config(pdev,vdev->migration_cap +
> +        PCI_VF_MIGRATION_VF_STATUS, PCI_VF_WAIT_FOR_MIGRATION, 1);
> +    return 0;
> +}
> +
> +static int vfio_dev_save_complete(QEMUFile *f, void *opaque)
> +{
> +    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +
> +    qemu_put_byte(f, SRIOV_LM_COMPLETE);
> +    pci_device_save(pdev, f);
> +
> +    if (msix_enabled(pdev)) {
> +        msix_save(pdev, f);
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_dev_setup(QEMUFile *f, void *opaque)
> +{
> +    qemu_put_byte(f, SRIOV_LM_SETUP);
> +    return 0;
> +}
> +
> +static void vfio_dev_save_before_stop(QEMUFile *f, void *opaque)
> +{
> +    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +    int vector;
> +    MSIMessage msg;
> +
> +    vfio_pci_write_config(pdev, vdev->migration_cap +
> +        PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_START, 1);
> +
> +    if (vfio_pci_read_config(pdev,
> +            vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
> +            != PCI_VF_MIGRATION_ENABLE)
> +        return;
> +
> +    vector = vfio_pci_read_config(pdev,
> +        vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
> +
> +    qemu_event_reset(&migration_event);
> +
> +    msg = msix_get_message(pdev, vector);
> +    kvm_irqchip_send_msi(kvm_state, msg);
> +
> +    qemu_event_wait(&migration_event);

So this blocks QEMU, holding the QEMU lock, and
waits for qemu_event_set below.

> +}
> +
> +static SaveVMHandlers savevm_pt_handlers = {
> +    .save_live_setup = vfio_dev_setup,
> +    .save_live_complete = vfio_dev_save_complete,
> +    .save_before_stop = vfio_dev_save_before_stop,          
> +    .load_state = vfio_dev_load,
> +    .post_load_state = vfio_dev_post_load,
> +};
> +
> +void vfio_add_migration_capability(VFIOPCIDevice *vdev)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    int free_pos;
> +
> +    if (strcmp(object_get_typename(OBJECT(vdev)), TYPE_VFIO_SRIOV))
> +        return;
> +
> +    free_pos = vfio_find_free_cfg_reg(vdev,
> +                pdev->config[PCI_CAPABILITY_LIST],
> +                PCI_VF_MIGRATION_CAP_SIZE);
> +    if (free_pos) {
> +        vdev->migration_cap = free_pos;
> +    	pci_add_capability(pdev, PCI_CAP_ID_MIGRATION,
> +                        free_pos, PCI_VF_MIGRATION_CAP_SIZE);
> +    	memset(vdev->emulated_config_bits + free_pos, 0xff,
> +                        PCI_VF_MIGRATION_CAP_SIZE);
> +    	memset(vdev->pdev.wmask + free_pos, 0xff,
> +                        PCI_VF_MIGRATION_CAP_SIZE);
> +     } else
> +        error_report("vfio: Fail to find free PCI config space regs.\n");
> +}
> +
> +void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
> +                                  uint32_t val, int len)
> +{
> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +
> +    if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
> +        && val == PCI_VF_READY_FOR_MIGRATION) {       
> +        qemu_event_set(&migration_event);

This would wake migration so it can proceed -
except it needs QEMU lock to run, and that's
taken by the migration thread.

It seems unlikely that this ever worked - how
did you test this?

> +    }
> +}
> +
> +static void vfio_sriov_instance_init(Object *obj)
> +{
> +    PCIDevice *pdev = PCI_DEVICE(obj);
> +
> +    register_savevm_live(NULL, "vfio-sriov", 1, 1,
> +                         &savevm_pt_handlers, pdev);
> +
> +    qemu_event_init(&migration_event, false);
> +
> +}
> +
> +static const TypeInfo vfio_sriov_type_info = {
> +    .name = TYPE_VFIO_SRIOV,
> +    .parent = "vfio-pci", 
> +    .instance_init = vfio_sriov_instance_init,
> +};
> +
> +static void sriov_register_types(void)
> +{
> +    type_register_static(&vfio_sriov_type_info);
> +}
> +type_init(sriov_register_types)
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
@ 2015-11-24 21:03     ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-11-24 21:03 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On Tue, Nov 24, 2015 at 09:35:26PM +0800, Lan Tianyu wrote:
> This patch is to add SRIOV VF migration support.
> Create new device type "vfio-sriov" and add faked PCI migration capability
> to the type device.
> 
> The purpose of the new capability
> 1) sync migration status with VF driver in the VM
> 2) Get mailbox irq vector to notify VF driver during migration.
> 3) Provide a way to control injecting irq or not.
> 
> Qemu will migrate PCI configure space regs and MSIX config for VF.
> Inject mailbox irq at last stage of migration to notify VF about
> migration event and wait VF driver ready for migration.

I think this last bit "wait VF driver ready for migration"
is wrong. Not a lot is gained as compared to hotunplug.

To really get a benefit from this feature migration should
succeed even if guest is stuck, then interrupt should
tell guest that it has to reset the driver.


> VF driver
> writeS PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
> to tell Qemu.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  hw/vfio/Makefile.objs |   2 +-
>  hw/vfio/pci.c         |   6 ++
>  hw/vfio/pci.h         |   4 ++
>  hw/vfio/sriov.c       | 178 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 189 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/sriov.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index d540c9d..9cf0178 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -1,6 +1,6 @@
>  ifeq ($(CONFIG_LINUX), y)
>  obj-$(CONFIG_SOFTMMU) += common.o
> -obj-$(CONFIG_PCI) += pci.o
> +obj-$(CONFIG_PCI) += pci.o sriov.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  endif
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 7c43fc1..e7583b5 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2013,6 +2013,11 @@ void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>          } else if (was_enabled && !is_enabled) {
>              vfio_disable_msix(vdev);
>          }
> +    } else if (vdev->migration_cap &&
> +        ranges_overlap(addr, len, vdev->migration_cap, 0x10)) {
> +        /* Write everything to QEMU to keep emulated bits correct */
> +        pci_default_write_config(pdev, addr, val, len);
> +        vfio_migration_cap_handle(pdev, addr, val, len);
>      } else {
>          /* Write everything to QEMU to keep emulated bits correct */
>          pci_default_write_config(pdev, addr, val, len);
> @@ -3517,6 +3522,7 @@ static int vfio_initfn(PCIDevice *pdev)
>      vfio_register_err_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn(vdev);
> +    vfio_add_migration_capability(vdev);
>  
>      return 0;
>  
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 6c00575..ee6ca5e 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
>      PCIHostDeviceAddress host;
>      EventNotifier err_notifier;
>      EventNotifier req_notifier;
> +    uint16_t    migration_cap;
>      int (*resetfn)(struct VFIOPCIDevice *);
>      uint32_t features;
>  #define VFIO_FEATURE_ENABLE_VGA_BIT 0
> @@ -162,3 +163,6 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>  void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>                             uint32_t val, int len);
>  void vfio_enable_msix(VFIOPCIDevice *vdev);
> +void vfio_add_migration_capability(VFIOPCIDevice *vdev);
> +void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
> +                               uint32_t val, int len);
> diff --git a/hw/vfio/sriov.c b/hw/vfio/sriov.c
> new file mode 100644
> index 0000000..3109538
> --- /dev/null
> +++ b/hw/vfio/sriov.c
> @@ -0,0 +1,178 @@
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/io.h>
> +#include <sys/mman.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <glob.h>
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +
> +#include "hw/hw.h"
> +#include "hw/vfio/pci.h"
> +#include "hw/vfio/vfio.h"
> +#include "hw/vfio/vfio-common.h"
> +
> +#define TYPE_VFIO_SRIOV "vfio-sriov"
> +
> +#define SRIOV_LM_SETUP 0x01
> +#define SRIOV_LM_COMPLETE 0x02
> +
> +QemuEvent migration_event;
> +
> +static void vfio_dev_post_load(void *opaque)
> +{
> +    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +    MSIMessage msg;
> +    int vector;
> +
> +    if (vfio_pci_read_config(pdev,
> +            vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
> +            != PCI_VF_MIGRATION_ENABLE)
> +        return;
> +
> +    vector = vfio_pci_read_config(pdev,
> +        vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
> +
> +    msg = msix_get_message(pdev, vector);
> +    kvm_irqchip_send_msi(kvm_state, msg);
> +}
> +
> +static int vfio_dev_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +    int ret;
> +
> +    if(qemu_get_byte(f)!= SRIOV_LM_COMPLETE)
> +        return 0;
> +
> +    ret = pci_device_load(pdev, f);
> +    if (ret) {
> +        error_report("Faild to load PCI config space.\n");
> +        return ret;
> +    }
> +
> +    if (msix_enabled(pdev)) {
> +        vfio_enable_msix(vdev);
> +        msix_load(pdev, f);
> +    }
> +
> +    vfio_pci_write_config(pdev,vdev->migration_cap +
> +        PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_END, 1);
> +    vfio_pci_write_config(pdev,vdev->migration_cap +
> +        PCI_VF_MIGRATION_VF_STATUS, PCI_VF_WAIT_FOR_MIGRATION, 1);
> +    return 0;
> +}
> +
> +static int vfio_dev_save_complete(QEMUFile *f, void *opaque)
> +{
> +    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +
> +    qemu_put_byte(f, SRIOV_LM_COMPLETE);
> +    pci_device_save(pdev, f);
> +
> +    if (msix_enabled(pdev)) {
> +        msix_save(pdev, f);
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_dev_setup(QEMUFile *f, void *opaque)
> +{
> +    qemu_put_byte(f, SRIOV_LM_SETUP);
> +    return 0;
> +}
> +
> +static void vfio_dev_save_before_stop(QEMUFile *f, void *opaque)
> +{
> +    struct PCIDevice *pdev = (struct PCIDevice *)opaque;
> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +    int vector;
> +    MSIMessage msg;
> +
> +    vfio_pci_write_config(pdev, vdev->migration_cap +
> +        PCI_VF_MIGRATION_VMM_STATUS, VMM_MIGRATION_START, 1);
> +
> +    if (vfio_pci_read_config(pdev,
> +            vdev->migration_cap + PCI_VF_MIGRATION_CAP, 1)
> +            != PCI_VF_MIGRATION_ENABLE)
> +        return;
> +
> +    vector = vfio_pci_read_config(pdev,
> +        vdev->migration_cap + PCI_VF_MIGRATION_IRQ, 1);
> +
> +    qemu_event_reset(&migration_event);
> +
> +    msg = msix_get_message(pdev, vector);
> +    kvm_irqchip_send_msi(kvm_state, msg);
> +
> +    qemu_event_wait(&migration_event);

So this blocks QEMU, holding the QEMU lock, and
waits for qemu_event_set below.

> +}
> +
> +static SaveVMHandlers savevm_pt_handlers = {
> +    .save_live_setup = vfio_dev_setup,
> +    .save_live_complete = vfio_dev_save_complete,
> +    .save_before_stop = vfio_dev_save_before_stop,          
> +    .load_state = vfio_dev_load,
> +    .post_load_state = vfio_dev_post_load,
> +};
> +
> +void vfio_add_migration_capability(VFIOPCIDevice *vdev)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    int free_pos;
> +
> +    if (strcmp(object_get_typename(OBJECT(vdev)), TYPE_VFIO_SRIOV))
> +        return;
> +
> +    free_pos = vfio_find_free_cfg_reg(vdev,
> +                pdev->config[PCI_CAPABILITY_LIST],
> +                PCI_VF_MIGRATION_CAP_SIZE);
> +    if (free_pos) {
> +        vdev->migration_cap = free_pos;
> +    	pci_add_capability(pdev, PCI_CAP_ID_MIGRATION,
> +                        free_pos, PCI_VF_MIGRATION_CAP_SIZE);
> +    	memset(vdev->emulated_config_bits + free_pos, 0xff,
> +                        PCI_VF_MIGRATION_CAP_SIZE);
> +    	memset(vdev->pdev.wmask + free_pos, 0xff,
> +                        PCI_VF_MIGRATION_CAP_SIZE);
> +     } else
> +        error_report("vfio: Fail to find free PCI config space regs.\n");
> +}
> +
> +void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
> +                                  uint32_t val, int len)
> +{
> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +
> +    if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
> +        && val == PCI_VF_READY_FOR_MIGRATION) {       
> +        qemu_event_set(&migration_event);

This would wake migration so it can proceed -
except it needs QEMU lock to run, and that's
taken by the migration thread.

It seems unlikely that this ever worked - how
did you test this?

> +    }
> +}
> +
> +static void vfio_sriov_instance_init(Object *obj)
> +{
> +    PCIDevice *pdev = PCI_DEVICE(obj);
> +
> +    register_savevm_live(NULL, "vfio-sriov", 1, 1,
> +                         &savevm_pt_handlers, pdev);
> +
> +    qemu_event_init(&migration_event, false);
> +
> +}
> +
> +static const TypeInfo vfio_sriov_type_info = {
> +    .name = TYPE_VFIO_SRIOV,
> +    .parent = "vfio-pci", 
> +    .instance_init = vfio_sriov_instance_init,
> +};
> +
> +static void sriov_register_types(void)
> +{
> +    type_register_static(&vfio_sriov_type_info);
> +}
> +type_init(sriov_register_types)
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
  2015-11-24 21:03     ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-11-25 15:32       ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-11-25 15:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela


On 11/25/2015 5:03 AM, Michael S. Tsirkin wrote:
>> >+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
>> >+                                  uint32_t val, int len)
>> >+{
>> >+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>> >+
>> >+    if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
>> >+        && val == PCI_VF_READY_FOR_MIGRATION) {
>> >+        qemu_event_set(&migration_event);
> This would wake migration so it can proceed -
> except it needs QEMU lock to run, and that's
> taken by the migration thread.

Sorry, I seem to miss something.
Which lock may cause dead lock when calling vfio_migration_cap_handle()
and run migration?
The function is called when VF accesses faked PCI capability.

>
> It seems unlikely that this ever worked - how
> did you test this?
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
@ 2015-11-25 15:32       ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-11-25 15:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or


On 11/25/2015 5:03 AM, Michael S. Tsirkin wrote:
>> >+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
>> >+                                  uint32_t val, int len)
>> >+{
>> >+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>> >+
>> >+    if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
>> >+        && val == PCI_VF_READY_FOR_MIGRATION) {
>> >+        qemu_event_set(&migration_event);
> This would wake migration so it can proceed -
> except it needs QEMU lock to run, and that's
> taken by the migration thread.

Sorry, I seem to miss something.
Which lock may cause dead lock when calling vfio_migration_cap_handle()
and run migration?
The function is called when VF accesses faked PCI capability.

>
> It seems unlikely that this ever worked - how
> did you test this?
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
  2015-11-25 15:32       ` [Qemu-devel] " Lan, Tianyu
@ 2015-11-25 15:44         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-11-25 15:44 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On Wed, Nov 25, 2015 at 11:32:23PM +0800, Lan, Tianyu wrote:
> 
> On 11/25/2015 5:03 AM, Michael S. Tsirkin wrote:
> >>>+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
> >>>+                                  uint32_t val, int len)
> >>>+{
> >>>+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> >>>+
> >>>+    if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
> >>>+        && val == PCI_VF_READY_FOR_MIGRATION) {
> >>>+        qemu_event_set(&migration_event);
> >This would wake migration so it can proceed -
> >except it needs QEMU lock to run, and that's
> >taken by the migration thread.
> 
> Sorry, I seem to miss something.
> Which lock may cause dead lock when calling vfio_migration_cap_handle()
> and run migration?

qemu_global_mutex.

> The function is called when VF accesses faked PCI capability.
> 
> >
> >It seems unlikely that this ever worked - how
> >did you test this?
> >

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
@ 2015-11-25 15:44         ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-11-25 15:44 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On Wed, Nov 25, 2015 at 11:32:23PM +0800, Lan, Tianyu wrote:
> 
> On 11/25/2015 5:03 AM, Michael S. Tsirkin wrote:
> >>>+void vfio_migration_cap_handle(PCIDevice *pdev, uint32_t addr,
> >>>+                                  uint32_t val, int len)
> >>>+{
> >>>+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> >>>+
> >>>+    if (addr == vdev->migration_cap + PCI_VF_MIGRATION_VF_STATUS
> >>>+        && val == PCI_VF_READY_FOR_MIGRATION) {
> >>>+        qemu_event_set(&migration_event);
> >This would wake migration so it can proceed -
> >except it needs QEMU lock to run, and that's
> >taken by the migration thread.
> 
> Sorry, I seem to miss something.
> Which lock may cause dead lock when calling vfio_migration_cap_handle()
> and run migration?

qemu_global_mutex.

> The function is called when VF accesses faked PCI capability.
> 
> >
> >It seems unlikely that this ever worked - how
> >did you test this?
> >

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-11-30  8:01   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-11-30  8:01 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
> This patchset is to propose a solution of adding live migration
> support for SRIOV NIC.
> 
> During migration, Qemu needs to let VF driver in the VM to know
> migration start and end. Qemu adds faked PCI migration capability
> to help to sync status between two sides during migration.
> 
> Qemu triggers VF's mailbox irq via sending MSIX msg when migration
> status is changed. VF driver tells Qemu its mailbox vector index
> via the new PCI capability. In some cases(NIC is suspended or closed),
> VF mailbox irq is freed and VF driver can disable irq injecting via
> new capability.   
> 
> VF driver will put down nic before migration and put up again on
> the target machine.

It is still not very clear what it is you are trying to achieve, and
whether your patchset achieves it.  You merely say "adding live
migration" but it seems pretty clear this isn't about being able to
migrate a guest transparently, since you are adding a host/guest
handshake.

This isn't about functionality either: I think that on KVM, it isn't
hard to live migrate if you can do a host/guest handshake, even today,
with no kernel changes:
1. before migration, expose a pv nic to guest (can be done directly on
  boot)
2. use e.g. a serial connection to move IP from an assigned device to pv nic
3. maybe move the mac as well
4. eject the assigned device
5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
   happens) and start migration

Is this patchset a performance optimization then?
If yes it needs to be accompanied with some performance numbers.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-11-30  8:01   ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-11-30  8:01 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
> This patchset is to propose a solution of adding live migration
> support for SRIOV NIC.
> 
> During migration, Qemu needs to let VF driver in the VM to know
> migration start and end. Qemu adds faked PCI migration capability
> to help to sync status between two sides during migration.
> 
> Qemu triggers VF's mailbox irq via sending MSIX msg when migration
> status is changed. VF driver tells Qemu its mailbox vector index
> via the new PCI capability. In some cases(NIC is suspended or closed),
> VF mailbox irq is freed and VF driver can disable irq injecting via
> new capability.   
> 
> VF driver will put down nic before migration and put up again on
> the target machine.

It is still not very clear what it is you are trying to achieve, and
whether your patchset achieves it.  You merely say "adding live
migration" but it seems pretty clear this isn't about being able to
migrate a guest transparently, since you are adding a host/guest
handshake.

This isn't about functionality either: I think that on KVM, it isn't
hard to live migrate if you can do a host/guest handshake, even today,
with no kernel changes:
1. before migration, expose a pv nic to guest (can be done directly on
  boot)
2. use e.g. a serial connection to move IP from an assigned device to pv nic
3. maybe move the mac as well
4. eject the assigned device
5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
   happens) and start migration

Is this patchset a performance optimization then?
If yes it needs to be accompanied with some performance numbers.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-11-30  8:01   ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-01  6:26     ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-01  6:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote:
> It is still not very clear what it is you are trying to achieve, and
> whether your patchset achieves it.  You merely say "adding live
> migration" but it seems pretty clear this isn't about being able to
> migrate a guest transparently, since you are adding a host/guest
> handshake.
>
> This isn't about functionality either: I think that on KVM, it isn't
> hard to live migrate if you can do a host/guest handshake, even today,
> with no kernel changes:
> 1. before migration, expose a pv nic to guest (can be done directly on
>    boot)
> 2. use e.g. a serial connection to move IP from an assigned device to pv nic
> 3. maybe move the mac as well
> 4. eject the assigned device
> 5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
>     happens) and start migration
>

This looks like the bonding driver solution which put pv nic and VF
in one bonded interface under active-backup mode. The bonding driver
will switch from VF to PV nic automatically when VF is unplugged during
migration. This is the only available solution for VF NIC migration. But
it requires guest OS to do specific configurations inside and rely on
bonding driver which blocks it work on Windows. From performance side,
putting VF and virtio NIC under bonded interface will affect their
performance even when not do migration. These factors block to use VF
NIC passthough in some user cases(Especially in the cloud) which require
migration.

Current solution we proposed changes NIC driver and Qemu. Guest Os
doesn't need to do special thing for migration. It's easy to deploy and
all changes are in the NIC driver, NIC vendor can implement migration
support just in the their driver.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-12-01  6:26     ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-01  6:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote:
> It is still not very clear what it is you are trying to achieve, and
> whether your patchset achieves it.  You merely say "adding live
> migration" but it seems pretty clear this isn't about being able to
> migrate a guest transparently, since you are adding a host/guest
> handshake.
>
> This isn't about functionality either: I think that on KVM, it isn't
> hard to live migrate if you can do a host/guest handshake, even today,
> with no kernel changes:
> 1. before migration, expose a pv nic to guest (can be done directly on
>    boot)
> 2. use e.g. a serial connection to move IP from an assigned device to pv nic
> 3. maybe move the mac as well
> 4. eject the assigned device
> 5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
>     happens) and start migration
>

This looks like the bonding driver solution which put pv nic and VF
in one bonded interface under active-backup mode. The bonding driver
will switch from VF to PV nic automatically when VF is unplugged during
migration. This is the only available solution for VF NIC migration. But
it requires guest OS to do specific configurations inside and rely on
bonding driver which blocks it work on Windows. From performance side,
putting VF and virtio NIC under bonded interface will affect their
performance even when not do migration. These factors block to use VF
NIC passthough in some user cases(Especially in the cloud) which require
migration.

Current solution we proposed changes NIC driver and Qemu. Guest Os
doesn't need to do special thing for migration. It's easy to deploy and
all changes are in the NIC driver, NIC vendor can implement migration
support just in the their driver.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-12-01  6:26     ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-01 15:02       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-01 15:02 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On Tue, Dec 01, 2015 at 02:26:57PM +0800, Lan, Tianyu wrote:
> 
> 
> On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote:
> >It is still not very clear what it is you are trying to achieve, and
> >whether your patchset achieves it.  You merely say "adding live
> >migration" but it seems pretty clear this isn't about being able to
> >migrate a guest transparently, since you are adding a host/guest
> >handshake.
> >
> >This isn't about functionality either: I think that on KVM, it isn't
> >hard to live migrate if you can do a host/guest handshake, even today,
> >with no kernel changes:
> >1. before migration, expose a pv nic to guest (can be done directly on
> >   boot)
> >2. use e.g. a serial connection to move IP from an assigned device to pv nic
> >3. maybe move the mac as well
> >4. eject the assigned device
> >5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
> >    happens) and start migration
> >
> 
> This looks like the bonding driver solution

Why does it? Unlike bonding, this doesn't touch data path or
any kernel code. Just run a script from guest agent.

> which put pv nic and VF
> in one bonded interface under active-backup mode. The bonding driver
> will switch from VF to PV nic automatically when VF is unplugged during
> migration. This is the only available solution for VF NIC migration.

It really isn't. For one, there is also teaming.

> But
> it requires guest OS to do specific configurations inside and rely on
> bonding driver which blocks it work on Windows.
> From performance side,
> putting VF and virtio NIC under bonded interface will affect their
> performance even when not do migration. These factors block to use VF
> NIC passthough in some user cases(Especially in the cloud) which require
> migration.

That's really up to guest. You don't need to do bonding,
you can just move the IP and mac from userspace, that's
possible on most OS-es.

Or write something in guest kernel that is more lightweight if you are
so inclined. What we are discussing here is the host-guest interface,
not the in-guest interface.

> Current solution we proposed changes NIC driver and Qemu. Guest Os
> doesn't need to do special thing for migration.
> It's easy to deploy


Except of course these patches don't even work properly yet.

And when they do, even minor changes in host side NIC hardware across
migration will break guests in hard to predict ways.

> and
> all changes are in the NIC driver, NIC vendor can implement migration
> support just in the their driver.

Kernel code and hypervisor code is not easier to develop and deploy than
a userspace script.  If that is all the motivation there is, that's a
pretty small return on investment.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-12-01 15:02       ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-01 15:02 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On Tue, Dec 01, 2015 at 02:26:57PM +0800, Lan, Tianyu wrote:
> 
> 
> On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote:
> >It is still not very clear what it is you are trying to achieve, and
> >whether your patchset achieves it.  You merely say "adding live
> >migration" but it seems pretty clear this isn't about being able to
> >migrate a guest transparently, since you are adding a host/guest
> >handshake.
> >
> >This isn't about functionality either: I think that on KVM, it isn't
> >hard to live migrate if you can do a host/guest handshake, even today,
> >with no kernel changes:
> >1. before migration, expose a pv nic to guest (can be done directly on
> >   boot)
> >2. use e.g. a serial connection to move IP from an assigned device to pv nic
> >3. maybe move the mac as well
> >4. eject the assigned device
> >5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
> >    happens) and start migration
> >
> 
> This looks like the bonding driver solution

Why does it? Unlike bonding, this doesn't touch data path or
any kernel code. Just run a script from guest agent.

> which put pv nic and VF
> in one bonded interface under active-backup mode. The bonding driver
> will switch from VF to PV nic automatically when VF is unplugged during
> migration. This is the only available solution for VF NIC migration.

It really isn't. For one, there is also teaming.

> But
> it requires guest OS to do specific configurations inside and rely on
> bonding driver which blocks it work on Windows.
> From performance side,
> putting VF and virtio NIC under bonded interface will affect their
> performance even when not do migration. These factors block to use VF
> NIC passthough in some user cases(Especially in the cloud) which require
> migration.

That's really up to guest. You don't need to do bonding,
you can just move the IP and mac from userspace, that's
possible on most OS-es.

Or write something in guest kernel that is more lightweight if you are
so inclined. What we are discussing here is the host-guest interface,
not the in-guest interface.

> Current solution we proposed changes NIC driver and Qemu. Guest Os
> doesn't need to do special thing for migration.
> It's easy to deploy


Except of course these patches don't even work properly yet.

And when they do, even minor changes in host side NIC hardware across
migration will break guests in hard to predict ways.

> and
> all changes are in the NIC driver, NIC vendor can implement migration
> support just in the their driver.

Kernel code and hypervisor code is not easier to develop and deploy than
a userspace script.  If that is all the motivation there is, that's a
pretty small return on investment.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-12-01 15:02       ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-02 14:08         ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-02 14:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:
>> But
>> it requires guest OS to do specific configurations inside and rely on
>> bonding driver which blocks it work on Windows.
>>  From performance side,
>> putting VF and virtio NIC under bonded interface will affect their
>> performance even when not do migration. These factors block to use VF
>> NIC passthough in some user cases(Especially in the cloud) which require
>> migration.
>
> That's really up to guest. You don't need to do bonding,
> you can just move the IP and mac from userspace, that's
> possible on most OS-es.
>
> Or write something in guest kernel that is more lightweight if you are
> so inclined. What we are discussing here is the host-guest interface,
> not the in-guest interface.
>
>> Current solution we proposed changes NIC driver and Qemu. Guest Os
>> doesn't need to do special thing for migration.
>> It's easy to deploy
>
>
> Except of course these patches don't even work properly yet.
>
> And when they do, even minor changes in host side NIC hardware across
> migration will break guests in hard to predict ways.

Switching between PV and VF NIC will introduce network stop and the
latency of hotplug VF is measurable. For some user cases(cloud service
and OPNFV) which are sensitive to network stabilization and performance,
these are not friend and blocks SRIOV NIC usage in these case. We hope
to find a better way to make SRIOV NIC work in these cases and this is
worth to do since SRIOV NIC provides better network performance compared
with PV NIC. Current patches have some issues. I think we can find
solution for them andimprove them step by step.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-12-02 14:08         ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-02 14:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:
>> But
>> it requires guest OS to do specific configurations inside and rely on
>> bonding driver which blocks it work on Windows.
>>  From performance side,
>> putting VF and virtio NIC under bonded interface will affect their
>> performance even when not do migration. These factors block to use VF
>> NIC passthough in some user cases(Especially in the cloud) which require
>> migration.
>
> That's really up to guest. You don't need to do bonding,
> you can just move the IP and mac from userspace, that's
> possible on most OS-es.
>
> Or write something in guest kernel that is more lightweight if you are
> so inclined. What we are discussing here is the host-guest interface,
> not the in-guest interface.
>
>> Current solution we proposed changes NIC driver and Qemu. Guest Os
>> doesn't need to do special thing for migration.
>> It's easy to deploy
>
>
> Except of course these patches don't even work properly yet.
>
> And when they do, even minor changes in host side NIC hardware across
> migration will break guests in hard to predict ways.

Switching between PV and VF NIC will introduce network stop and the
latency of hotplug VF is measurable. For some user cases(cloud service
and OPNFV) which are sensitive to network stabilization and performance,
these are not friend and blocks SRIOV NIC usage in these case. We hope
to find a better way to make SRIOV NIC work in these cases and this is
worth to do since SRIOV NIC provides better network performance compared
with PV NIC. Current patches have some issues. I think we can find
solution for them andimprove them step by step.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-12-02 14:08         ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-02 14:31           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-02 14:31 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On Wed, Dec 02, 2015 at 10:08:25PM +0800, Lan, Tianyu wrote:
> On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:
> >>But
> >>it requires guest OS to do specific configurations inside and rely on
> >>bonding driver which blocks it work on Windows.
> >> From performance side,
> >>putting VF and virtio NIC under bonded interface will affect their
> >>performance even when not do migration. These factors block to use VF
> >>NIC passthough in some user cases(Especially in the cloud) which require
> >>migration.
> >
> >That's really up to guest. You don't need to do bonding,
> >you can just move the IP and mac from userspace, that's
> >possible on most OS-es.
> >
> >Or write something in guest kernel that is more lightweight if you are
> >so inclined. What we are discussing here is the host-guest interface,
> >not the in-guest interface.
> >
> >>Current solution we proposed changes NIC driver and Qemu. Guest Os
> >>doesn't need to do special thing for migration.
> >>It's easy to deploy
> >
> >
> >Except of course these patches don't even work properly yet.
> >
> >And when they do, even minor changes in host side NIC hardware across
> >migration will break guests in hard to predict ways.
> 
> Switching between PV and VF NIC will introduce network stop and the
> latency of hotplug VF is measurable.
> For some user cases(cloud service
> and OPNFV) which are sensitive to network stabilization and performance,
> these are not friend and blocks SRIOV NIC usage in these case.

I find this hard to credit. hotplug is not normally a data path
operation.

> We hope
> to find a better way to make SRIOV NIC work in these cases and this is
> worth to do since SRIOV NIC provides better network performance compared
> with PV NIC.

If this is a performance optimization as the above implies,
you need to include some numbers, and document how did
you implement the switch and how did you measure the performance.

> Current patches have some issues. I think we can find
> solution for them andimprove them step by step.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-12-02 14:31           ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-02 14:31 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On Wed, Dec 02, 2015 at 10:08:25PM +0800, Lan, Tianyu wrote:
> On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:
> >>But
> >>it requires guest OS to do specific configurations inside and rely on
> >>bonding driver which blocks it work on Windows.
> >> From performance side,
> >>putting VF and virtio NIC under bonded interface will affect their
> >>performance even when not do migration. These factors block to use VF
> >>NIC passthough in some user cases(Especially in the cloud) which require
> >>migration.
> >
> >That's really up to guest. You don't need to do bonding,
> >you can just move the IP and mac from userspace, that's
> >possible on most OS-es.
> >
> >Or write something in guest kernel that is more lightweight if you are
> >so inclined. What we are discussing here is the host-guest interface,
> >not the in-guest interface.
> >
> >>Current solution we proposed changes NIC driver and Qemu. Guest Os
> >>doesn't need to do special thing for migration.
> >>It's easy to deploy
> >
> >
> >Except of course these patches don't even work properly yet.
> >
> >And when they do, even minor changes in host side NIC hardware across
> >migration will break guests in hard to predict ways.
> 
> Switching between PV and VF NIC will introduce network stop and the
> latency of hotplug VF is measurable.
> For some user cases(cloud service
> and OPNFV) which are sensitive to network stabilization and performance,
> these are not friend and blocks SRIOV NIC usage in these case.

I find this hard to credit. hotplug is not normally a data path
operation.

> We hope
> to find a better way to make SRIOV NIC work in these cases and this is
> worth to do since SRIOV NIC provides better network performance compared
> with PV NIC.

If this is a performance optimization as the above implies,
you need to include some numbers, and document how did
you implement the switch and how did you measure the performance.

> Current patches have some issues. I think we can find
> solution for them andimprove them step by step.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
@ 2015-12-02 22:25     ` Alex Williamson
  -1 siblings, 0 replies; 142+ messages in thread
From: Alex Williamson @ 2015-12-02 22:25 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: aik, amit.shah, anthony, ard.biesheuvel, blauwirbel,
	cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, gerlitz.or, donald.c.skidmore,
	mark.d.rustad, mst, kraxel, lcapitulino, quintela

On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  linux-headers/linux/vfio.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 0508d0b..732b0bd 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -495,6 +495,22 @@ struct vfio_eeh_pe_op {
>  
>  #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
>  
> +
> +#define VFIO_FIND_FREE_PCI_CONFIG_REG   _IO(VFIO_TYPE, VFIO_BASE + 22)
> +
> +#define VFIO_GET_PCI_CAP_INFO   _IO(VFIO_TYPE, VFIO_BASE + 22)
> +
> +struct vfio_pci_cap_info {
> +    __u32 argsz;
> +    __u32 flags;
> +#define VFIO_PCI_CAP_GET_SIZE (1 << 0)
> +#define VFIO_PCI_CAP_GET_FREE_REGION (1 << 1)
> +    __u32 index;
> +    __u32 offset;
> +    __u32 size;
> +    __u8 cap;
> +};
> +
>  /* ***************************************************************** */
>  
>  #endif /* VFIO_H */

I didn't seen a matching kernel patch series for this, but why is the
kernel more capable of doing this than userspace is already?  These seem
like pointless ioctls, we're creating a purely virtual PCI capability,
the kernel doesn't really need to participate in that.  Also, why are we
restricting ourselves to standard capabilities?  That's often a crowded
space and we can't always know whether an area is free or not based only
on it being covered by a capability.  Some capabilities can also appear
more than once, so there's context that isn't being passed to the kernel
here.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
@ 2015-12-02 22:25     ` Alex Williamson
  0 siblings, 0 replies; 142+ messages in thread
From: Alex Williamson @ 2015-12-02 22:25 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: emil.s.tantilov, kvm, ard.biesheuvel, aik, donald.c.skidmore,
	mst, eddie.dong, agraf, quintela, qemu-devel, blauwirbel,
	cornelia.huck, nrupal.jani, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or

On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  linux-headers/linux/vfio.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 0508d0b..732b0bd 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -495,6 +495,22 @@ struct vfio_eeh_pe_op {
>  
>  #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
>  
> +
> +#define VFIO_FIND_FREE_PCI_CONFIG_REG   _IO(VFIO_TYPE, VFIO_BASE + 22)
> +
> +#define VFIO_GET_PCI_CAP_INFO   _IO(VFIO_TYPE, VFIO_BASE + 22)
> +
> +struct vfio_pci_cap_info {
> +    __u32 argsz;
> +    __u32 flags;
> +#define VFIO_PCI_CAP_GET_SIZE (1 << 0)
> +#define VFIO_PCI_CAP_GET_FREE_REGION (1 << 1)
> +    __u32 index;
> +    __u32 offset;
> +    __u32 size;
> +    __u8 cap;
> +};
> +
>  /* ***************************************************************** */
>  
>  #endif /* VFIO_H */

I didn't seen a matching kernel patch series for this, but why is the
kernel more capable of doing this than userspace is already?  These seem
like pointless ioctls, we're creating a purely virtual PCI capability,
the kernel doesn't really need to participate in that.  Also, why are we
restricting ourselves to standard capabilities?  That's often a crowded
space and we can't always know whether an area is free or not based only
on it being covered by a capability.  Some capabilities can also appear
more than once, so there's context that isn't being passed to the kernel
here.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
  2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
@ 2015-12-02 22:25     ` Alex Williamson
  -1 siblings, 0 replies; 142+ messages in thread
From: Alex Williamson @ 2015-12-02 22:25 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: aik, amit.shah, anthony, ard.biesheuvel, blauwirbel,
	cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, gerlitz.or, donald.c.skidmore,
	mark.d.rustad, mst, kraxel, lcapitulino, quintela

On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
> This patch is to add SRIOV VF migration support.
> Create new device type "vfio-sriov" and add faked PCI migration capability
> to the type device.
> 
> The purpose of the new capability
> 1) sync migration status with VF driver in the VM
> 2) Get mailbox irq vector to notify VF driver during migration.
> 3) Provide a way to control injecting irq or not.
> 
> Qemu will migrate PCI configure space regs and MSIX config for VF.
> Inject mailbox irq at last stage of migration to notify VF about
> migration event and wait VF driver ready for migration. VF driver
> writeS PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
> to tell Qemu.

What makes this sr-iov specific?  Why wouldn't we simply extend vfio-pci
with a migration=on feature?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
@ 2015-12-02 22:25     ` Alex Williamson
  0 siblings, 0 replies; 142+ messages in thread
From: Alex Williamson @ 2015-12-02 22:25 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: emil.s.tantilov, kvm, ard.biesheuvel, aik, donald.c.skidmore,
	mst, eddie.dong, agraf, quintela, qemu-devel, blauwirbel,
	cornelia.huck, nrupal.jani, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or

On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
> This patch is to add SRIOV VF migration support.
> Create new device type "vfio-sriov" and add faked PCI migration capability
> to the type device.
> 
> The purpose of the new capability
> 1) sync migration status with VF driver in the VM
> 2) Get mailbox irq vector to notify VF driver during migration.
> 3) Provide a way to control injecting irq or not.
> 
> Qemu will migrate PCI configure space regs and MSIX config for VF.
> Inject mailbox irq at last stage of migration to notify VF about
> migration event and wait VF driver ready for migration. VF driver
> writeS PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
> to tell Qemu.

What makes this sr-iov specific?  Why wouldn't we simply extend vfio-pci
with a migration=on feature?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability
  2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
@ 2015-12-02 22:25     ` Alex Williamson
  -1 siblings, 0 replies; 142+ messages in thread
From: Alex Williamson @ 2015-12-02 22:25 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: aik, amit.shah, anthony, ard.biesheuvel, blauwirbel,
	cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, gerlitz.or, donald.c.skidmore,
	mark.d.rustad, mst, kraxel, lcapitulino, quintela

On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
> This patch is to extend PCI CAP id for migration cap and
> add reg macros. The CAP ID is trial and we may find better one if the
> solution is feasible.
> 
> *PCI_VF_MIGRATION_CAP
> For VF driver to  control that triggers mailbox irq or not during migration.
> 
> *PCI_VF_MIGRATION_VMM_STATUS
> Qemu stores migration status in the reg
> 
> *PCI_VF_MIGRATION_VF_STATUS
> VF driver tells Qemu ready for migration
> 
> *PCI_VF_MIGRATION_IRQ
> VF driver stores mailbox interrupt vector in the reg for Qemu to trigger during migration.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  include/hw/pci/pci_regs.h | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
> index 57e8c80..0dcaf7e 100644
> --- a/include/hw/pci/pci_regs.h
> +++ b/include/hw/pci/pci_regs.h
> @@ -213,6 +213,7 @@
>  #define  PCI_CAP_ID_MSIX	0x11	/* MSI-X */
>  #define  PCI_CAP_ID_SATA	0x12	/* Serial ATA */
>  #define  PCI_CAP_ID_AF		0x13	/* PCI Advanced Features */
> +#define  PCI_CAP_ID_MIGRATION   0x14 
>  #define PCI_CAP_LIST_NEXT	1	/* Next capability in the list */
>  #define PCI_CAP_FLAGS		2	/* Capability defined flags (16 bits) */
>  #define PCI_CAP_SIZEOF		4
> @@ -716,4 +717,22 @@
>  #define PCI_ACS_CTRL		0x06	/* ACS Control Register */
>  #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress Control Vector */
>  
> +/* Migration*/
> +#define PCI_VF_MIGRATION_CAP        0x04
> +#define PCI_VF_MIGRATION_VMM_STATUS	0x05
> +#define PCI_VF_MIGRATION_VF_STATUS	0x06
> +#define PCI_VF_MIGRATION_IRQ		0x07
> +
> +#define PCI_VF_MIGRATION_CAP_SIZE   0x08
> +
> +#define VMM_MIGRATION_END        0x00
> +#define VMM_MIGRATION_START      0x01          
> +
> +#define PCI_VF_WAIT_FOR_MIGRATION   0x00          
> +#define PCI_VF_READY_FOR_MIGRATION  0x01        
> +
> +#define PCI_VF_MIGRATION_DISABLE    0x00
> +#define PCI_VF_MIGRATION_ENABLE     0x01
> +
> +
>  #endif /* LINUX_PCI_REGS_H */

This will of course break if the PCI SIG defines that capability index.
Couldn't this be done within a vendor defined capability?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability
@ 2015-12-02 22:25     ` Alex Williamson
  0 siblings, 0 replies; 142+ messages in thread
From: Alex Williamson @ 2015-12-02 22:25 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: emil.s.tantilov, kvm, ard.biesheuvel, aik, donald.c.skidmore,
	mst, eddie.dong, agraf, quintela, qemu-devel, blauwirbel,
	cornelia.huck, nrupal.jani, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or

On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
> This patch is to extend PCI CAP id for migration cap and
> add reg macros. The CAP ID is trial and we may find better one if the
> solution is feasible.
> 
> *PCI_VF_MIGRATION_CAP
> For VF driver to  control that triggers mailbox irq or not during migration.
> 
> *PCI_VF_MIGRATION_VMM_STATUS
> Qemu stores migration status in the reg
> 
> *PCI_VF_MIGRATION_VF_STATUS
> VF driver tells Qemu ready for migration
> 
> *PCI_VF_MIGRATION_IRQ
> VF driver stores mailbox interrupt vector in the reg for Qemu to trigger during migration.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  include/hw/pci/pci_regs.h | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
> index 57e8c80..0dcaf7e 100644
> --- a/include/hw/pci/pci_regs.h
> +++ b/include/hw/pci/pci_regs.h
> @@ -213,6 +213,7 @@
>  #define  PCI_CAP_ID_MSIX	0x11	/* MSI-X */
>  #define  PCI_CAP_ID_SATA	0x12	/* Serial ATA */
>  #define  PCI_CAP_ID_AF		0x13	/* PCI Advanced Features */
> +#define  PCI_CAP_ID_MIGRATION   0x14 
>  #define PCI_CAP_LIST_NEXT	1	/* Next capability in the list */
>  #define PCI_CAP_FLAGS		2	/* Capability defined flags (16 bits) */
>  #define PCI_CAP_SIZEOF		4
> @@ -716,4 +717,22 @@
>  #define PCI_ACS_CTRL		0x06	/* ACS Control Register */
>  #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress Control Vector */
>  
> +/* Migration*/
> +#define PCI_VF_MIGRATION_CAP        0x04
> +#define PCI_VF_MIGRATION_VMM_STATUS	0x05
> +#define PCI_VF_MIGRATION_VF_STATUS	0x06
> +#define PCI_VF_MIGRATION_IRQ		0x07
> +
> +#define PCI_VF_MIGRATION_CAP_SIZE   0x08
> +
> +#define VMM_MIGRATION_END        0x00
> +#define VMM_MIGRATION_START      0x01          
> +
> +#define PCI_VF_WAIT_FOR_MIGRATION   0x00          
> +#define PCI_VF_READY_FOR_MIGRATION  0x01        
> +
> +#define PCI_VF_MIGRATION_DISABLE    0x00
> +#define PCI_VF_MIGRATION_ENABLE     0x01
> +
> +
>  #endif /* LINUX_PCI_REGS_H */

This will of course break if the PCI SIG defines that capability index.
Couldn't this be done within a vendor defined capability?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  2015-12-02 22:25     ` [Qemu-devel] " Alex Williamson
@ 2015-12-03  8:40       ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-03  8:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aik, amit.shah, anthony, ard.biesheuvel, blauwirbel,
	cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, gerlitz.or, donald.c.skidmore,
	mark.d.rustad, mst, kraxel, lcapitulino, quintela

On 12/3/2015 6:25 AM, Alex Williamson wrote:
> I didn't seen a matching kernel patch series for this, but why is the
> kernel more capable of doing this than userspace is already?
The following link is the kernel patch.
http://marc.info/?l=kvm&m=144837328920989&w=2

> These seem
> like pointless ioctls, we're creating a purely virtual PCI capability,
> the kernel doesn't really need to participate in that.

VFIO kernel driver has pci_config_map which indicates the PCI capability 
position and length which helps to find free PCI config regs. Qemu side 
doesn't have such info and can't get the exact table size of PCI 
capability. If we want to add such support in the Qemu, needs duplicates 
a lot of code of vfio_pci_configs.c in the Qemu.

> Also, why are we
> restricting ourselves to standard capabilities?

This version is to check whether it's on the right way and We can extend
this to pci extended capability later.

> That's often a crowded
> space and we can't always know whether an area is free or not based only
> on it being covered by a capability.  Some capabilities can also appear
> more than once, so there's context that isn't being passed to the kernel
> here.  Thanks,

The region outside of PCI capability are not passed to kernel or used by 
Qemu for MSI/MSIX . It's possible to use these places for new 
capability. One concerns is that guest driver may abuse them and quirk 
of masking some special regs outside of capability maybe helpful.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
@ 2015-12-03  8:40       ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-03  8:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: emil.s.tantilov, kvm, ard.biesheuvel, aik, donald.c.skidmore,
	mst, eddie.dong, agraf, quintela, qemu-devel, blauwirbel,
	cornelia.huck, nrupal.jani, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or

On 12/3/2015 6:25 AM, Alex Williamson wrote:
> I didn't seen a matching kernel patch series for this, but why is the
> kernel more capable of doing this than userspace is already?
The following link is the kernel patch.
http://marc.info/?l=kvm&m=144837328920989&w=2

> These seem
> like pointless ioctls, we're creating a purely virtual PCI capability,
> the kernel doesn't really need to participate in that.

VFIO kernel driver has pci_config_map which indicates the PCI capability 
position and length which helps to find free PCI config regs. Qemu side 
doesn't have such info and can't get the exact table size of PCI 
capability. If we want to add such support in the Qemu, needs duplicates 
a lot of code of vfio_pci_configs.c in the Qemu.

> Also, why are we
> restricting ourselves to standard capabilities?

This version is to check whether it's on the right way and We can extend
this to pci extended capability later.

> That's often a crowded
> space and we can't always know whether an area is free or not based only
> on it being covered by a capability.  Some capabilities can also appear
> more than once, so there's context that isn't being passed to the kernel
> here.  Thanks,

The region outside of PCI capability are not passed to kernel or used by 
Qemu for MSI/MSIX . It's possible to use these places for new 
capability. One concerns is that guest driver may abuse them and quirk 
of masking some special regs outside of capability maybe helpful.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
  2015-12-02 22:25     ` [Qemu-devel] " Alex Williamson
@ 2015-12-03  8:56       ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-03  8:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aik, amit.shah, anthony, ard.biesheuvel, blauwirbel,
	cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, gerlitz.or, donald.c.skidmore,
	mark.d.rustad, mst, kraxel, lcapitulino, quintela



On 12/3/2015 6:25 AM, Alex Williamson wrote:
> On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
>> This patch is to add SRIOV VF migration support.
>> Create new device type "vfio-sriov" and add faked PCI migration capability
>> to the type device.
>>
>> The purpose of the new capability
>> 1) sync migration status with VF driver in the VM
>> 2) Get mailbox irq vector to notify VF driver during migration.
>> 3) Provide a way to control injecting irq or not.
>>
>> Qemu will migrate PCI configure space regs and MSIX config for VF.
>> Inject mailbox irq at last stage of migration to notify VF about
>> migration event and wait VF driver ready for migration. VF driver
>> writeS PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
>> to tell Qemu.
>
> What makes this sr-iov specific?  Why wouldn't we simply extend vfio-pci
> with a migration=on feature?  Thanks,

Sounds reasonable and will update it.

>
> Alex
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support
@ 2015-12-03  8:56       ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-03  8:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: emil.s.tantilov, kvm, ard.biesheuvel, aik, donald.c.skidmore,
	mst, eddie.dong, agraf, quintela, qemu-devel, blauwirbel,
	cornelia.huck, nrupal.jani, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or



On 12/3/2015 6:25 AM, Alex Williamson wrote:
> On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
>> This patch is to add SRIOV VF migration support.
>> Create new device type "vfio-sriov" and add faked PCI migration capability
>> to the type device.
>>
>> The purpose of the new capability
>> 1) sync migration status with VF driver in the VM
>> 2) Get mailbox irq vector to notify VF driver during migration.
>> 3) Provide a way to control injecting irq or not.
>>
>> Qemu will migrate PCI configure space regs and MSIX config for VF.
>> Inject mailbox irq at last stage of migration to notify VF about
>> migration event and wait VF driver ready for migration. VF driver
>> writeS PCI config reg PCI_VF_MIGRATION_VF_STATUS in the new cap table
>> to tell Qemu.
>
> What makes this sr-iov specific?  Why wouldn't we simply extend vfio-pci
> with a migration=on feature?  Thanks,

Sounds reasonable and will update it.

>
> Alex
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability
  2015-12-02 22:25     ` [Qemu-devel] " Alex Williamson
@ 2015-12-03  8:57       ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-03  8:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aik, amit.shah, anthony, ard.biesheuvel, blauwirbel,
	cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, gerlitz.or, donald.c.skidmore,
	mark.d.rustad, mst, kraxel, lcapitulino, quintela



On 12/3/2015 6:25 AM, Alex Williamson wrote:
> This will of course break if the PCI SIG defines that capability index.
> Couldn't this be done within a vendor defined capability?  Thanks,

Yes, it should work and thanks for suggestion.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability
@ 2015-12-03  8:57       ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-03  8:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: emil.s.tantilov, kvm, ard.biesheuvel, aik, donald.c.skidmore,
	mst, eddie.dong, agraf, quintela, qemu-devel, blauwirbel,
	cornelia.huck, nrupal.jani, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or



On 12/3/2015 6:25 AM, Alex Williamson wrote:
> This will of course break if the PCI SIG defines that capability index.
> Couldn't this be done within a vendor defined capability?  Thanks,

Yes, it should work and thanks for suggestion.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-12-02 14:31           ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-03 14:53             ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-03 14:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela


On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote:
>> >We hope
>> >to find a better way to make SRIOV NIC work in these cases and this is
>> >worth to do since SRIOV NIC provides better network performance compared
>> >with PV NIC.
> If this is a performance optimization as the above implies,
> you need to include some numbers, and document how did
> you implement the switch and how did you measure the performance.
>

OK. Some ideas of my patches come from paper "CompSC: Live Migration with
Pass-through Devices".
http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf

It compared performance data between the solution of switching PV and VF 
and VF migration.(Chapter 7: Discussion)


>> >Current patches have some issues. I think we can find
>> >solution for them andimprove them step by step.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-12-03 14:53             ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-03 14:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or


On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote:
>> >We hope
>> >to find a better way to make SRIOV NIC work in these cases and this is
>> >worth to do since SRIOV NIC provides better network performance compared
>> >with PV NIC.
> If this is a performance optimization as the above implies,
> you need to include some numbers, and document how did
> you implement the switch and how did you measure the performance.
>

OK. Some ideas of my patches come from paper "CompSC: Live Migration with
Pass-through Devices".
http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf

It compared performance data between the solution of switching PV and VF 
and VF migration.(Chapter 7: Discussion)


>> >Current patches have some issues. I think we can find
>> >solution for them andimprove them step by step.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
  2015-12-03  8:40       ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-03 15:26         ` Alex Williamson
  -1 siblings, 0 replies; 142+ messages in thread
From: Alex Williamson @ 2015-12-03 15:26 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: aik, amit.shah, anthony, ard.biesheuvel, blauwirbel,
	cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, gerlitz.or, donald.c.skidmore,
	mark.d.rustad, mst, kraxel, lcapitulino, quintela

On Thu, 2015-12-03 at 16:40 +0800, Lan, Tianyu wrote:
> On 12/3/2015 6:25 AM, Alex Williamson wrote:
> > I didn't seen a matching kernel patch series for this, but why is the
> > kernel more capable of doing this than userspace is already?
> The following link is the kernel patch.
> http://marc.info/?l=kvm&m=144837328920989&w=2
> 
> > These seem
> > like pointless ioctls, we're creating a purely virtual PCI capability,
> > the kernel doesn't really need to participate in that.
> 
> VFIO kernel driver has pci_config_map which indicates the PCI capability 
> position and length which helps to find free PCI config regs. Qemu side 
> doesn't have such info and can't get the exact table size of PCI 
> capability. If we want to add such support in the Qemu, needs duplicates 
> a lot of code of vfio_pci_configs.c in the Qemu.

That's an internal implementation detail of the kernel, not motivation
for creating a new userspace ABI.  QEMU can recreate this data on its
own.  The kernel is in no more of an authoritative position to determine
capability extents than userspace.

> > Also, why are we
> > restricting ourselves to standard capabilities?
> 
> This version is to check whether it's on the right way and We can extend
> this to pci extended capability later.
> 
> > That's often a crowded
> > space and we can't always know whether an area is free or not based only
> > on it being covered by a capability.  Some capabilities can also appear
> > more than once, so there's context that isn't being passed to the kernel
> > here.  Thanks,
> 
> The region outside of PCI capability are not passed to kernel or used by 
> Qemu for MSI/MSIX . It's possible to use these places for new 
> capability. One concerns is that guest driver may abuse them and quirk 
> of masking some special regs outside of capability maybe helpful.

That's not correct, see kernel commit
a7d1ea1c11b33bda2691f3294b4d735ed635535a.  Gaps between capabilities are
exposed with raw read-write access from the kernel and some drivers and
devices depend on this.  There's also no guarantee that there's a
sufficiently sized gap in conventional space.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
@ 2015-12-03 15:26         ` Alex Williamson
  0 siblings, 0 replies; 142+ messages in thread
From: Alex Williamson @ 2015-12-03 15:26 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: emil.s.tantilov, kvm, ard.biesheuvel, aik, donald.c.skidmore,
	mst, eddie.dong, agraf, quintela, qemu-devel, blauwirbel,
	cornelia.huck, nrupal.jani, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or

On Thu, 2015-12-03 at 16:40 +0800, Lan, Tianyu wrote:
> On 12/3/2015 6:25 AM, Alex Williamson wrote:
> > I didn't seen a matching kernel patch series for this, but why is the
> > kernel more capable of doing this than userspace is already?
> The following link is the kernel patch.
> http://marc.info/?l=kvm&m=144837328920989&w=2
> 
> > These seem
> > like pointless ioctls, we're creating a purely virtual PCI capability,
> > the kernel doesn't really need to participate in that.
> 
> VFIO kernel driver has pci_config_map which indicates the PCI capability 
> position and length which helps to find free PCI config regs. Qemu side 
> doesn't have such info and can't get the exact table size of PCI 
> capability. If we want to add such support in the Qemu, needs duplicates 
> a lot of code of vfio_pci_configs.c in the Qemu.

That's an internal implementation detail of the kernel, not motivation
for creating a new userspace ABI.  QEMU can recreate this data on its
own.  The kernel is in no more of an authoritative position to determine
capability extents than userspace.

> > Also, why are we
> > restricting ourselves to standard capabilities?
> 
> This version is to check whether it's on the right way and We can extend
> this to pci extended capability later.
> 
> > That's often a crowded
> > space and we can't always know whether an area is free or not based only
> > on it being covered by a capability.  Some capabilities can also appear
> > more than once, so there's context that isn't being passed to the kernel
> > here.  Thanks,
> 
> The region outside of PCI capability are not passed to kernel or used by 
> Qemu for MSI/MSIX . It's possible to use these places for new 
> capability. One concerns is that guest driver may abuse them and quirk 
> of masking some special regs outside of capability maybe helpful.

That's not correct, see kernel commit
a7d1ea1c11b33bda2691f3294b4d735ed635535a.  Gaps between capabilities are
exposed with raw read-write access from the kernel and some drivers and
devices depend on this.  There's also no guarantee that there's a
sufficiently sized gap in conventional space.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-12-02 14:08         ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-03 18:32           ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-03 18:32 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Michael S. Tsirkin, aik, Alex Williamson, amit.shah, anthony,
	ard.biesheuvel, blauwirbel, cornelia.huck, Dong, Eddie, Jani,
	Nrupal, Alexander Graf, kvm, Paolo Bonzini, qemu-devel, Tantilov,
	Emil S, Or Gerlitz, Skidmore, Donald C, Rustad, Mark D, kraxel,
	lcapitulino, quintela

On Wed, Dec 2, 2015 at 6:08 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
> On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:
>>>
>>> But
>>> it requires guest OS to do specific configurations inside and rely on
>>> bonding driver which blocks it work on Windows.
>>>  From performance side,
>>> putting VF and virtio NIC under bonded interface will affect their
>>> performance even when not do migration. These factors block to use VF
>>> NIC passthough in some user cases(Especially in the cloud) which require
>>> migration.
>>
>>
>> That's really up to guest. You don't need to do bonding,
>> you can just move the IP and mac from userspace, that's
>> possible on most OS-es.
>>
>> Or write something in guest kernel that is more lightweight if you are
>> so inclined. What we are discussing here is the host-guest interface,
>> not the in-guest interface.
>>
>>> Current solution we proposed changes NIC driver and Qemu. Guest Os
>>> doesn't need to do special thing for migration.
>>> It's easy to deploy
>>
>>
>>
>> Except of course these patches don't even work properly yet.
>>
>> And when they do, even minor changes in host side NIC hardware across
>> migration will break guests in hard to predict ways.
>
>
> Switching between PV and VF NIC will introduce network stop and the
> latency of hotplug VF is measurable. For some user cases(cloud service
> and OPNFV) which are sensitive to network stabilization and performance,
> these are not friend and blocks SRIOV NIC usage in these case. We hope
> to find a better way to make SRIOV NIC work in these cases and this is
> worth to do since SRIOV NIC provides better network performance compared
> with PV NIC. Current patches have some issues. I think we can find
> solution for them andimprove them step by step.

I still believe the concepts being put into use here are deeply
flawed.  You are assuming you can somehow complete the migration while
the device is active and I seriously doubt that is the case.  You are
going to cause data corruption or worse cause a kernel panic when you
end up corrupting the guest memory.

You have to halt the device at some point in order to complete the
migration.  Now I fully agree it is best to do this for as small a
window as possible.  I really think that your best approach would be
embrace and extend the current solution that is making use of bonding.
The first step being to make it so that you don't have to hot-plug the
VF until just before you halt the guest instead of before you start he
migration.  Just doing that would yield a significant gain in terms of
performance during the migration.  In addition something like that
should be able to be done without having to be overly invasive into
the drivers.  A few tweaks to the DMA API and you could probably have
that resolved.

As far as avoiding the hot-plug itself that would be better handled as
a separate follow-up, and really belongs more to the PCI layer than
the NIC device drivers.  The device drivers should already have code
for handling a suspend/resume due to a power cycle event.  If you
could make use of that then it is just a matter of implementing
something in the hot-plug or PCIe drivers that would allow QEMU to
signal when the device needs to go into D3 and when it can resume
normal operation at D0.  You could probably use the PCI Bus Master
Enable bit as the test on if the device is ready for migration or not.
If the bit is set you cannot migrate the VM, and if it is cleared than
you are ready to migrate.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-12-03 18:32           ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-03 18:32 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: qemu-devel, Tantilov, Emil S, kvm, ard.biesheuvel, aik, Skidmore,
	Donald C, Michael S. Tsirkin, Dong, Eddie, Jani, Nrupal,
	quintela, Alexander Graf, blauwirbel, cornelia.huck,
	Alex Williamson, kraxel, anthony, amit.shah, Paolo Bonzini,
	Rustad, Mark D, lcapitulino, Or Gerlitz

On Wed, Dec 2, 2015 at 6:08 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
> On 12/1/2015 11:02 PM, Michael S. Tsirkin wrote:
>>>
>>> But
>>> it requires guest OS to do specific configurations inside and rely on
>>> bonding driver which blocks it work on Windows.
>>>  From performance side,
>>> putting VF and virtio NIC under bonded interface will affect their
>>> performance even when not do migration. These factors block to use VF
>>> NIC passthough in some user cases(Especially in the cloud) which require
>>> migration.
>>
>>
>> That's really up to guest. You don't need to do bonding,
>> you can just move the IP and mac from userspace, that's
>> possible on most OS-es.
>>
>> Or write something in guest kernel that is more lightweight if you are
>> so inclined. What we are discussing here is the host-guest interface,
>> not the in-guest interface.
>>
>>> Current solution we proposed changes NIC driver and Qemu. Guest Os
>>> doesn't need to do special thing for migration.
>>> It's easy to deploy
>>
>>
>>
>> Except of course these patches don't even work properly yet.
>>
>> And when they do, even minor changes in host side NIC hardware across
>> migration will break guests in hard to predict ways.
>
>
> Switching between PV and VF NIC will introduce network stop and the
> latency of hotplug VF is measurable. For some user cases(cloud service
> and OPNFV) which are sensitive to network stabilization and performance,
> these are not friend and blocks SRIOV NIC usage in these case. We hope
> to find a better way to make SRIOV NIC work in these cases and this is
> worth to do since SRIOV NIC provides better network performance compared
> with PV NIC. Current patches have some issues. I think we can find
> solution for them andimprove them step by step.

I still believe the concepts being put into use here are deeply
flawed.  You are assuming you can somehow complete the migration while
the device is active and I seriously doubt that is the case.  You are
going to cause data corruption or worse cause a kernel panic when you
end up corrupting the guest memory.

You have to halt the device at some point in order to complete the
migration.  Now I fully agree it is best to do this for as small a
window as possible.  I really think that your best approach would be
embrace and extend the current solution that is making use of bonding.
The first step being to make it so that you don't have to hot-plug the
VF until just before you halt the guest instead of before you start he
migration.  Just doing that would yield a significant gain in terms of
performance during the migration.  In addition something like that
should be able to be done without having to be overly invasive into
the drivers.  A few tweaks to the DMA API and you could probably have
that resolved.

As far as avoiding the hot-plug itself that would be better handled as
a separate follow-up, and really belongs more to the PCI layer than
the NIC device drivers.  The device drivers should already have code
for handling a suspend/resume due to a power cycle event.  If you
could make use of that then it is just a matter of implementing
something in the hot-plug or PCIe drivers that would allow QEMU to
signal when the device needs to go into D3 and when it can resume
normal operation at D0.  You could probably use the PCI Bus Master
Enable bit as the test on if the device is ready for migration or not.
If the bit is set you cannot migrate the VM, and if it is cleared than
you are ready to migrate.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-12-02 14:31           ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-04  6:42             ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-04  6:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela


On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote:
>> >We hope
>> >to find a better way to make SRIOV NIC work in these cases and this is
>> >worth to do since SRIOV NIC provides better network performance compared
>> >with PV NIC.
> If this is a performance optimization as the above implies,
> you need to include some numbers, and document how did
> you implement the switch and how did you measure the performance.
>

OK. Some ideas of my patches come from paper "CompSC: Live Migration with
Pass-through Devices".
http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf

It compared performance data between the solution of switching PV and VF 
and VF migration.(Chapter 7: Discussion)


>> >Current patches have some issues. I think we can find
>> >solution for them andimprove them step by step.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-12-04  6:42             ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-04  6:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or


On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote:
>> >We hope
>> >to find a better way to make SRIOV NIC work in these cases and this is
>> >worth to do since SRIOV NIC provides better network performance compared
>> >with PV NIC.
> If this is a performance optimization as the above implies,
> you need to include some numbers, and document how did
> you implement the switch and how did you measure the performance.
>

OK. Some ideas of my patches come from paper "CompSC: Live Migration with
Pass-through Devices".
http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf

It compared performance data between the solution of switching PV and VF 
and VF migration.(Chapter 7: Discussion)


>> >Current patches have some issues. I think we can find
>> >solution for them andimprove them step by step.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-12-04  6:42             ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-04  8:05               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-04  8:05 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On Fri, Dec 04, 2015 at 02:42:36PM +0800, Lan, Tianyu wrote:
> 
> On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote:
> >>>We hope
> >>>to find a better way to make SRIOV NIC work in these cases and this is
> >>>worth to do since SRIOV NIC provides better network performance compared
> >>>with PV NIC.
> >If this is a performance optimization as the above implies,
> >you need to include some numbers, and document how did
> >you implement the switch and how did you measure the performance.
> >
> 
> OK. Some ideas of my patches come from paper "CompSC: Live Migration with
> Pass-through Devices".
> http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf
> 
> It compared performance data between the solution of switching PV and VF and
> VF migration.(Chapter 7: Discussion)
> 

I haven't read it, but I would like to note you can't rely on research
papers.  If you propose a patch to be merged you need to measure what is
its actual effect on modern linux at the end of 2015.

> >>>Current patches have some issues. I think we can find
> >>>solution for them andimprove them step by step.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-12-04  8:05               ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-04  8:05 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On Fri, Dec 04, 2015 at 02:42:36PM +0800, Lan, Tianyu wrote:
> 
> On 12/2/2015 10:31 PM, Michael S. Tsirkin wrote:
> >>>We hope
> >>>to find a better way to make SRIOV NIC work in these cases and this is
> >>>worth to do since SRIOV NIC provides better network performance compared
> >>>with PV NIC.
> >If this is a performance optimization as the above implies,
> >you need to include some numbers, and document how did
> >you implement the switch and how did you measure the performance.
> >
> 
> OK. Some ideas of my patches come from paper "CompSC: Live Migration with
> Pass-through Devices".
> http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/papers/p109.pdf
> 
> It compared performance data between the solution of switching PV and VF and
> VF migration.(Chapter 7: Discussion)
> 

I haven't read it, but I would like to note you can't rely on research
papers.  If you propose a patch to be merged you need to measure what is
its actual effect on modern linux at the end of 2015.

> >>>Current patches have some issues. I think we can find
> >>>solution for them andimprove them step by step.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-12-04  8:05               ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-04 12:11                 ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-04 12:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela


On 12/4/2015 4:05 PM, Michael S. Tsirkin wrote:
> I haven't read it, but I would like to note you can't rely on research
> papers.  If you propose a patch to be merged you need to measure what is
> its actual effect on modern linux at the end of 2015.

Sure. Will do that.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2015-12-04 12:11                 ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-04 12:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or


On 12/4/2015 4:05 PM, Michael S. Tsirkin wrote:
> I haven't read it, but I would like to note you can't rely on research
> papers.  If you propose a patch to be merged you need to measure what is
> its actual effect on modern linux at the end of 2015.

Sure. Will do that.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* live migration vs device assignment (was Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC)
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2015-12-07 16:50   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-07 16:50 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
> This patchset is to propose a solution of adding live migration
> support for SRIOV NIC.

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.

TLDR:
- split this into 3 unrelated efforts/patchsets
- try implementing this host-side only using VT-d dirty tracking
- if making guest changes, make them in a way that makes many devices benefit
- measure speed before trying to improve it

-------

First, this does not help to actually do migration with an
active assigned device. Guest needs to deactivate the device
before VM is moved around.

What they are actually able to do, instead, is three things.
My suggestion is to split them up, and work on them
separately.  There's really no need to have them all.

I discuss all 3 things below, but if we do need to have some discussion,
please snip and  let's have separate threads for each item please.

1. Starting live migration with device running.
This might help speed up networking during pre-copy where there is a
long warm-up phase.

Note: To complete migration, one also has to do something to stop
the device, but that's a separate item, since existing hot-unplug
request will do that just as well.

Proposed changes of approach:
One option is to write into the dma memory to make it dirty.  Your
patches do this within the driver, but doing this in the generic dma
unmap code seems more elegant as it will help all devices.  An
interesting note: on unplug, driver unmaps all memory for DMA, so this
works out fine.

Some benchmarking will be needed to show the performance overhead.
It is likely non zero, so an interface would be needed
to enable this tracking before starting migration.

According to the VT-d spec, I note that bit 6 in the PTE is the dirty
bit.  Why don't we use this to detect memory changes by the device?
Specifically, periodically scan pages that we have already
sent, test and clear atomically the dirty bit in the PTE of
the IOMMU, and if set, resend the page.
The interface could be simply an ioctl for VFIO giving
it a range of memory, and have VFIO do the scan and set
bits for userspace.

This might be slower than writing into DMA page,
since e.g. PML does not work here.

We could go for a mixed approach, where we negotiate with the
guest: if guest can write into memory on unmap, then
skip the scanning, otherwise do scanning of IOMMU PTEs
as described above.

I would suggest starting with clean IOMMU PTE polling
on host. If you see that there is a performance problem,
optimize later by enabling the updates within guest
if required.

2.  (Presumably) faster device stop.
After the warmup phase, we need to enter the stop and
copy phase. At that point, device needs to be stopped.
One way to do this is to send request to guest while
we continue to track and send memory changes.
I am not sure whether this is what you are doing,
but I'm assuming it is.

I don't know what do you do on the host,
I guesss you could send removal request to guest, and
keep sending page updates meanwhile.
After guest eject/stop acknowledge is received on the host,
you can enter stop and copy.

Your patches seem to stop device with a custom device specific
register, but using more generic interfaces, such as
e.g. device removal, could also work, even if
it's less optimal.

The way you defined the interfaces, they don't
seem device specific at all.
A new PCI capability ID reserved by the PCI SIG
could be one way to add the new interface
if it's needed.

We also need a way to know what does guest support.
With hotplug we know all modern guests support
it, but with any custom code we need negotiation,
and then fall back on either hot unplug
or blocking migration.

Additionally, hot-unplug will unmap all dma
memory so if all dma unmap callbacks do
a write, you get that memory dirtied for free.

At the moment, device removal destroys state such as IP address and arp
cache, but we could have guest move these around
if necessary. Possibly this can be done in userspace with
the guest agent. We could discuss guest kernel or firmware solutions
if we need to address corner cases such as network boot.

You might run into hotplug behaviour such as
a 5 second timeout until device is actually
detected. It always seemed silly to me.
A simple if (!kvm) in that code might be justified.

The fact that guest cooperation is needed
to complete migration is a big problem IMHO.
This practically means you need to give a lot of
CPU to a guest on an overcommitted host
in order to be able to move it out to another host.
Meanwhile, guest can abuse the extra CPU it got.

Can not surprise removal be emulated instead?
Remove device from guest control by unmapping
it from guest PTEs, teach guest not to crash
and not to hang. Ideally reset the device instead.

This sounds like a useful thing to support even
outside virtualization context.

3.  (Presumably) faster device start
Finally, device needs to be started at destination.  Again, hotplug
will also work. Isn't it fast enough? where exactly is
the time spent?

Alternatively, some kind of hot-unplug that makes e.g.
net core save the device state so the following hotplug can restore it
back might work. This is closer to what you are trying to
do, but it is not very robust since device at source
and destination could be slightly different.

A full reset at destination sounds better.

If combining with surprise removal in 2 above, maybe pretend to linux
that there was a fatal error on the device, and have linux re-initialize
it?  To me this sounds better as it will survive across minor device
changes src/dst. Still won't survive if driver happens to change which
isn't something users always have control over.
We could teach linux about a new event that replaces
the device.

Again, might be a useful thing to support even
outside virtualization context.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* [Qemu-devel] live migration vs device assignment (was Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC)
@ 2015-12-07 16:50   ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-07 16:50 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
> This patchset is to propose a solution of adding live migration
> support for SRIOV NIC.

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.

TLDR:
- split this into 3 unrelated efforts/patchsets
- try implementing this host-side only using VT-d dirty tracking
- if making guest changes, make them in a way that makes many devices benefit
- measure speed before trying to improve it

-------

First, this does not help to actually do migration with an
active assigned device. Guest needs to deactivate the device
before VM is moved around.

What they are actually able to do, instead, is three things.
My suggestion is to split them up, and work on them
separately.  There's really no need to have them all.

I discuss all 3 things below, but if we do need to have some discussion,
please snip and  let's have separate threads for each item please.

1. Starting live migration with device running.
This might help speed up networking during pre-copy where there is a
long warm-up phase.

Note: To complete migration, one also has to do something to stop
the device, but that's a separate item, since existing hot-unplug
request will do that just as well.

Proposed changes of approach:
One option is to write into the dma memory to make it dirty.  Your
patches do this within the driver, but doing this in the generic dma
unmap code seems more elegant as it will help all devices.  An
interesting note: on unplug, driver unmaps all memory for DMA, so this
works out fine.

Some benchmarking will be needed to show the performance overhead.
It is likely non zero, so an interface would be needed
to enable this tracking before starting migration.

According to the VT-d spec, I note that bit 6 in the PTE is the dirty
bit.  Why don't we use this to detect memory changes by the device?
Specifically, periodically scan pages that we have already
sent, test and clear atomically the dirty bit in the PTE of
the IOMMU, and if set, resend the page.
The interface could be simply an ioctl for VFIO giving
it a range of memory, and have VFIO do the scan and set
bits for userspace.

This might be slower than writing into DMA page,
since e.g. PML does not work here.

We could go for a mixed approach, where we negotiate with the
guest: if guest can write into memory on unmap, then
skip the scanning, otherwise do scanning of IOMMU PTEs
as described above.

I would suggest starting with clean IOMMU PTE polling
on host. If you see that there is a performance problem,
optimize later by enabling the updates within guest
if required.

2.  (Presumably) faster device stop.
After the warmup phase, we need to enter the stop and
copy phase. At that point, device needs to be stopped.
One way to do this is to send request to guest while
we continue to track and send memory changes.
I am not sure whether this is what you are doing,
but I'm assuming it is.

I don't know what do you do on the host,
I guesss you could send removal request to guest, and
keep sending page updates meanwhile.
After guest eject/stop acknowledge is received on the host,
you can enter stop and copy.

Your patches seem to stop device with a custom device specific
register, but using more generic interfaces, such as
e.g. device removal, could also work, even if
it's less optimal.

The way you defined the interfaces, they don't
seem device specific at all.
A new PCI capability ID reserved by the PCI SIG
could be one way to add the new interface
if it's needed.

We also need a way to know what does guest support.
With hotplug we know all modern guests support
it, but with any custom code we need negotiation,
and then fall back on either hot unplug
or blocking migration.

Additionally, hot-unplug will unmap all dma
memory so if all dma unmap callbacks do
a write, you get that memory dirtied for free.

At the moment, device removal destroys state such as IP address and arp
cache, but we could have guest move these around
if necessary. Possibly this can be done in userspace with
the guest agent. We could discuss guest kernel or firmware solutions
if we need to address corner cases such as network boot.

You might run into hotplug behaviour such as
a 5 second timeout until device is actually
detected. It always seemed silly to me.
A simple if (!kvm) in that code might be justified.

The fact that guest cooperation is needed
to complete migration is a big problem IMHO.
This practically means you need to give a lot of
CPU to a guest on an overcommitted host
in order to be able to move it out to another host.
Meanwhile, guest can abuse the extra CPU it got.

Can not surprise removal be emulated instead?
Remove device from guest control by unmapping
it from guest PTEs, teach guest not to crash
and not to hang. Ideally reset the device instead.

This sounds like a useful thing to support even
outside virtualization context.

3.  (Presumably) faster device start
Finally, device needs to be started at destination.  Again, hotplug
will also work. Isn't it fast enough? where exactly is
the time spent?

Alternatively, some kind of hot-unplug that makes e.g.
net core save the device state so the following hotplug can restore it
back might work. This is closer to what you are trying to
do, but it is not very robust since device at source
and destination could be slightly different.

A full reset at destination sounds better.

If combining with surprise removal in 2 above, maybe pretend to linux
that there was a fatal error on the device, and have linux re-initialize
it?  To me this sounds better as it will survive across minor device
changes src/dst. Still won't survive if driver happens to change which
isn't something users always have control over.
We could teach linux about a new event that replaces
the device.

Again, might be a useful thing to support even
outside virtualization context.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-07 16:50   ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-09 16:26     ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-09 16:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> I thought about what this is doing at the high level, and I do have some
> value in what you are trying to do, but I also think we need to clarify
> the motivation a bit more.  What you are saying is not really what the
> patches are doing.
>
> And with that clearer understanding of the motivation in mind (assuming
> it actually captures a real need), I would also like to suggest some
> changes.

Motivation:
Most current solutions for migration with passthough device are based on
the PCI hotplug but it has side affect and can't work for all device.

For NIC device:
PCI hotplug solution can work around Network device migration
via switching VF and PF.

But switching network interface will introduce service down time.

I tested the service down time via putting VF and PV interface
into a bonded interface and ping the bonded interface during plug
and unplug VF.
1) About 100ms when add VF
2) About 30ms when del VF

It also requires guest to do switch configuration. These are hard to
manage and deploy from our customers. To maintain PV performance during
migration, host side also needs to assign a VF to PV device. This
affects scalability.

These factors block SRIOV NIC passthough usage in the cloud service and
OPNFV which require network high performance and stability a lot.

For other kind of devices, it's hard to work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist 
Technology to offload servers from handling compute-intensive 
operations, such as:
1) Symmetric cryptography functions including cipher operations and 
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic 
curve cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these 
operations will fail when unplug device.

So we are trying implementing a new solution which really migrates
device state to target machine and won't affect user during migration
with low service down time.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-09 16:26     ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-09 16:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> I thought about what this is doing at the high level, and I do have some
> value in what you are trying to do, but I also think we need to clarify
> the motivation a bit more.  What you are saying is not really what the
> patches are doing.
>
> And with that clearer understanding of the motivation in mind (assuming
> it actually captures a real need), I would also like to suggest some
> changes.

Motivation:
Most current solutions for migration with passthough device are based on
the PCI hotplug but it has side affect and can't work for all device.

For NIC device:
PCI hotplug solution can work around Network device migration
via switching VF and PF.

But switching network interface will introduce service down time.

I tested the service down time via putting VF and PV interface
into a bonded interface and ping the bonded interface during plug
and unplug VF.
1) About 100ms when add VF
2) About 30ms when del VF

It also requires guest to do switch configuration. These are hard to
manage and deploy from our customers. To maintain PV performance during
migration, host side also needs to assign a VF to PV device. This
affects scalability.

These factors block SRIOV NIC passthough usage in the cloud service and
OPNFV which require network high performance and stability a lot.

For other kind of devices, it's hard to work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist 
Technology to offload servers from handling compute-intensive 
operations, such as:
1) Symmetric cryptography functions including cipher operations and 
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic 
curve cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these 
operations will fail when unplug device.

So we are trying implementing a new solution which really migrates
device state to target machine and won't affect user during migration
with low service down time.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-09 16:26     ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-09 17:14       ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-09 17:14 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Michael S. Tsirkin, aik, Alex Williamson, amit.shah,
	Anthony Liguori, Ard Biesheuvel, Blue Swirl, cornelia.huck, Dong,
	Eddie, Jani, Nrupal, Alexander Graf, kvm, Paolo Bonzini,
	qemu-devel, Tantilov, Emil S, Or Gerlitz, Skidmore, Donald C,
	Rustad, Mark D, kraxel, lcapitulino, quintela

On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:

> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
>
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
>
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplug device.

I assume the problem is that with a PCI hotplug event you are losing
the state information for the device, do I have that right?

Looking over the QAT drivers it doesn't seem like any of them support
the suspend/resume PM calls.  I would imagine it makes it difficult
for a system with a QAT card in it to be able to drop the system to a
low power state.  You might want to try enabling suspend/resume
support for the devices on bare metal before you attempt to take on
migration as it would provide you with a good testing framework to see
what needs to be saved/restored within the device and in what order
before you attempt to do the same while migrating from one system to
another.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-09 17:14       ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-09 17:14 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: qemu-devel, Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore,
	Donald C, Michael S. Tsirkin, Dong, Eddie, Jani, Nrupal,
	quintela, Alexander Graf, Blue Swirl, cornelia.huck,
	Alex Williamson, kraxel, Anthony Liguori, amit.shah,
	Paolo Bonzini, Rustad, Mark D, lcapitulino, Or Gerlitz

On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:

> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
>
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
>
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplug device.

I assume the problem is that with a PCI hotplug event you are losing
the state information for the device, do I have that right?

Looking over the QAT drivers it doesn't seem like any of them support
the suspend/resume PM calls.  I would imagine it makes it difficult
for a system with a QAT card in it to be able to drop the system to a
low power state.  You might want to try enabling suspend/resume
support for the devices on bare metal before you attempt to take on
migration as it would provide you with a good testing framework to see
what needs to be saved/restored within the device and in what order
before you attempt to do the same while migrating from one system to
another.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-09 16:26     ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-09 20:07       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-09 20:07 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >I thought about what this is doing at the high level, and I do have some
> >value in what you are trying to do, but I also think we need to clarify
> >the motivation a bit more.  What you are saying is not really what the
> >patches are doing.
> >
> >And with that clearer understanding of the motivation in mind (assuming
> >it actually captures a real need), I would also like to suggest some
> >changes.
> 
> Motivation:
> Most current solutions for migration with passthough device are based on
> the PCI hotplug but it has side affect and can't work for all device.
> 
> For NIC device:
> PCI hotplug solution can work around Network device migration
> via switching VF and PF.

This is just more confusion. hotplug is just a way to add and remove
devices. switching VF and PF is up to guest and hypervisor.

> But switching network interface will introduce service down time.
> 
> I tested the service down time via putting VF and PV interface
> into a bonded interface and ping the bonded interface during plug
> and unplug VF.
> 1) About 100ms when add VF
> 2) About 30ms when del VF

OK and what's the source of the downtime?
I'm guessing that's just arp being repopulated.  So simply save and
re-populate it.

There would be a much cleaner solution.

Or maybe there's a timer there that just delays hotplug
for no reason. Fix it, everyone will benefit.

> It also requires guest to do switch configuration.

That's just wrong. if you want a switch, you need to
configure a switch.

> These are hard to
> manage and deploy from our customers.

So kernel want to remain flexible, and the stack is
configurable. Downside: customers need to deploy userspace
to configure it. Your solution: a hard-coded configuration
within kernel and hypervisor.  Sorry, this makes no sense.
If kernel is easier for you to deploy than userspace,
you need to rethink your deployment strategy.

> To maintain PV performance during
> migration, host side also needs to assign a VF to PV device. This
> affects scalability.

No idea what this means.

> These factors block SRIOV NIC passthough usage in the cloud service and
> OPNFV which require network high performance and stability a lot.

Everyone needs performance and scalability.

> 
> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
> 
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
> 
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplug device.
> 
> So we are trying implementing a new solution which really migrates
> device state to target machine and won't affect user during migration
> with low service down time.

Let's assume for the sake of the argument that there's a lot going on
and removing the device is just too slow (though you should figure out
what's going on before giving up and just building something new from
scratch).

I still don't think you should be migrating state.  That's just too
fragile, and it also means you depend on driver to be nice and shut down
device on source, so you can not migrate at will.  Instead, reset device
on destination and re-initialize it.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-09 20:07       ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-09 20:07 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >I thought about what this is doing at the high level, and I do have some
> >value in what you are trying to do, but I also think we need to clarify
> >the motivation a bit more.  What you are saying is not really what the
> >patches are doing.
> >
> >And with that clearer understanding of the motivation in mind (assuming
> >it actually captures a real need), I would also like to suggest some
> >changes.
> 
> Motivation:
> Most current solutions for migration with passthough device are based on
> the PCI hotplug but it has side affect and can't work for all device.
> 
> For NIC device:
> PCI hotplug solution can work around Network device migration
> via switching VF and PF.

This is just more confusion. hotplug is just a way to add and remove
devices. switching VF and PF is up to guest and hypervisor.

> But switching network interface will introduce service down time.
> 
> I tested the service down time via putting VF and PV interface
> into a bonded interface and ping the bonded interface during plug
> and unplug VF.
> 1) About 100ms when add VF
> 2) About 30ms when del VF

OK and what's the source of the downtime?
I'm guessing that's just arp being repopulated.  So simply save and
re-populate it.

There would be a much cleaner solution.

Or maybe there's a timer there that just delays hotplug
for no reason. Fix it, everyone will benefit.

> It also requires guest to do switch configuration.

That's just wrong. if you want a switch, you need to
configure a switch.

> These are hard to
> manage and deploy from our customers.

So kernel want to remain flexible, and the stack is
configurable. Downside: customers need to deploy userspace
to configure it. Your solution: a hard-coded configuration
within kernel and hypervisor.  Sorry, this makes no sense.
If kernel is easier for you to deploy than userspace,
you need to rethink your deployment strategy.

> To maintain PV performance during
> migration, host side also needs to assign a VF to PV device. This
> affects scalability.

No idea what this means.

> These factors block SRIOV NIC passthough usage in the cloud service and
> OPNFV which require network high performance and stability a lot.

Everyone needs performance and scalability.

> 
> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
> 
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
> 
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplug device.
> 
> So we are trying implementing a new solution which really migrates
> device state to target machine and won't affect user during migration
> with low service down time.

Let's assume for the sake of the argument that there's a lot going on
and removing the device is just too slow (though you should figure out
what's going on before giving up and just building something new from
scratch).

I still don't think you should be migrating state.  That's just too
fragile, and it also means you depend on driver to be nice and shut down
device on source, so you can not migrate at will.  Instead, reset device
on destination and re-initialize it.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-09 20:07       ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-10  3:04         ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-10  3:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela


On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>> I thought about what this is doing at the high level, and I do have some
>>> value in what you are trying to do, but I also think we need to clarify
>>> the motivation a bit more.  What you are saying is not really what the
>>> patches are doing.
>>>
>>> And with that clearer understanding of the motivation in mind (assuming
>>> it actually captures a real need), I would also like to suggest some
>>> changes.
>>
>> Motivation:
>> Most current solutions for migration with passthough device are based on
>> the PCI hotplug but it has side affect and can't work for all device.
>>
>> For NIC device:
>> PCI hotplug solution can work around Network device migration
>> via switching VF and PF.
>
> This is just more confusion. hotplug is just a way to add and remove
> devices. switching VF and PF is up to guest and hypervisor.

This is a combination. Because it's not able to migrate device state in
the current world during migration(What we are doing), Exist solutions
of migrating VM with passthough NIC relies on the PCI hotplug. Unplug VF
before starting migration and then switch network from VF NIC to PV NIC
in order to maintain the network connection. Plug VF again after
migration and then switch from PV back to VF. Bond driver provides a way 
to switch between PV and VF NIC automatically with save IP and MAC and 
so bond driver is more preferred.

>
>> But switching network interface will introduce service down time.
>>
>> I tested the service down time via putting VF and PV interface
>> into a bonded interface and ping the bonded interface during plug
>> and unplug VF.
>> 1) About 100ms when add VF
>> 2) About 30ms when del VF
>
> OK and what's the source of the downtime?
> I'm guessing that's just arp being repopulated.  So simply save and
> re-populate it.
>
> There would be a much cleaner solution.
>
> Or maybe there's a timer there that just delays hotplug
> for no reason. Fix it, everyone will benefit.
>
>> It also requires guest to do switch configuration.
>
> That's just wrong. if you want a switch, you need to
> configure a switch.

I meant the config of switching operation between PV and VF.

>
>> These are hard to
>> manage and deploy from our customers.
>
> So kernel want to remain flexible, and the stack is
> configurable. Downside: customers need to deploy userspace
> to configure it. Your solution: a hard-coded configuration
> within kernel and hypervisor.  Sorry, this makes no sense.
> If kernel is easier for you to deploy than userspace,
> you need to rethink your deployment strategy.

This is one factor.

>
>> To maintain PV performance during
>> migration, host side also needs to assign a VF to PV device. This
>> affects scalability.
>
> No idea what this means.
>
>> These factors block SRIOV NIC passthough usage in the cloud service and
>> OPNFV which require network high performance and stability a lot.
>
> Everyone needs performance and scalability.
>
>>
>> For other kind of devices, it's hard to work.
>> We are also adding migration support for QAT(QuickAssist Technology) device.
>>
>> QAT device user case introduction.
>> Server, networking, big data, and storage applications use QuickAssist
>> Technology to offload servers from handling compute-intensive operations,
>> such as:
>> 1) Symmetric cryptography functions including cipher operations and
>> authentication operations
>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
>> cryptography
>> 3) Compression and decompression functions including DEFLATE and LZS
>>
>> PCI hotplug will not work for such devices during migration and these
>> operations will fail when unplug device.
>>
>> So we are trying implementing a new solution which really migrates
>> device state to target machine and won't affect user during migration
>> with low service down time.
>
> Let's assume for the sake of the argument that there's a lot going on
> and removing the device is just too slow (though you should figure out
> what's going on before giving up and just building something new from
> scratch).

No, we can find a PV NIC as backup for VF NIC during migration but it 
doesn't work for other kinds of device since there is no backup for 
them. E,G When migration happens during users compresses files via QAT, 
it's impossible to remove QAT at that point. If do that, the compress 
operation will fail and affect user experience.

>
> I still don't think you should be migrating state.  That's just too
> fragile, and it also means you depend on driver to be nice and shut down
> device on source, so you can not migrate at will.  Instead, reset device
> on destination and re-initialize it.
>

Yes, saving and restoring device state relies on the driver and so we 
reworks driver and make it more friend to migration.



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10  3:04         ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-10  3:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or


On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>> I thought about what this is doing at the high level, and I do have some
>>> value in what you are trying to do, but I also think we need to clarify
>>> the motivation a bit more.  What you are saying is not really what the
>>> patches are doing.
>>>
>>> And with that clearer understanding of the motivation in mind (assuming
>>> it actually captures a real need), I would also like to suggest some
>>> changes.
>>
>> Motivation:
>> Most current solutions for migration with passthough device are based on
>> the PCI hotplug but it has side affect and can't work for all device.
>>
>> For NIC device:
>> PCI hotplug solution can work around Network device migration
>> via switching VF and PF.
>
> This is just more confusion. hotplug is just a way to add and remove
> devices. switching VF and PF is up to guest and hypervisor.

This is a combination. Because it's not able to migrate device state in
the current world during migration(What we are doing), Exist solutions
of migrating VM with passthough NIC relies on the PCI hotplug. Unplug VF
before starting migration and then switch network from VF NIC to PV NIC
in order to maintain the network connection. Plug VF again after
migration and then switch from PV back to VF. Bond driver provides a way 
to switch between PV and VF NIC automatically with save IP and MAC and 
so bond driver is more preferred.

>
>> But switching network interface will introduce service down time.
>>
>> I tested the service down time via putting VF and PV interface
>> into a bonded interface and ping the bonded interface during plug
>> and unplug VF.
>> 1) About 100ms when add VF
>> 2) About 30ms when del VF
>
> OK and what's the source of the downtime?
> I'm guessing that's just arp being repopulated.  So simply save and
> re-populate it.
>
> There would be a much cleaner solution.
>
> Or maybe there's a timer there that just delays hotplug
> for no reason. Fix it, everyone will benefit.
>
>> It also requires guest to do switch configuration.
>
> That's just wrong. if you want a switch, you need to
> configure a switch.

I meant the config of switching operation between PV and VF.

>
>> These are hard to
>> manage and deploy from our customers.
>
> So kernel want to remain flexible, and the stack is
> configurable. Downside: customers need to deploy userspace
> to configure it. Your solution: a hard-coded configuration
> within kernel and hypervisor.  Sorry, this makes no sense.
> If kernel is easier for you to deploy than userspace,
> you need to rethink your deployment strategy.

This is one factor.

>
>> To maintain PV performance during
>> migration, host side also needs to assign a VF to PV device. This
>> affects scalability.
>
> No idea what this means.
>
>> These factors block SRIOV NIC passthough usage in the cloud service and
>> OPNFV which require network high performance and stability a lot.
>
> Everyone needs performance and scalability.
>
>>
>> For other kind of devices, it's hard to work.
>> We are also adding migration support for QAT(QuickAssist Technology) device.
>>
>> QAT device user case introduction.
>> Server, networking, big data, and storage applications use QuickAssist
>> Technology to offload servers from handling compute-intensive operations,
>> such as:
>> 1) Symmetric cryptography functions including cipher operations and
>> authentication operations
>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
>> cryptography
>> 3) Compression and decompression functions including DEFLATE and LZS
>>
>> PCI hotplug will not work for such devices during migration and these
>> operations will fail when unplug device.
>>
>> So we are trying implementing a new solution which really migrates
>> device state to target machine and won't affect user during migration
>> with low service down time.
>
> Let's assume for the sake of the argument that there's a lot going on
> and removing the device is just too slow (though you should figure out
> what's going on before giving up and just building something new from
> scratch).

No, we can find a PV NIC as backup for VF NIC during migration but it 
doesn't work for other kinds of device since there is no backup for 
them. E,G When migration happens during users compresses files via QAT, 
it's impossible to remove QAT at that point. If do that, the compress 
operation will fail and affect user experience.

>
> I still don't think you should be migrating state.  That's just too
> fragile, and it also means you depend on driver to be nice and shut down
> device on source, so you can not migrate at will.  Instead, reset device
> on destination and re-initialize it.
>

Yes, saving and restoring device state relies on the driver and so we 
reworks driver and make it more friend to migration.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-09 17:14       ` [Qemu-devel] " Alexander Duyck
@ 2015-12-10  3:15         ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-10  3:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, aik, Alex Williamson, amit.shah,
	Anthony Liguori, Ard Biesheuvel, Blue Swirl, cornelia.huck, Dong,
	Eddie, Jani, Nrupal, Alexander Graf, kvm, Paolo Bonzini,
	qemu-devel, Tantilov, Emil S, Or Gerlitz, Skidmore, Donald C,
	Rustad, Mark D, kraxel, lcapitulino, quintela



On 12/10/2015 1:14 AM, Alexander Duyck wrote:
> On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>
>> For other kind of devices, it's hard to work.
>> We are also adding migration support for QAT(QuickAssist Technology) device.
>>
>> QAT device user case introduction.
>> Server, networking, big data, and storage applications use QuickAssist
>> Technology to offload servers from handling compute-intensive operations,
>> such as:
>> 1) Symmetric cryptography functions including cipher operations and
>> authentication operations
>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
>> cryptography
>> 3) Compression and decompression functions including DEFLATE and LZS
>>
>> PCI hotplug will not work for such devices during migration and these
>> operations will fail when unplug device.
>
> I assume the problem is that with a PCI hotplug event you are losing
> the state information for the device, do I have that right?
>
> Looking over the QAT drivers it doesn't seem like any of them support
> the suspend/resume PM calls.  I would imagine it makes it difficult
> for a system with a QAT card in it to be able to drop the system to a
> low power state.  You might want to try enabling suspend/resume
> support for the devices on bare metal before you attempt to take on
> migration as it would provide you with a good testing framework to see
> what needs to be saved/restored within the device and in what order
> before you attempt to do the same while migrating from one system to
> another.

Sure. The suspend/resume job is under way.
Actually, we have enabled QAT work for migration internally. Doing more 
test and fixing bugs.

>
> - Alex
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10  3:15         ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-10  3:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: qemu-devel, Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore,
	Donald C, Michael S. Tsirkin, Dong, Eddie, Jani, Nrupal,
	quintela, Alexander Graf, Blue Swirl, cornelia.huck,
	Alex Williamson, kraxel, Anthony Liguori, amit.shah,
	Paolo Bonzini, Rustad, Mark D, lcapitulino, Or Gerlitz



On 12/10/2015 1:14 AM, Alexander Duyck wrote:
> On Wed, Dec 9, 2015 at 8:26 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>
>> For other kind of devices, it's hard to work.
>> We are also adding migration support for QAT(QuickAssist Technology) device.
>>
>> QAT device user case introduction.
>> Server, networking, big data, and storage applications use QuickAssist
>> Technology to offload servers from handling compute-intensive operations,
>> such as:
>> 1) Symmetric cryptography functions including cipher operations and
>> authentication operations
>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
>> cryptography
>> 3) Compression and decompression functions including DEFLATE and LZS
>>
>> PCI hotplug will not work for such devices during migration and these
>> operations will fail when unplug device.
>
> I assume the problem is that with a PCI hotplug event you are losing
> the state information for the device, do I have that right?
>
> Looking over the QAT drivers it doesn't seem like any of them support
> the suspend/resume PM calls.  I would imagine it makes it difficult
> for a system with a QAT card in it to be able to drop the system to a
> low power state.  You might want to try enabling suspend/resume
> support for the devices on bare metal before you attempt to take on
> migration as it would provide you with a good testing framework to see
> what needs to be saved/restored within the device and in what order
> before you attempt to do the same while migrating from one system to
> another.

Sure. The suspend/resume job is under way.
Actually, we have enabled QAT work for migration internally. Doing more 
test and fixing bugs.

>
> - Alex
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-10  3:04         ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-10  8:38           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-10  8:38 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On Thu, Dec 10, 2015 at 11:04:54AM +0800, Lan, Tianyu wrote:
> 
> On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:
> >On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
> >>On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >>>I thought about what this is doing at the high level, and I do have some
> >>>value in what you are trying to do, but I also think we need to clarify
> >>>the motivation a bit more.  What you are saying is not really what the
> >>>patches are doing.
> >>>
> >>>And with that clearer understanding of the motivation in mind (assuming
> >>>it actually captures a real need), I would also like to suggest some
> >>>changes.
> >>
> >>Motivation:
> >>Most current solutions for migration with passthough device are based on
> >>the PCI hotplug but it has side affect and can't work for all device.
> >>
> >>For NIC device:
> >>PCI hotplug solution can work around Network device migration
> >>via switching VF and PF.
> >
> >This is just more confusion. hotplug is just a way to add and remove
> >devices. switching VF and PF is up to guest and hypervisor.
> 
> This is a combination. Because it's not able to migrate device state in
> the current world during migration(What we are doing), Exist solutions
> of migrating VM with passthough NIC relies on the PCI hotplug.

That's where you go wrong I think. This marketing speak about solution
of migrating VM with passthrough is just confusing people.

There's no way to do migration with device passthrough on KVM at the
moment, in particular because of lack of way for host to save and
restore device state, and you do not propose a way either.

So how do people migrate? Stop doing device passthrough.
So what I think your patches do is add ability to do the two things
in parallel: stop doing passthrough and start migration.
You still can not migrate with passthrough.

> Unplug VF
> before starting migration and then switch network from VF NIC to PV NIC
> in order to maintain the network connection.

Again, this is mixing unrelated things.  This switching is not really
related to migration. You can do this at any time for any number of
reasons.  If migration takes a lot of time and if you unplug before
migration, then switching to another interface might make sense.
But it's question of policy.

> Plug VF again after
> migration and then switch from PV back to VF. Bond driver provides a way to
> switch between PV and VF NIC automatically with save IP and MAC and so bond
> driver is more preferred.

Preferred over switching manually? As long as it works well, sure.  But
one can come up with other techniques.  For example, don't switch. Save
ip, mac etc, remove source device and add the destination one.  You were
also complaining that the switch took too long.

> >
> >>But switching network interface will introduce service down time.
> >>
> >>I tested the service down time via putting VF and PV interface
> >>into a bonded interface and ping the bonded interface during plug
> >>and unplug VF.
> >>1) About 100ms when add VF
> >>2) About 30ms when del VF
> >
> >OK and what's the source of the downtime?
> >I'm guessing that's just arp being repopulated.  So simply save and
> >re-populate it.
> >
> >There would be a much cleaner solution.
> >
> >Or maybe there's a timer there that just delays hotplug
> >for no reason. Fix it, everyone will benefit.
> >
> >>It also requires guest to do switch configuration.
> >
> >That's just wrong. if you want a switch, you need to
> >configure a switch.
> 
> I meant the config of switching operation between PV and VF.

I see. So sure, there are many ways to configure networking
on linux. You seem to see this as a downside and so want
to hardcode a single configuration into the driver.

> >
> >>These are hard to
> >>manage and deploy from our customers.
> >
> >So kernel want to remain flexible, and the stack is
> >configurable. Downside: customers need to deploy userspace
> >to configure it. Your solution: a hard-coded configuration
> >within kernel and hypervisor.  Sorry, this makes no sense.
> >If kernel is easier for you to deploy than userspace,
> >you need to rethink your deployment strategy.
> 
> This is one factor.
> 
> >
> >>To maintain PV performance during
> >>migration, host side also needs to assign a VF to PV device. This
> >>affects scalability.
> >
> >No idea what this means.
> >
> >>These factors block SRIOV NIC passthough usage in the cloud service and
> >>OPNFV which require network high performance and stability a lot.
> >
> >Everyone needs performance and scalability.
> >
> >>
> >>For other kind of devices, it's hard to work.
> >>We are also adding migration support for QAT(QuickAssist Technology) device.
> >>
> >>QAT device user case introduction.
> >>Server, networking, big data, and storage applications use QuickAssist
> >>Technology to offload servers from handling compute-intensive operations,
> >>such as:
> >>1) Symmetric cryptography functions including cipher operations and
> >>authentication operations
> >>2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> >>cryptography
> >>3) Compression and decompression functions including DEFLATE and LZS
> >>
> >>PCI hotplug will not work for such devices during migration and these
> >>operations will fail when unplug device.
> >>
> >>So we are trying implementing a new solution which really migrates
> >>device state to target machine and won't affect user during migration
> >>with low service down time.
> >
> >Let's assume for the sake of the argument that there's a lot going on
> >and removing the device is just too slow (though you should figure out
> >what's going on before giving up and just building something new from
> >scratch).
> 
> No, we can find a PV NIC as backup for VF NIC during migration but it
> doesn't work for other kinds of device since there is no backup for them.
> E,G When migration happens during users compresses files via QAT, it's
> impossible to remove QAT at that point. If do that, the compress operation
> will fail and affect user experience.

I absolutely agree here. Switching to a PV device is just something
people can do. It must not be a requirement. Some people like
doing that though. So it's policy, and should be left to userspace.

> >
> >I still don't think you should be migrating state.  That's just too
> >fragile, and it also means you depend on driver to be nice and shut down
> >device on source, so you can not migrate at will.  Instead, reset device
> >on destination and re-initialize it.
> >
> 
> Yes, saving and restoring device state relies on the driver and so we
> reworks driver and make it more friend to migration.

First, this remains very fragile. At your lab you probably have tens of
NIC devices with exactly the same hardware and firmware version, but out
there you can order a batch of 10 and get 10 slightly different
versions.  Will state saved from one work on the other?  One can be
pretty sure no one will test all possible combinations.

Let's assume you do save state and do have a way to detect
whether state matches a given hardware. For example,
driver could store firmware and hardware versions
in the state, and then on destination, retrieve them
and compare. It will be pretty common that you have a mismatch,
and you must not just fail migration. You need a way to recover,
maybe with more downtime.

Second, you can change the driver but you can not be sure it will have
the chance to run at all. Host overload is a common reason to migrate
out of the host.  You also can not trust guest to do the right thing.
So how long do you want to wait until you decide guest is not
cooperating and kill it?  Most people will probably experiment a bit and
then add a bit of a buffer. This is not robust at all.

Again, maybe you ask driver to save state, and if it does
not respond for a while, then you still migrate,
and driver has to recover on destination.

With the above in mind, you need to support two paths:
1. "good path": driver stores state on source, checks it on destination
   detects a match and restores state into the device
2. "bad path": driver does not store state, or detects a mismatch
   on destination. driver has to assume device was lost,
   and reset it

So what I am saying is, implement bad path first. Then good path
is an optimization - measure whether it's faster, and by how much.

Also, it would be nice if on the bad path there was a way
to switch to another driver entirely, even if that means
a bit more downtime. For example, have a way for driver to
tell Linux it has to re-do probing for the device.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10  8:38           ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-10  8:38 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On Thu, Dec 10, 2015 at 11:04:54AM +0800, Lan, Tianyu wrote:
> 
> On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:
> >On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
> >>On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >>>I thought about what this is doing at the high level, and I do have some
> >>>value in what you are trying to do, but I also think we need to clarify
> >>>the motivation a bit more.  What you are saying is not really what the
> >>>patches are doing.
> >>>
> >>>And with that clearer understanding of the motivation in mind (assuming
> >>>it actually captures a real need), I would also like to suggest some
> >>>changes.
> >>
> >>Motivation:
> >>Most current solutions for migration with passthough device are based on
> >>the PCI hotplug but it has side affect and can't work for all device.
> >>
> >>For NIC device:
> >>PCI hotplug solution can work around Network device migration
> >>via switching VF and PF.
> >
> >This is just more confusion. hotplug is just a way to add and remove
> >devices. switching VF and PF is up to guest and hypervisor.
> 
> This is a combination. Because it's not able to migrate device state in
> the current world during migration(What we are doing), Exist solutions
> of migrating VM with passthough NIC relies on the PCI hotplug.

That's where you go wrong I think. This marketing speak about solution
of migrating VM with passthrough is just confusing people.

There's no way to do migration with device passthrough on KVM at the
moment, in particular because of lack of way for host to save and
restore device state, and you do not propose a way either.

So how do people migrate? Stop doing device passthrough.
So what I think your patches do is add ability to do the two things
in parallel: stop doing passthrough and start migration.
You still can not migrate with passthrough.

> Unplug VF
> before starting migration and then switch network from VF NIC to PV NIC
> in order to maintain the network connection.

Again, this is mixing unrelated things.  This switching is not really
related to migration. You can do this at any time for any number of
reasons.  If migration takes a lot of time and if you unplug before
migration, then switching to another interface might make sense.
But it's question of policy.

> Plug VF again after
> migration and then switch from PV back to VF. Bond driver provides a way to
> switch between PV and VF NIC automatically with save IP and MAC and so bond
> driver is more preferred.

Preferred over switching manually? As long as it works well, sure.  But
one can come up with other techniques.  For example, don't switch. Save
ip, mac etc, remove source device and add the destination one.  You were
also complaining that the switch took too long.

> >
> >>But switching network interface will introduce service down time.
> >>
> >>I tested the service down time via putting VF and PV interface
> >>into a bonded interface and ping the bonded interface during plug
> >>and unplug VF.
> >>1) About 100ms when add VF
> >>2) About 30ms when del VF
> >
> >OK and what's the source of the downtime?
> >I'm guessing that's just arp being repopulated.  So simply save and
> >re-populate it.
> >
> >There would be a much cleaner solution.
> >
> >Or maybe there's a timer there that just delays hotplug
> >for no reason. Fix it, everyone will benefit.
> >
> >>It also requires guest to do switch configuration.
> >
> >That's just wrong. if you want a switch, you need to
> >configure a switch.
> 
> I meant the config of switching operation between PV and VF.

I see. So sure, there are many ways to configure networking
on linux. You seem to see this as a downside and so want
to hardcode a single configuration into the driver.

> >
> >>These are hard to
> >>manage and deploy from our customers.
> >
> >So kernel want to remain flexible, and the stack is
> >configurable. Downside: customers need to deploy userspace
> >to configure it. Your solution: a hard-coded configuration
> >within kernel and hypervisor.  Sorry, this makes no sense.
> >If kernel is easier for you to deploy than userspace,
> >you need to rethink your deployment strategy.
> 
> This is one factor.
> 
> >
> >>To maintain PV performance during
> >>migration, host side also needs to assign a VF to PV device. This
> >>affects scalability.
> >
> >No idea what this means.
> >
> >>These factors block SRIOV NIC passthough usage in the cloud service and
> >>OPNFV which require network high performance and stability a lot.
> >
> >Everyone needs performance and scalability.
> >
> >>
> >>For other kind of devices, it's hard to work.
> >>We are also adding migration support for QAT(QuickAssist Technology) device.
> >>
> >>QAT device user case introduction.
> >>Server, networking, big data, and storage applications use QuickAssist
> >>Technology to offload servers from handling compute-intensive operations,
> >>such as:
> >>1) Symmetric cryptography functions including cipher operations and
> >>authentication operations
> >>2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> >>cryptography
> >>3) Compression and decompression functions including DEFLATE and LZS
> >>
> >>PCI hotplug will not work for such devices during migration and these
> >>operations will fail when unplug device.
> >>
> >>So we are trying implementing a new solution which really migrates
> >>device state to target machine and won't affect user during migration
> >>with low service down time.
> >
> >Let's assume for the sake of the argument that there's a lot going on
> >and removing the device is just too slow (though you should figure out
> >what's going on before giving up and just building something new from
> >scratch).
> 
> No, we can find a PV NIC as backup for VF NIC during migration but it
> doesn't work for other kinds of device since there is no backup for them.
> E,G When migration happens during users compresses files via QAT, it's
> impossible to remove QAT at that point. If do that, the compress operation
> will fail and affect user experience.

I absolutely agree here. Switching to a PV device is just something
people can do. It must not be a requirement. Some people like
doing that though. So it's policy, and should be left to userspace.

> >
> >I still don't think you should be migrating state.  That's just too
> >fragile, and it also means you depend on driver to be nice and shut down
> >device on source, so you can not migrate at will.  Instead, reset device
> >on destination and re-initialize it.
> >
> 
> Yes, saving and restoring device state relies on the driver and so we
> reworks driver and make it more friend to migration.

First, this remains very fragile. At your lab you probably have tens of
NIC devices with exactly the same hardware and firmware version, but out
there you can order a batch of 10 and get 10 slightly different
versions.  Will state saved from one work on the other?  One can be
pretty sure no one will test all possible combinations.

Let's assume you do save state and do have a way to detect
whether state matches a given hardware. For example,
driver could store firmware and hardware versions
in the state, and then on destination, retrieve them
and compare. It will be pretty common that you have a mismatch,
and you must not just fail migration. You need a way to recover,
maybe with more downtime.

Second, you can change the driver but you can not be sure it will have
the chance to run at all. Host overload is a common reason to migrate
out of the host.  You also can not trust guest to do the right thing.
So how long do you want to wait until you decide guest is not
cooperating and kill it?  Most people will probably experiment a bit and
then add a bit of a buffer. This is not robust at all.

Again, maybe you ask driver to save state, and if it does
not respond for a while, then you still migrate,
and driver has to recover on destination.

With the above in mind, you need to support two paths:
1. "good path": driver stores state on source, checks it on destination
   detects a match and restores state into the device
2. "bad path": driver does not store state, or detects a mismatch
   on destination. driver has to assume device was lost,
   and reset it

So what I am saying is, implement bad path first. Then good path
is an optimization - measure whether it's faster, and by how much.

Also, it would be nice if on the bad path there was a way
to switch to another driver entirely, even if that means
a bit more downtime. For example, have a way for driver to
tell Linux it has to re-do probing for the device.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-09 16:26     ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-10 10:18       ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 142+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-10 10:18 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Michael S. Tsirkin, qemu-devel, emil.s.tantilov, kvm,
	ard.biesheuvel, aik, donald.c.skidmore, quintela, eddie.dong,
	nrupal.jani, agraf, blauwirbel, cornelia.huck, alex.williamson,
	kraxel, anthony, amit.shah, pbonzini, mark.d.rustad, lcapitulino,
	gerlitz.or

* Lan, Tianyu (tianyu.lan@intel.com) wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >I thought about what this is doing at the high level, and I do have some
> >value in what you are trying to do, but I also think we need to clarify
> >the motivation a bit more.  What you are saying is not really what the
> >patches are doing.
> >
> >And with that clearer understanding of the motivation in mind (assuming
> >it actually captures a real need), I would also like to suggest some
> >changes.
> 
> Motivation:
> Most current solutions for migration with passthough device are based on
> the PCI hotplug but it has side affect and can't work for all device.
> 
> For NIC device:
> PCI hotplug solution can work around Network device migration
> via switching VF and PF.
> 
> But switching network interface will introduce service down time.
> 
> I tested the service down time via putting VF and PV interface
> into a bonded interface and ping the bonded interface during plug
> and unplug VF.
> 1) About 100ms when add VF
> 2) About 30ms when del VF
> 
> It also requires guest to do switch configuration. These are hard to
> manage and deploy from our customers. To maintain PV performance during
> migration, host side also needs to assign a VF to PV device. This
> affects scalability.
> 
> These factors block SRIOV NIC passthough usage in the cloud service and
> OPNFV which require network high performance and stability a lot.

Right, that I'll agree it's hard to do migration of a VM which uses
an SRIOV device; and while I think it should be possible to bond a virtio device
to a VF for networking and then hotplug the SR-IOV device I agree it's hard to manage.

> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
> 
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
> 
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplug device.

I don't understand that QAT argument; if the device is purely an offload
engine for performance, then why can't you fall back to doing the
same operations in the VM or in QEMU if the card is unavailable?
The tricky bit is dealing with outstanding operations.

> So we are trying implementing a new solution which really migrates
> device state to target machine and won't affect user during migration
> with low service down time.

Right, that's a good aim - the only question is how to do it.

It looks like this is always going to need some device-specific code;
the question I see is whether that's in:
    1) qemu
    2) the host kernel
    3) the guest kernel driver

The objections to this series seem to be that it needs changes to (3);
I can see the worry that the guest kernel driver might not get a chance
to run during the right time in migration and it's painful having to
change every guest driver (although your change is small).

My question is what stage of the migration process do you expect to tell
the guest kernel driver to do this?

    If you do it at the start of the migration, and quiesce the device,
    the migration might take a long time (say 30 minutes) - are you
    intending the device to be quiesced for this long? And where are
    you going to send the traffic?
    If you are, then do you need to do it via this PCI trick, or could
    you just do it via something higher level to quiesce the device.

    Or are you intending to do it just near the end of the migration?
    But then how do we know how long it will take the guest driver to
    respond?

It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 10:18       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 142+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-10 10:18 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: lcapitulino, alex.williamson, emil.s.tantilov, kvm,
	ard.biesheuvel, aik, donald.c.skidmore, Michael S. Tsirkin,
	eddie.dong, qemu-devel, agraf, blauwirbel, quintela, nrupal.jani,
	kraxel, anthony, cornelia.huck, pbonzini, mark.d.rustad,
	amit.shah, gerlitz.or

* Lan, Tianyu (tianyu.lan@intel.com) wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >I thought about what this is doing at the high level, and I do have some
> >value in what you are trying to do, but I also think we need to clarify
> >the motivation a bit more.  What you are saying is not really what the
> >patches are doing.
> >
> >And with that clearer understanding of the motivation in mind (assuming
> >it actually captures a real need), I would also like to suggest some
> >changes.
> 
> Motivation:
> Most current solutions for migration with passthough device are based on
> the PCI hotplug but it has side affect and can't work for all device.
> 
> For NIC device:
> PCI hotplug solution can work around Network device migration
> via switching VF and PF.
> 
> But switching network interface will introduce service down time.
> 
> I tested the service down time via putting VF and PV interface
> into a bonded interface and ping the bonded interface during plug
> and unplug VF.
> 1) About 100ms when add VF
> 2) About 30ms when del VF
> 
> It also requires guest to do switch configuration. These are hard to
> manage and deploy from our customers. To maintain PV performance during
> migration, host side also needs to assign a VF to PV device. This
> affects scalability.
> 
> These factors block SRIOV NIC passthough usage in the cloud service and
> OPNFV which require network high performance and stability a lot.

Right, that I'll agree it's hard to do migration of a VM which uses
an SRIOV device; and while I think it should be possible to bond a virtio device
to a VF for networking and then hotplug the SR-IOV device I agree it's hard to manage.

> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
> 
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
> 
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplug device.

I don't understand that QAT argument; if the device is purely an offload
engine for performance, then why can't you fall back to doing the
same operations in the VM or in QEMU if the card is unavailable?
The tricky bit is dealing with outstanding operations.

> So we are trying implementing a new solution which really migrates
> device state to target machine and won't affect user during migration
> with low service down time.

Right, that's a good aim - the only question is how to do it.

It looks like this is always going to need some device-specific code;
the question I see is whether that's in:
    1) qemu
    2) the host kernel
    3) the guest kernel driver

The objections to this series seem to be that it needs changes to (3);
I can see the worry that the guest kernel driver might not get a chance
to run during the right time in migration and it's painful having to
change every guest driver (although your change is small).

My question is what stage of the migration process do you expect to tell
the guest kernel driver to do this?

    If you do it at the start of the migration, and quiesce the device,
    the migration might take a long time (say 30 minutes) - are you
    intending the device to be quiesced for this long? And where are
    you going to send the traffic?
    If you are, then do you need to do it via this PCI trick, or could
    you just do it via something higher level to quiesce the device.

    Or are you intending to do it just near the end of the migration?
    But then how do we know how long it will take the guest driver to
    respond?

It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-10 10:18       ` Dr. David Alan Gilbert
@ 2015-12-10 11:28         ` Yang Zhang
  -1 siblings, 0 replies; 142+ messages in thread
From: Yang Zhang @ 2015-12-10 11:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Lan, Tianyu
  Cc: Michael S. Tsirkin, qemu-devel, emil.s.tantilov, kvm,
	ard.biesheuvel, aik, donald.c.skidmore, quintela, eddie.dong,
	nrupal.jani, agraf, blauwirbel, cornelia.huck, alex.williamson,
	kraxel, anthony, amit.shah, pbonzini, mark.d.rustad, lcapitulino,
	gerlitz.or

On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
> * Lan, Tianyu (tianyu.lan@intel.com) wrote:
>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>> I thought about what this is doing at the high level, and I do have some
>>> value in what you are trying to do, but I also think we need to clarify
>>> the motivation a bit more.  What you are saying is not really what the
>>> patches are doing.
>>>
>>> And with that clearer understanding of the motivation in mind (assuming
>>> it actually captures a real need), I would also like to suggest some
>>> changes.
>>
>> Motivation:
>> Most current solutions for migration with passthough device are based on
>> the PCI hotplug but it has side affect and can't work for all device.
>>
>> For NIC device:
>> PCI hotplug solution can work around Network device migration
>> via switching VF and PF.
>>
>> But switching network interface will introduce service down time.
>>
>> I tested the service down time via putting VF and PV interface
>> into a bonded interface and ping the bonded interface during plug
>> and unplug VF.
>> 1) About 100ms when add VF
>> 2) About 30ms when del VF
>>
>> It also requires guest to do switch configuration. These are hard to
>> manage and deploy from our customers. To maintain PV performance during
>> migration, host side also needs to assign a VF to PV device. This
>> affects scalability.
>>
>> These factors block SRIOV NIC passthough usage in the cloud service and
>> OPNFV which require network high performance and stability a lot.
>
> Right, that I'll agree it's hard to do migration of a VM which uses
> an SRIOV device; and while I think it should be possible to bond a virtio device
> to a VF for networking and then hotplug the SR-IOV device I agree it's hard to manage.
>
>> For other kind of devices, it's hard to work.
>> We are also adding migration support for QAT(QuickAssist Technology) device.
>>
>> QAT device user case introduction.
>> Server, networking, big data, and storage applications use QuickAssist
>> Technology to offload servers from handling compute-intensive operations,
>> such as:
>> 1) Symmetric cryptography functions including cipher operations and
>> authentication operations
>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
>> cryptography
>> 3) Compression and decompression functions including DEFLATE and LZS
>>
>> PCI hotplug will not work for such devices during migration and these
>> operations will fail when unplug device.
>
> I don't understand that QAT argument; if the device is purely an offload
> engine for performance, then why can't you fall back to doing the
> same operations in the VM or in QEMU if the card is unavailable?
> The tricky bit is dealing with outstanding operations.
>
>> So we are trying implementing a new solution which really migrates
>> device state to target machine and won't affect user during migration
>> with low service down time.
>
> Right, that's a good aim - the only question is how to do it.
>
> It looks like this is always going to need some device-specific code;
> the question I see is whether that's in:
>      1) qemu
>      2) the host kernel
>      3) the guest kernel driver
>
> The objections to this series seem to be that it needs changes to (3);
> I can see the worry that the guest kernel driver might not get a chance
> to run during the right time in migration and it's painful having to
> change every guest driver (although your change is small).
>
> My question is what stage of the migration process do you expect to tell
> the guest kernel driver to do this?
>
>      If you do it at the start of the migration, and quiesce the device,
>      the migration might take a long time (say 30 minutes) - are you
>      intending the device to be quiesced for this long? And where are
>      you going to send the traffic?
>      If you are, then do you need to do it via this PCI trick, or could
>      you just do it via something higher level to quiesce the device.
>
>      Or are you intending to do it just near the end of the migration?
>      But then how do we know how long it will take the guest driver to
>      respond?

Ideally, it is able to leave guest driver unmodified but it requires the 
hypervisor or qemu to aware the device which means we may need a driver 
in hypervisor or qemu to handle the device on behalf of guest driver.

>
> It would be great if we could avoid changing the guest; but at least your guest
> driver changes don't actually seem to be that hardware specific; could your
> changes actually be moved to generic PCI level so they could be made
> to work for lots of drivers?

It is impossible to use one common solution for all devices unless the 
PCIE spec documents it clearly and i think one day it will be there. But 
before that, we need some workarounds on guest driver to make it work 
even it looks ugly.


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 11:28         ` Yang Zhang
  0 siblings, 0 replies; 142+ messages in thread
From: Yang Zhang @ 2015-12-10 11:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Lan, Tianyu
  Cc: lcapitulino, alex.williamson, emil.s.tantilov, kvm,
	ard.biesheuvel, aik, donald.c.skidmore, Michael S. Tsirkin,
	eddie.dong, qemu-devel, agraf, blauwirbel, quintela, nrupal.jani,
	kraxel, anthony, cornelia.huck, pbonzini, mark.d.rustad,
	amit.shah, gerlitz.or

On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
> * Lan, Tianyu (tianyu.lan@intel.com) wrote:
>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>> I thought about what this is doing at the high level, and I do have some
>>> value in what you are trying to do, but I also think we need to clarify
>>> the motivation a bit more.  What you are saying is not really what the
>>> patches are doing.
>>>
>>> And with that clearer understanding of the motivation in mind (assuming
>>> it actually captures a real need), I would also like to suggest some
>>> changes.
>>
>> Motivation:
>> Most current solutions for migration with passthough device are based on
>> the PCI hotplug but it has side affect and can't work for all device.
>>
>> For NIC device:
>> PCI hotplug solution can work around Network device migration
>> via switching VF and PF.
>>
>> But switching network interface will introduce service down time.
>>
>> I tested the service down time via putting VF and PV interface
>> into a bonded interface and ping the bonded interface during plug
>> and unplug VF.
>> 1) About 100ms when add VF
>> 2) About 30ms when del VF
>>
>> It also requires guest to do switch configuration. These are hard to
>> manage and deploy from our customers. To maintain PV performance during
>> migration, host side also needs to assign a VF to PV device. This
>> affects scalability.
>>
>> These factors block SRIOV NIC passthough usage in the cloud service and
>> OPNFV which require network high performance and stability a lot.
>
> Right, that I'll agree it's hard to do migration of a VM which uses
> an SRIOV device; and while I think it should be possible to bond a virtio device
> to a VF for networking and then hotplug the SR-IOV device I agree it's hard to manage.
>
>> For other kind of devices, it's hard to work.
>> We are also adding migration support for QAT(QuickAssist Technology) device.
>>
>> QAT device user case introduction.
>> Server, networking, big data, and storage applications use QuickAssist
>> Technology to offload servers from handling compute-intensive operations,
>> such as:
>> 1) Symmetric cryptography functions including cipher operations and
>> authentication operations
>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
>> cryptography
>> 3) Compression and decompression functions including DEFLATE and LZS
>>
>> PCI hotplug will not work for such devices during migration and these
>> operations will fail when unplug device.
>
> I don't understand that QAT argument; if the device is purely an offload
> engine for performance, then why can't you fall back to doing the
> same operations in the VM or in QEMU if the card is unavailable?
> The tricky bit is dealing with outstanding operations.
>
>> So we are trying implementing a new solution which really migrates
>> device state to target machine and won't affect user during migration
>> with low service down time.
>
> Right, that's a good aim - the only question is how to do it.
>
> It looks like this is always going to need some device-specific code;
> the question I see is whether that's in:
>      1) qemu
>      2) the host kernel
>      3) the guest kernel driver
>
> The objections to this series seem to be that it needs changes to (3);
> I can see the worry that the guest kernel driver might not get a chance
> to run during the right time in migration and it's painful having to
> change every guest driver (although your change is small).
>
> My question is what stage of the migration process do you expect to tell
> the guest kernel driver to do this?
>
>      If you do it at the start of the migration, and quiesce the device,
>      the migration might take a long time (say 30 minutes) - are you
>      intending the device to be quiesced for this long? And where are
>      you going to send the traffic?
>      If you are, then do you need to do it via this PCI trick, or could
>      you just do it via something higher level to quiesce the device.
>
>      Or are you intending to do it just near the end of the migration?
>      But then how do we know how long it will take the guest driver to
>      respond?

Ideally, it is able to leave guest driver unmodified but it requires the 
hypervisor or qemu to aware the device which means we may need a driver 
in hypervisor or qemu to handle the device on behalf of guest driver.

>
> It would be great if we could avoid changing the guest; but at least your guest
> driver changes don't actually seem to be that hardware specific; could your
> changes actually be moved to generic PCI level so they could be made
> to work for lots of drivers?

It is impossible to use one common solution for all devices unless the 
PCIE spec documents it clearly and i think one day it will be there. But 
before that, we need some workarounds on guest driver to make it work 
even it looks ugly.


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-10 11:28         ` Yang Zhang
@ 2015-12-10 11:41           ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 142+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-10 11:41 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Lan, Tianyu, Michael S. Tsirkin, qemu-devel, emil.s.tantilov,
	kvm, ard.biesheuvel, aik, donald.c.skidmore, quintela,
	eddie.dong, nrupal.jani, agraf, blauwirbel, cornelia.huck,
	alex.williamson, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or

* Yang Zhang (yang.zhang.wz@gmail.com) wrote:
> On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
> >* Lan, Tianyu (tianyu.lan@intel.com) wrote:
> >>On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >>>I thought about what this is doing at the high level, and I do have some
> >>>value in what you are trying to do, but I also think we need to clarify
> >>>the motivation a bit more.  What you are saying is not really what the
> >>>patches are doing.
> >>>
> >>>And with that clearer understanding of the motivation in mind (assuming
> >>>it actually captures a real need), I would also like to suggest some
> >>>changes.
> >>
> >>Motivation:
> >>Most current solutions for migration with passthough device are based on
> >>the PCI hotplug but it has side affect and can't work for all device.
> >>
> >>For NIC device:
> >>PCI hotplug solution can work around Network device migration
> >>via switching VF and PF.
> >>
> >>But switching network interface will introduce service down time.
> >>
> >>I tested the service down time via putting VF and PV interface
> >>into a bonded interface and ping the bonded interface during plug
> >>and unplug VF.
> >>1) About 100ms when add VF
> >>2) About 30ms when del VF
> >>
> >>It also requires guest to do switch configuration. These are hard to
> >>manage and deploy from our customers. To maintain PV performance during
> >>migration, host side also needs to assign a VF to PV device. This
> >>affects scalability.
> >>
> >>These factors block SRIOV NIC passthough usage in the cloud service and
> >>OPNFV which require network high performance and stability a lot.
> >
> >Right, that I'll agree it's hard to do migration of a VM which uses
> >an SRIOV device; and while I think it should be possible to bond a virtio device
> >to a VF for networking and then hotplug the SR-IOV device I agree it's hard to manage.
> >
> >>For other kind of devices, it's hard to work.
> >>We are also adding migration support for QAT(QuickAssist Technology) device.
> >>
> >>QAT device user case introduction.
> >>Server, networking, big data, and storage applications use QuickAssist
> >>Technology to offload servers from handling compute-intensive operations,
> >>such as:
> >>1) Symmetric cryptography functions including cipher operations and
> >>authentication operations
> >>2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> >>cryptography
> >>3) Compression and decompression functions including DEFLATE and LZS
> >>
> >>PCI hotplug will not work for such devices during migration and these
> >>operations will fail when unplug device.
> >
> >I don't understand that QAT argument; if the device is purely an offload
> >engine for performance, then why can't you fall back to doing the
> >same operations in the VM or in QEMU if the card is unavailable?
> >The tricky bit is dealing with outstanding operations.
> >
> >>So we are trying implementing a new solution which really migrates
> >>device state to target machine and won't affect user during migration
> >>with low service down time.
> >
> >Right, that's a good aim - the only question is how to do it.
> >
> >It looks like this is always going to need some device-specific code;
> >the question I see is whether that's in:
> >     1) qemu
> >     2) the host kernel
> >     3) the guest kernel driver
> >
> >The objections to this series seem to be that it needs changes to (3);
> >I can see the worry that the guest kernel driver might not get a chance
> >to run during the right time in migration and it's painful having to
> >change every guest driver (although your change is small).
> >
> >My question is what stage of the migration process do you expect to tell
> >the guest kernel driver to do this?
> >
> >     If you do it at the start of the migration, and quiesce the device,
> >     the migration might take a long time (say 30 minutes) - are you
> >     intending the device to be quiesced for this long? And where are
> >     you going to send the traffic?
> >     If you are, then do you need to do it via this PCI trick, or could
> >     you just do it via something higher level to quiesce the device.
> >
> >     Or are you intending to do it just near the end of the migration?
> >     But then how do we know how long it will take the guest driver to
> >     respond?
> 
> Ideally, it is able to leave guest driver unmodified but it requires the
> hypervisor or qemu to aware the device which means we may need a driver in
> hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
   at the start of migration or
   just before the end?

> >It would be great if we could avoid changing the guest; but at least your guest
> >driver changes don't actually seem to be that hardware specific; could your
> >changes actually be moved to generic PCI level so they could be made
> >to work for lots of drivers?
> 
> It is impossible to use one common solution for all devices unless the PCIE
> spec documents it clearly and i think one day it will be there. But before
> that, we need some workarounds on guest driver to make it work even it looks
> ugly.

Dave

> 
> -- 
> best regards
> yang
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 11:41           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 142+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-10 11:41 UTC (permalink / raw)
  To: Yang Zhang
  Cc: emil.s.tantilov, kvm, Michael S. Tsirkin, aik, qemu-devel,
	lcapitulino, blauwirbel, kraxel, mark.d.rustad, quintela,
	donald.c.skidmore, agraf, gerlitz.or, alex.williamson, anthony,
	cornelia.huck, Lan, Tianyu, ard.biesheuvel, eddie.dong,
	nrupal.jani, amit.shah, pbonzini

* Yang Zhang (yang.zhang.wz@gmail.com) wrote:
> On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
> >* Lan, Tianyu (tianyu.lan@intel.com) wrote:
> >>On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >>>I thought about what this is doing at the high level, and I do have some
> >>>value in what you are trying to do, but I also think we need to clarify
> >>>the motivation a bit more.  What you are saying is not really what the
> >>>patches are doing.
> >>>
> >>>And with that clearer understanding of the motivation in mind (assuming
> >>>it actually captures a real need), I would also like to suggest some
> >>>changes.
> >>
> >>Motivation:
> >>Most current solutions for migration with passthough device are based on
> >>the PCI hotplug but it has side affect and can't work for all device.
> >>
> >>For NIC device:
> >>PCI hotplug solution can work around Network device migration
> >>via switching VF and PF.
> >>
> >>But switching network interface will introduce service down time.
> >>
> >>I tested the service down time via putting VF and PV interface
> >>into a bonded interface and ping the bonded interface during plug
> >>and unplug VF.
> >>1) About 100ms when add VF
> >>2) About 30ms when del VF
> >>
> >>It also requires guest to do switch configuration. These are hard to
> >>manage and deploy from our customers. To maintain PV performance during
> >>migration, host side also needs to assign a VF to PV device. This
> >>affects scalability.
> >>
> >>These factors block SRIOV NIC passthough usage in the cloud service and
> >>OPNFV which require network high performance and stability a lot.
> >
> >Right, that I'll agree it's hard to do migration of a VM which uses
> >an SRIOV device; and while I think it should be possible to bond a virtio device
> >to a VF for networking and then hotplug the SR-IOV device I agree it's hard to manage.
> >
> >>For other kind of devices, it's hard to work.
> >>We are also adding migration support for QAT(QuickAssist Technology) device.
> >>
> >>QAT device user case introduction.
> >>Server, networking, big data, and storage applications use QuickAssist
> >>Technology to offload servers from handling compute-intensive operations,
> >>such as:
> >>1) Symmetric cryptography functions including cipher operations and
> >>authentication operations
> >>2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> >>cryptography
> >>3) Compression and decompression functions including DEFLATE and LZS
> >>
> >>PCI hotplug will not work for such devices during migration and these
> >>operations will fail when unplug device.
> >
> >I don't understand that QAT argument; if the device is purely an offload
> >engine for performance, then why can't you fall back to doing the
> >same operations in the VM or in QEMU if the card is unavailable?
> >The tricky bit is dealing with outstanding operations.
> >
> >>So we are trying implementing a new solution which really migrates
> >>device state to target machine and won't affect user during migration
> >>with low service down time.
> >
> >Right, that's a good aim - the only question is how to do it.
> >
> >It looks like this is always going to need some device-specific code;
> >the question I see is whether that's in:
> >     1) qemu
> >     2) the host kernel
> >     3) the guest kernel driver
> >
> >The objections to this series seem to be that it needs changes to (3);
> >I can see the worry that the guest kernel driver might not get a chance
> >to run during the right time in migration and it's painful having to
> >change every guest driver (although your change is small).
> >
> >My question is what stage of the migration process do you expect to tell
> >the guest kernel driver to do this?
> >
> >     If you do it at the start of the migration, and quiesce the device,
> >     the migration might take a long time (say 30 minutes) - are you
> >     intending the device to be quiesced for this long? And where are
> >     you going to send the traffic?
> >     If you are, then do you need to do it via this PCI trick, or could
> >     you just do it via something higher level to quiesce the device.
> >
> >     Or are you intending to do it just near the end of the migration?
> >     But then how do we know how long it will take the guest driver to
> >     respond?
> 
> Ideally, it is able to leave guest driver unmodified but it requires the
> hypervisor or qemu to aware the device which means we may need a driver in
> hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
   at the start of migration or
   just before the end?

> >It would be great if we could avoid changing the guest; but at least your guest
> >driver changes don't actually seem to be that hardware specific; could your
> >changes actually be moved to generic PCI level so they could be made
> >to work for lots of drivers?
> 
> It is impossible to use one common solution for all devices unless the PCIE
> spec documents it clearly and i think one day it will be there. But before
> that, we need some workarounds on guest driver to make it work even it looks
> ugly.

Dave

> 
> -- 
> best regards
> yang
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-10 11:41           ` Dr. David Alan Gilbert
@ 2015-12-10 13:07             ` Yang Zhang
  -1 siblings, 0 replies; 142+ messages in thread
From: Yang Zhang @ 2015-12-10 13:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Lan, Tianyu, Michael S. Tsirkin, qemu-devel, emil.s.tantilov,
	kvm, ard.biesheuvel, aik, donald.c.skidmore, quintela,
	eddie.dong, nrupal.jani, agraf, blauwirbel, cornelia.huck,
	alex.williamson, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or

On 2015/12/10 19:41, Dr. David Alan Gilbert wrote:
> * Yang Zhang (yang.zhang.wz@gmail.com) wrote:
>> On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
>>> * Lan, Tianyu (tianyu.lan@intel.com) wrote:
>>>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>>>> I thought about what this is doing at the high level, and I do have some
>>>>> value in what you are trying to do, but I also think we need to clarify
>>>>> the motivation a bit more.  What you are saying is not really what the
>>>>> patches are doing.
>>>>>
>>>>> And with that clearer understanding of the motivation in mind (assuming
>>>>> it actually captures a real need), I would also like to suggest some
>>>>> changes.
>>>>
>>>> Motivation:
>>>> Most current solutions for migration with passthough device are based on
>>>> the PCI hotplug but it has side affect and can't work for all device.
>>>>
>>>> For NIC device:
>>>> PCI hotplug solution can work around Network device migration
>>>> via switching VF and PF.
>>>>
>>>> But switching network interface will introduce service down time.
>>>>
>>>> I tested the service down time via putting VF and PV interface
>>>> into a bonded interface and ping the bonded interface during plug
>>>> and unplug VF.
>>>> 1) About 100ms when add VF
>>>> 2) About 30ms when del VF
>>>>
>>>> It also requires guest to do switch configuration. These are hard to
>>>> manage and deploy from our customers. To maintain PV performance during
>>>> migration, host side also needs to assign a VF to PV device. This
>>>> affects scalability.
>>>>
>>>> These factors block SRIOV NIC passthough usage in the cloud service and
>>>> OPNFV which require network high performance and stability a lot.
>>>
>>> Right, that I'll agree it's hard to do migration of a VM which uses
>>> an SRIOV device; and while I think it should be possible to bond a virtio device
>>> to a VF for networking and then hotplug the SR-IOV device I agree it's hard to manage.
>>>
>>>> For other kind of devices, it's hard to work.
>>>> We are also adding migration support for QAT(QuickAssist Technology) device.
>>>>
>>>> QAT device user case introduction.
>>>> Server, networking, big data, and storage applications use QuickAssist
>>>> Technology to offload servers from handling compute-intensive operations,
>>>> such as:
>>>> 1) Symmetric cryptography functions including cipher operations and
>>>> authentication operations
>>>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
>>>> cryptography
>>>> 3) Compression and decompression functions including DEFLATE and LZS
>>>>
>>>> PCI hotplug will not work for such devices during migration and these
>>>> operations will fail when unplug device.
>>>
>>> I don't understand that QAT argument; if the device is purely an offload
>>> engine for performance, then why can't you fall back to doing the
>>> same operations in the VM or in QEMU if the card is unavailable?
>>> The tricky bit is dealing with outstanding operations.
>>>
>>>> So we are trying implementing a new solution which really migrates
>>>> device state to target machine and won't affect user during migration
>>>> with low service down time.
>>>
>>> Right, that's a good aim - the only question is how to do it.
>>>
>>> It looks like this is always going to need some device-specific code;
>>> the question I see is whether that's in:
>>>      1) qemu
>>>      2) the host kernel
>>>      3) the guest kernel driver
>>>
>>> The objections to this series seem to be that it needs changes to (3);
>>> I can see the worry that the guest kernel driver might not get a chance
>>> to run during the right time in migration and it's painful having to
>>> change every guest driver (although your change is small).
>>>
>>> My question is what stage of the migration process do you expect to tell
>>> the guest kernel driver to do this?
>>>
>>>      If you do it at the start of the migration, and quiesce the device,
>>>      the migration might take a long time (say 30 minutes) - are you
>>>      intending the device to be quiesced for this long? And where are
>>>      you going to send the traffic?
>>>      If you are, then do you need to do it via this PCI trick, or could
>>>      you just do it via something higher level to quiesce the device.
>>>
>>>      Or are you intending to do it just near the end of the migration?
>>>      But then how do we know how long it will take the guest driver to
>>>      respond?
>>
>> Ideally, it is able to leave guest driver unmodified but it requires the
>> hypervisor or qemu to aware the device which means we may need a driver in
>> hypervisor or qemu to handle the device on behalf of guest driver.
>
> Can you answer the question of when do you use your code -
>     at the start of migration or
>     just before the end?

Tianyu can answer this question. In my initial design, i prefer to put 
more modifications in hypervisor and Qemu, and the only involvement from 
guest driver is how to restore the state after migration. But I don't 
know the later implementation since i have left Intel.

>
>>> It would be great if we could avoid changing the guest; but at least your guest
>>> driver changes don't actually seem to be that hardware specific; could your
>>> changes actually be moved to generic PCI level so they could be made
>>> to work for lots of drivers?
>>
>> It is impossible to use one common solution for all devices unless the PCIE
>> spec documents it clearly and i think one day it will be there. But before
>> that, we need some workarounds on guest driver to make it work even it looks
>> ugly.
>
> Dave
>
>>
>> --
>> best regards
>> yang
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 13:07             ` Yang Zhang
  0 siblings, 0 replies; 142+ messages in thread
From: Yang Zhang @ 2015-12-10 13:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: emil.s.tantilov, kvm, Michael S. Tsirkin, aik, qemu-devel,
	lcapitulino, blauwirbel, kraxel, mark.d.rustad, quintela,
	donald.c.skidmore, agraf, gerlitz.or, alex.williamson, anthony,
	cornelia.huck, Lan, Tianyu, ard.biesheuvel, eddie.dong,
	nrupal.jani, amit.shah, pbonzini

On 2015/12/10 19:41, Dr. David Alan Gilbert wrote:
> * Yang Zhang (yang.zhang.wz@gmail.com) wrote:
>> On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
>>> * Lan, Tianyu (tianyu.lan@intel.com) wrote:
>>>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>>>> I thought about what this is doing at the high level, and I do have some
>>>>> value in what you are trying to do, but I also think we need to clarify
>>>>> the motivation a bit more.  What you are saying is not really what the
>>>>> patches are doing.
>>>>>
>>>>> And with that clearer understanding of the motivation in mind (assuming
>>>>> it actually captures a real need), I would also like to suggest some
>>>>> changes.
>>>>
>>>> Motivation:
>>>> Most current solutions for migration with passthough device are based on
>>>> the PCI hotplug but it has side affect and can't work for all device.
>>>>
>>>> For NIC device:
>>>> PCI hotplug solution can work around Network device migration
>>>> via switching VF and PF.
>>>>
>>>> But switching network interface will introduce service down time.
>>>>
>>>> I tested the service down time via putting VF and PV interface
>>>> into a bonded interface and ping the bonded interface during plug
>>>> and unplug VF.
>>>> 1) About 100ms when add VF
>>>> 2) About 30ms when del VF
>>>>
>>>> It also requires guest to do switch configuration. These are hard to
>>>> manage and deploy from our customers. To maintain PV performance during
>>>> migration, host side also needs to assign a VF to PV device. This
>>>> affects scalability.
>>>>
>>>> These factors block SRIOV NIC passthough usage in the cloud service and
>>>> OPNFV which require network high performance and stability a lot.
>>>
>>> Right, that I'll agree it's hard to do migration of a VM which uses
>>> an SRIOV device; and while I think it should be possible to bond a virtio device
>>> to a VF for networking and then hotplug the SR-IOV device I agree it's hard to manage.
>>>
>>>> For other kind of devices, it's hard to work.
>>>> We are also adding migration support for QAT(QuickAssist Technology) device.
>>>>
>>>> QAT device user case introduction.
>>>> Server, networking, big data, and storage applications use QuickAssist
>>>> Technology to offload servers from handling compute-intensive operations,
>>>> such as:
>>>> 1) Symmetric cryptography functions including cipher operations and
>>>> authentication operations
>>>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
>>>> cryptography
>>>> 3) Compression and decompression functions including DEFLATE and LZS
>>>>
>>>> PCI hotplug will not work for such devices during migration and these
>>>> operations will fail when unplug device.
>>>
>>> I don't understand that QAT argument; if the device is purely an offload
>>> engine for performance, then why can't you fall back to doing the
>>> same operations in the VM or in QEMU if the card is unavailable?
>>> The tricky bit is dealing with outstanding operations.
>>>
>>>> So we are trying implementing a new solution which really migrates
>>>> device state to target machine and won't affect user during migration
>>>> with low service down time.
>>>
>>> Right, that's a good aim - the only question is how to do it.
>>>
>>> It looks like this is always going to need some device-specific code;
>>> the question I see is whether that's in:
>>>      1) qemu
>>>      2) the host kernel
>>>      3) the guest kernel driver
>>>
>>> The objections to this series seem to be that it needs changes to (3);
>>> I can see the worry that the guest kernel driver might not get a chance
>>> to run during the right time in migration and it's painful having to
>>> change every guest driver (although your change is small).
>>>
>>> My question is what stage of the migration process do you expect to tell
>>> the guest kernel driver to do this?
>>>
>>>      If you do it at the start of the migration, and quiesce the device,
>>>      the migration might take a long time (say 30 minutes) - are you
>>>      intending the device to be quiesced for this long? And where are
>>>      you going to send the traffic?
>>>      If you are, then do you need to do it via this PCI trick, or could
>>>      you just do it via something higher level to quiesce the device.
>>>
>>>      Or are you intending to do it just near the end of the migration?
>>>      But then how do we know how long it will take the guest driver to
>>>      respond?
>>
>> Ideally, it is able to leave guest driver unmodified but it requires the
>> hypervisor or qemu to aware the device which means we may need a driver in
>> hypervisor or qemu to handle the device on behalf of guest driver.
>
> Can you answer the question of when do you use your code -
>     at the start of migration or
>     just before the end?

Tianyu can answer this question. In my initial design, i prefer to put 
more modifications in hypervisor and Qemu, and the only involvement from 
guest driver is how to restore the state after migration. But I don't 
know the later implementation since i have left Intel.

>
>>> It would be great if we could avoid changing the guest; but at least your guest
>>> driver changes don't actually seem to be that hardware specific; could your
>>> changes actually be moved to generic PCI level so they could be made
>>> to work for lots of drivers?
>>
>> It is impossible to use one common solution for all devices unless the PCIE
>> spec documents it clearly and i think one day it will be there. But before
>> that, we need some workarounds on guest driver to make it work even it looks
>> ugly.
>
> Dave
>
>>
>> --
>> best regards
>> yang
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-10  8:38           ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-10 14:23             ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-10 14:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, kraxel, lcapitulino, quintela

On 12/10/2015 4:38 PM, Michael S. Tsirkin wrote:
> Let's assume you do save state and do have a way to detect
> whether state matches a given hardware. For example,
> driver could store firmware and hardware versions
> in the state, and then on destination, retrieve them
> and compare. It will be pretty common that you have a mismatch,
> and you must not just fail migration. You need a way to recover,
> maybe with more downtime.
>
>
> Second, you can change the driver but you can not be sure it will have
> the chance to run at all. Host overload is a common reason to migrate
> out of the host.  You also can not trust guest to do the right thing.
> So how long do you want to wait until you decide guest is not
> cooperating and kill it?  Most people will probably experiment a bit and
> then add a bit of a buffer. This is not robust at all.
>
> Again, maybe you ask driver to save state, and if it does
> not respond for a while, then you still migrate,
> and driver has to recover on destination.
>
>
> With the above in mind, you need to support two paths:
> 1. "good path": driver stores state on source, checks it on destination
>     detects a match and restores state into the device
> 2. "bad path": driver does not store state, or detects a mismatch
>     on destination. driver has to assume device was lost,
>     and reset it
>
> So what I am saying is, implement bad path first. Then good path
> is an optimization - measure whether it's faster, and by how much.
>

These sound reasonable. Driver should have ability to do such check
to ensure hardware or firmware coherence after migration and reset 
device when migration happens at some unexpected position.


> Also, it would be nice if on the bad path there was a way
> to switch to another driver entirely, even if that means
> a bit more downtime. For example, have a way for driver to
> tell Linux it has to re-do probing for the device.

Just glace the code of device core. device_reprobe() does what you said.

/**
  * device_reprobe - remove driver for a device and probe for a new 	driver
  * @dev: the device to reprobe
  *
  * This function detaches the attached driver (if any) for the given
  * device and restarts the driver probing process.  It is intended
  * to use if probing criteria changed during a devices lifetime and
  * driver attachment should change accordingly.
  */
int device_reprobe(struct device *dev)






^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 14:23             ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-10 14:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, quintela, eddie.dong, nrupal.jani, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

On 12/10/2015 4:38 PM, Michael S. Tsirkin wrote:
> Let's assume you do save state and do have a way to detect
> whether state matches a given hardware. For example,
> driver could store firmware and hardware versions
> in the state, and then on destination, retrieve them
> and compare. It will be pretty common that you have a mismatch,
> and you must not just fail migration. You need a way to recover,
> maybe with more downtime.
>
>
> Second, you can change the driver but you can not be sure it will have
> the chance to run at all. Host overload is a common reason to migrate
> out of the host.  You also can not trust guest to do the right thing.
> So how long do you want to wait until you decide guest is not
> cooperating and kill it?  Most people will probably experiment a bit and
> then add a bit of a buffer. This is not robust at all.
>
> Again, maybe you ask driver to save state, and if it does
> not respond for a while, then you still migrate,
> and driver has to recover on destination.
>
>
> With the above in mind, you need to support two paths:
> 1. "good path": driver stores state on source, checks it on destination
>     detects a match and restores state into the device
> 2. "bad path": driver does not store state, or detects a mismatch
>     on destination. driver has to assume device was lost,
>     and reset it
>
> So what I am saying is, implement bad path first. Then good path
> is an optimization - measure whether it's faster, and by how much.
>

These sound reasonable. Driver should have ability to do such check
to ensure hardware or firmware coherence after migration and reset 
device when migration happens at some unexpected position.


> Also, it would be nice if on the bad path there was a way
> to switch to another driver entirely, even if that means
> a bit more downtime. For example, have a way for driver to
> tell Linux it has to re-do probing for the device.

Just glace the code of device core. device_reprobe() does what you said.

/**
  * device_reprobe - remove driver for a device and probe for a new 	driver
  * @dev: the device to reprobe
  *
  * This function detaches the attached driver (if any) for the given
  * device and restarts the driver probing process.  It is intended
  * to use if probing criteria changed during a devices lifetime and
  * driver attachment should change accordingly.
  */
int device_reprobe(struct device *dev)

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-10 11:41           ` Dr. David Alan Gilbert
@ 2015-12-10 14:38             ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-10 14:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Yang Zhang
  Cc: lcapitulino, alex.williamson, emil.s.tantilov, kvm,
	ard.biesheuvel, aik, donald.c.skidmore, Michael S. Tsirkin,
	eddie.dong, qemu-devel, agraf, blauwirbel, quintela, nrupal.jani,
	kraxel, anthony, cornelia.huck, pbonzini, mark.d.rustad,
	amit.shah, gerlitz.or



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>> Ideally, it is able to leave guest driver unmodified but it requires the
>> >hypervisor or qemu to aware the device which means we may need a driver in
>> >hypervisor or qemu to handle the device on behalf of guest driver.
> Can you answer the question of when do you use your code -
>     at the start of migration or
>     just before the end?

Just before stopping VCPU in this version and inject VF mailbox irq to
notify the driver if the irq handler is installed.
Qemu side also will check this via the faked PCI migration capability
and driver will set the status during device open() or resume() callback.

>
>>> > >It would be great if we could avoid changing the guest; but at least your guest
>>> > >driver changes don't actually seem to be that hardware specific; could your
>>> > >changes actually be moved to generic PCI level so they could be made
>>> > >to work for lots of drivers?
>> >
>> >It is impossible to use one common solution for all devices unless the PCIE
>> >spec documents it clearly and i think one day it will be there. But before
>> >that, we need some workarounds on guest driver to make it work even it looks
>> >ugly.

Yes, so far there is not hardware migration support and it's hard to 
modify bus level code. It also will block implementation on the Windows.

> Dave
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 14:38             ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-10 14:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Yang Zhang
  Cc: lcapitulino, alex.williamson, emil.s.tantilov, kvm,
	ard.biesheuvel, aik, donald.c.skidmore, Michael S. Tsirkin,
	eddie.dong, qemu-devel, agraf, blauwirbel, quintela, nrupal.jani,
	kraxel, anthony, cornelia.huck, pbonzini, mark.d.rustad,
	amit.shah, gerlitz.or



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>> Ideally, it is able to leave guest driver unmodified but it requires the
>> >hypervisor or qemu to aware the device which means we may need a driver in
>> >hypervisor or qemu to handle the device on behalf of guest driver.
> Can you answer the question of when do you use your code -
>     at the start of migration or
>     just before the end?

Just before stopping VCPU in this version and inject VF mailbox irq to
notify the driver if the irq handler is installed.
Qemu side also will check this via the faked PCI migration capability
and driver will set the status during device open() or resume() callback.

>
>>> > >It would be great if we could avoid changing the guest; but at least your guest
>>> > >driver changes don't actually seem to be that hardware specific; could your
>>> > >changes actually be moved to generic PCI level so they could be made
>>> > >to work for lots of drivers?
>> >
>> >It is impossible to use one common solution for all devices unless the PCIE
>> >spec documents it clearly and i think one day it will be there. But before
>> >that, we need some workarounds on guest driver to make it work even it looks
>> >ugly.

Yes, so far there is not hardware migration support and it's hard to 
modify bus level code. It also will block implementation on the Windows.

> Dave
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-10 14:38             ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-10 16:11               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-10 16:11 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Dr. David Alan Gilbert, Yang Zhang, qemu-devel, emil.s.tantilov,
	kvm, ard.biesheuvel, aik, donald.c.skidmore, quintela,
	eddie.dong, nrupal.jani, agraf, blauwirbel, cornelia.huck,
	alex.williamson, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or

On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>hypervisor or qemu to handle the device on behalf of guest driver.
> >Can you answer the question of when do you use your code -
> >    at the start of migration or
> >    just before the end?
> 
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

Right, this is the "good path" optimization. Whether this buys anything
as compared to just sending reset to the device when VCPU is stopped
needs to be measured. In any case, we probably do need a way to
interrupt driver on destination to make it reconfigure the device -
otherwise it might take seconds for it to notice.  And a way to make
sure driver can handle this surprise reset so we can block migration if
it can't.

> >
> >>>> >It would be great if we could avoid changing the guest; but at least your guest
> >>>> >driver changes don't actually seem to be that hardware specific; could your
> >>>> >changes actually be moved to generic PCI level so they could be made
> >>>> >to work for lots of drivers?
> >>>
> >>>It is impossible to use one common solution for all devices unless the PCIE
> >>>spec documents it clearly and i think one day it will be there. But before
> >>>that, we need some workarounds on guest driver to make it work even it looks
> >>>ugly.
> 
> Yes, so far there is not hardware migration support

VT-D supports setting dirty bit in the PTE in hardware.

> and it's hard to modify
> bus level code.

Why is it hard?

> It also will block implementation on the Windows.

Implementation of what?  We are discussing motivation here, not
implementation.  E.g. windows drivers typically support surprise
removal, should you use that, you get some working code for free.  Just
stop worrying about it.  Make it work, worry about closed source
software later.

> >Dave
> >

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 16:11               ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-10 16:11 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Yang Zhang, emil.s.tantilov, kvm, aik, qemu-devel, lcapitulino,
	blauwirbel, kraxel, mark.d.rustad, quintela, donald.c.skidmore,
	agraf, gerlitz.or, Dr. David Alan Gilbert, alex.williamson,
	anthony, cornelia.huck, ard.biesheuvel, eddie.dong, nrupal.jani,
	amit.shah, pbonzini

On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>hypervisor or qemu to handle the device on behalf of guest driver.
> >Can you answer the question of when do you use your code -
> >    at the start of migration or
> >    just before the end?
> 
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

Right, this is the "good path" optimization. Whether this buys anything
as compared to just sending reset to the device when VCPU is stopped
needs to be measured. In any case, we probably do need a way to
interrupt driver on destination to make it reconfigure the device -
otherwise it might take seconds for it to notice.  And a way to make
sure driver can handle this surprise reset so we can block migration if
it can't.

> >
> >>>> >It would be great if we could avoid changing the guest; but at least your guest
> >>>> >driver changes don't actually seem to be that hardware specific; could your
> >>>> >changes actually be moved to generic PCI level so they could be made
> >>>> >to work for lots of drivers?
> >>>
> >>>It is impossible to use one common solution for all devices unless the PCIE
> >>>spec documents it clearly and i think one day it will be there. But before
> >>>that, we need some workarounds on guest driver to make it work even it looks
> >>>ugly.
> 
> Yes, so far there is not hardware migration support

VT-D supports setting dirty bit in the PTE in hardware.

> and it's hard to modify
> bus level code.

Why is it hard?

> It also will block implementation on the Windows.

Implementation of what?  We are discussing motivation here, not
implementation.  E.g. windows drivers typically support surprise
removal, should you use that, you get some working code for free.  Just
stop worrying about it.  Make it work, worry about closed source
software later.

> >Dave
> >

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-10 14:38             ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-10 16:23               ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 142+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-10 16:23 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Yang Zhang, Michael S. Tsirkin, qemu-devel, emil.s.tantilov, kvm,
	ard.biesheuvel, aik, donald.c.skidmore, quintela, eddie.dong,
	nrupal.jani, agraf, blauwirbel, cornelia.huck, alex.williamson,
	kraxel, anthony, amit.shah, pbonzini, mark.d.rustad, lcapitulino,
	gerlitz.or

* Lan, Tianyu (tianyu.lan@intel.com) wrote:
> 
> 
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>hypervisor or qemu to handle the device on behalf of guest driver.
> >Can you answer the question of when do you use your code -
> >    at the start of migration or
> >    just before the end?
> 
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

OK, hmm - I can see that would work in some cases; but:
   a) It wouldn't work if the guest was paused, the management can pause it before
     starting migration or during migration - so you might need to hook the pause
     as well;  so that's a bit complicated.

   b) How long does qemu wait for the guest to respond, and what does it do if
      the guest doesn't respond ?  How do we recover?

   c) How much work does the guest need to do at this point?

   d) It would be great if we could find a more generic way of telling the guest
      it's about to migrate rather than via the PCI registers of one device; imagine
      what happens if you have a few different devices using SR-IOV, we'd have to tell
      them all with separate interrupts.   Perhaps we could use a virtio channel or
      an ACPI event or something?

> >>>> >It would be great if we could avoid changing the guest; but at least your guest
> >>>> >driver changes don't actually seem to be that hardware specific; could your
> >>>> >changes actually be moved to generic PCI level so they could be made
> >>>> >to work for lots of drivers?
> >>>
> >>>It is impossible to use one common solution for all devices unless the PCIE
> >>>spec documents it clearly and i think one day it will be there. But before
> >>>that, we need some workarounds on guest driver to make it work even it looks
> >>>ugly.
> 
> Yes, so far there is not hardware migration support and it's hard to modify
> bus level code. It also will block implementation on the Windows.

Well, there was agraf's trick, although that's a lot more complicated at the qemu
level, but it should work with no guest modifications.  Michael's point about
dirty page tracking is neat, I think that simplifies it a bit if it can track dirty
pages.

Dave

> >Dave
> >
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 16:23               ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 142+ messages in thread
From: Dr. David Alan Gilbert @ 2015-12-10 16:23 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Yang Zhang, emil.s.tantilov, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, blauwirbel, kraxel, mark.d.rustad,
	quintela, donald.c.skidmore, agraf, gerlitz.or, alex.williamson,
	anthony, cornelia.huck, ard.biesheuvel, eddie.dong, nrupal.jani,
	amit.shah, pbonzini

* Lan, Tianyu (tianyu.lan@intel.com) wrote:
> 
> 
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>hypervisor or qemu to handle the device on behalf of guest driver.
> >Can you answer the question of when do you use your code -
> >    at the start of migration or
> >    just before the end?
> 
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

OK, hmm - I can see that would work in some cases; but:
   a) It wouldn't work if the guest was paused, the management can pause it before
     starting migration or during migration - so you might need to hook the pause
     as well;  so that's a bit complicated.

   b) How long does qemu wait for the guest to respond, and what does it do if
      the guest doesn't respond ?  How do we recover?

   c) How much work does the guest need to do at this point?

   d) It would be great if we could find a more generic way of telling the guest
      it's about to migrate rather than via the PCI registers of one device; imagine
      what happens if you have a few different devices using SR-IOV, we'd have to tell
      them all with separate interrupts.   Perhaps we could use a virtio channel or
      an ACPI event or something?

> >>>> >It would be great if we could avoid changing the guest; but at least your guest
> >>>> >driver changes don't actually seem to be that hardware specific; could your
> >>>> >changes actually be moved to generic PCI level so they could be made
> >>>> >to work for lots of drivers?
> >>>
> >>>It is impossible to use one common solution for all devices unless the PCIE
> >>>spec documents it clearly and i think one day it will be there. But before
> >>>that, we need some workarounds on guest driver to make it work even it looks
> >>>ugly.
> 
> Yes, so far there is not hardware migration support and it's hard to modify
> bus level code. It also will block implementation on the Windows.

Well, there was agraf's trick, although that's a lot more complicated at the qemu
level, but it should work with no guest modifications.  Michael's point about
dirty page tracking is neat, I think that simplifies it a bit if it can track dirty
pages.

Dave

> >Dave
> >
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-10 14:38             ` [Qemu-devel] " Lan, Tianyu
@ 2015-12-10 17:16               ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-10 17:16 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Dr. David Alan Gilbert, Yang Zhang, Michael S. Tsirkin,
	qemu-devel, Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore,
	Donald C, quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf,
	Blue Swirl, cornelia.huck, Alex Williamson, kraxel,
	Anthony Liguori, amit.shah, Paolo Bonzini, Rustad, Mark D,
	lcapitulino, Or Gerlitz

On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>
>
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>
>>> Ideally, it is able to leave guest driver unmodified but it requires the
>>> >hypervisor or qemu to aware the device which means we may need a driver
>>> > in
>>> >hypervisor or qemu to handle the device on behalf of guest driver.
>>
>> Can you answer the question of when do you use your code -
>>     at the start of migration or
>>     just before the end?
>
>
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

The VF mailbox interrupt is a very bad idea.  Really the device should
be in a reset state on the other side of a migration.  It doesn't make
sense to have the interrupt firing if the device is not configured.
This is one of the things that is preventing you from being able to
migrate the device while the interface is administratively down or the
VF driver is not loaded.

My thought on all this is that it might make sense to move this
functionality into a PCI-to-PCI bridge device and make it a
requirement that all direct-assigned devices have to exist behind that
device in order to support migration.  That way you would be working
with a directly emulated device that would likely already be
supporting hot-plug anyway.  Then it would just be a matter of coming
up with a few Qemu specific extensions that you would need to add to
the device itself.  The same approach would likely be portable enough
that you could achieve it with PCIe as well via the same configuration
space being present on the upstream side of a PCIe port or maybe a
PCIe switch of some sort.

It would then be possible to signal via your vendor-specific PCI
capability on that device that all devices behind this bridge require
DMA page dirtying, you could use the configuration in addition to the
interrupt already provided for hot-plug to signal things like when you
are starting migration, and possibly even just extend the shpc
functionality so that if this capability is present you have the
option to pause/resume instead of remove/probe the device in the case
of certain hot-plug events.  The fact is there may be some use for a
pause/resume type approach for PCIe hot-plug in the near future
anyway.  From the sounds of it Apple has required it for all
Thunderbolt device drivers so that they can halt the device in order
to shuffle resources around, perhaps we should look at something
similar for Linux.

The other advantage behind grouping functions on one bridge is things
like reset domains.  The PCI error handling logic will want to be able
to reset any devices that experienced an error in the event of
something such as a surprise removal.  By grouping all of the devices
you could disable/reset/enable them as one logical group in the event
of something such as the "bad path" approach Michael has mentioned.

>>
>>>> > >It would be great if we could avoid changing the guest; but at least
>>>> > > your guest
>>>> > >driver changes don't actually seem to be that hardware specific;
>>>> > > could your
>>>> > >changes actually be moved to generic PCI level so they could be made
>>>> > >to work for lots of drivers?
>>>
>>> >
>>> >It is impossible to use one common solution for all devices unless the
>>> > PCIE
>>> >spec documents it clearly and i think one day it will be there. But
>>> > before
>>> >that, we need some workarounds on guest driver to make it work even it
>>> > looks
>>> >ugly.
>
>
> Yes, so far there is not hardware migration support and it's hard to modify
> bus level code. It also will block implementation on the Windows.

Please don't assume things.  Unless you have hard data from Microsoft
that says they want it this way lets just try to figure out what works
best for us for now and then we can start worrying about third party
implementations after we have figured out a solution that actually
works.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 17:16               ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-10 17:16 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Yang Zhang, Tantilov, Emil S, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini

On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>
>
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>
>>> Ideally, it is able to leave guest driver unmodified but it requires the
>>> >hypervisor or qemu to aware the device which means we may need a driver
>>> > in
>>> >hypervisor or qemu to handle the device on behalf of guest driver.
>>
>> Can you answer the question of when do you use your code -
>>     at the start of migration or
>>     just before the end?
>
>
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

The VF mailbox interrupt is a very bad idea.  Really the device should
be in a reset state on the other side of a migration.  It doesn't make
sense to have the interrupt firing if the device is not configured.
This is one of the things that is preventing you from being able to
migrate the device while the interface is administratively down or the
VF driver is not loaded.

My thought on all this is that it might make sense to move this
functionality into a PCI-to-PCI bridge device and make it a
requirement that all direct-assigned devices have to exist behind that
device in order to support migration.  That way you would be working
with a directly emulated device that would likely already be
supporting hot-plug anyway.  Then it would just be a matter of coming
up with a few Qemu specific extensions that you would need to add to
the device itself.  The same approach would likely be portable enough
that you could achieve it with PCIe as well via the same configuration
space being present on the upstream side of a PCIe port or maybe a
PCIe switch of some sort.

It would then be possible to signal via your vendor-specific PCI
capability on that device that all devices behind this bridge require
DMA page dirtying, you could use the configuration in addition to the
interrupt already provided for hot-plug to signal things like when you
are starting migration, and possibly even just extend the shpc
functionality so that if this capability is present you have the
option to pause/resume instead of remove/probe the device in the case
of certain hot-plug events.  The fact is there may be some use for a
pause/resume type approach for PCIe hot-plug in the near future
anyway.  From the sounds of it Apple has required it for all
Thunderbolt device drivers so that they can halt the device in order
to shuffle resources around, perhaps we should look at something
similar for Linux.

The other advantage behind grouping functions on one bridge is things
like reset domains.  The PCI error handling logic will want to be able
to reset any devices that experienced an error in the event of
something such as a surprise removal.  By grouping all of the devices
you could disable/reset/enable them as one logical group in the event
of something such as the "bad path" approach Michael has mentioned.

>>
>>>> > >It would be great if we could avoid changing the guest; but at least
>>>> > > your guest
>>>> > >driver changes don't actually seem to be that hardware specific;
>>>> > > could your
>>>> > >changes actually be moved to generic PCI level so they could be made
>>>> > >to work for lots of drivers?
>>>
>>> >
>>> >It is impossible to use one common solution for all devices unless the
>>> > PCIE
>>> >spec documents it clearly and i think one day it will be there. But
>>> > before
>>> >that, we need some workarounds on guest driver to make it work even it
>>> > looks
>>> >ugly.
>
>
> Yes, so far there is not hardware migration support and it's hard to modify
> bus level code. It also will block implementation on the Windows.

Please don't assume things.  Unless you have hard data from Microsoft
that says they want it this way lets just try to figure out what works
best for us for now and then we can start worrying about third party
implementations after we have figured out a solution that actually
works.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-10 16:11               ` Michael S. Tsirkin
@ 2015-12-10 19:17                 ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-10 19:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lan, Tianyu, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz

On Thu, Dec 10, 2015 at 8:11 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>> >>Ideally, it is able to leave guest driver unmodified but it requires the
>> >>>hypervisor or qemu to aware the device which means we may need a driver in
>> >>>hypervisor or qemu to handle the device on behalf of guest driver.
>> >Can you answer the question of when do you use your code -
>> >    at the start of migration or
>> >    just before the end?
>>
>> Just before stopping VCPU in this version and inject VF mailbox irq to
>> notify the driver if the irq handler is installed.
>> Qemu side also will check this via the faked PCI migration capability
>> and driver will set the status during device open() or resume() callback.
>
> Right, this is the "good path" optimization. Whether this buys anything
> as compared to just sending reset to the device when VCPU is stopped
> needs to be measured. In any case, we probably do need a way to
> interrupt driver on destination to make it reconfigure the device -
> otherwise it might take seconds for it to notice.  And a way to make
> sure driver can handle this surprise reset so we can block migration if
> it can't.

The question is how do we handle the "bad path"?  From what I can tell
it seems like we would have to have the dirty page tracking for DMA
handled in the host in order to support that.  Otherwise we risk
corrupting the memory in the guest as there are going to be a few
stale pages that end up being in the guest.

The easiest way to probably flag a "bad path" migration would be to
emulate a Manually-operated Retention Latch being opened and closed on
the device.  It may even allow us to work with the desire to support a
means for doing a pause/resume as that would be a hot-plug event where
the latch was never actually opened.  Basically if the retention latch
is released and then re-closed it can be assumed that the device has
lost power and as a result been reset.  As such a normal hot-plug
controller would have to reconfigure the device in such an event.  The
key bit being that with the power being cycled on the port the
assumption is that the device has lost any existing state, and we
should emulate that as well by clearing any state Qemu might be
carrying such as the shadow of the MSI-X table.  In addition we could
also signal if the host supports the dirty page tracking via the IOMMU
so if needed the guest could trigger some sort of memory exception
handling due to the risk of memory corruption.

I would argue that we don't necessarily have to provide a means to
guarantee the driver can support a surprise removal/reset.  Worst case
scenario is that it would be equivalent to somebody pulling the plug
on an externally connected PCIe cage in a physical host.  I know the
Intel Ethernet drivers have already had to add support for surprise
removal due to the fact that such a scenario can occur on Thunderbolt
enabled platforms.  Since it is acceptable for physical hosts to have
such an event occur I think we could support the same type of failure
for direct assigned devices in guests.  That would be the one spot
where I would say it is up to the drivers to figure out how they are
going to deal with it since this is something that can occur for any
given driver on any given OS assuming it can be plugged into an
externally removable cage.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-10 19:17                 ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-10 19:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel, lcapitulino,
	Blue Swirl, kraxel, Rustad, Mark D, quintela, Skidmore, Donald C,
	Alexander Graf, Or Gerlitz, Dr. David Alan Gilbert,
	Alex Williamson, Anthony Liguori, cornelia.huck, Lan, Tianyu,
	Ard Biesheuvel, Dong, Eddie, Jani, Nrupal, amit.shah,
	Paolo Bonzini

On Thu, Dec 10, 2015 at 8:11 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>> >>Ideally, it is able to leave guest driver unmodified but it requires the
>> >>>hypervisor or qemu to aware the device which means we may need a driver in
>> >>>hypervisor or qemu to handle the device on behalf of guest driver.
>> >Can you answer the question of when do you use your code -
>> >    at the start of migration or
>> >    just before the end?
>>
>> Just before stopping VCPU in this version and inject VF mailbox irq to
>> notify the driver if the irq handler is installed.
>> Qemu side also will check this via the faked PCI migration capability
>> and driver will set the status during device open() or resume() callback.
>
> Right, this is the "good path" optimization. Whether this buys anything
> as compared to just sending reset to the device when VCPU is stopped
> needs to be measured. In any case, we probably do need a way to
> interrupt driver on destination to make it reconfigure the device -
> otherwise it might take seconds for it to notice.  And a way to make
> sure driver can handle this surprise reset so we can block migration if
> it can't.

The question is how do we handle the "bad path"?  From what I can tell
it seems like we would have to have the dirty page tracking for DMA
handled in the host in order to support that.  Otherwise we risk
corrupting the memory in the guest as there are going to be a few
stale pages that end up being in the guest.

The easiest way to probably flag a "bad path" migration would be to
emulate a Manually-operated Retention Latch being opened and closed on
the device.  It may even allow us to work with the desire to support a
means for doing a pause/resume as that would be a hot-plug event where
the latch was never actually opened.  Basically if the retention latch
is released and then re-closed it can be assumed that the device has
lost power and as a result been reset.  As such a normal hot-plug
controller would have to reconfigure the device in such an event.  The
key bit being that with the power being cycled on the port the
assumption is that the device has lost any existing state, and we
should emulate that as well by clearing any state Qemu might be
carrying such as the shadow of the MSI-X table.  In addition we could
also signal if the host supports the dirty page tracking via the IOMMU
so if needed the guest could trigger some sort of memory exception
handling due to the risk of memory corruption.

I would argue that we don't necessarily have to provide a means to
guarantee the driver can support a surprise removal/reset.  Worst case
scenario is that it would be equivalent to somebody pulling the plug
on an externally connected PCIe cage in a physical host.  I know the
Intel Ethernet drivers have already had to add support for surprise
removal due to the fact that such a scenario can occur on Thunderbolt
enabled platforms.  Since it is acceptable for physical hosts to have
such an event occur I think we could support the same type of failure
for direct assigned devices in guests.  That would be the one spot
where I would say it is up to the drivers to figure out how they are
going to deal with it since this is something that can occur for any
given driver on any given OS assuming it can be plugged into an
externally removable cage.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-10 16:11               ` Michael S. Tsirkin
@ 2015-12-11  7:32                 ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-11  7:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Dr. David Alan Gilbert, Yang Zhang, qemu-devel, emil.s.tantilov,
	kvm, ard.biesheuvel, aik, donald.c.skidmore, quintela,
	eddie.dong, nrupal.jani, agraf, blauwirbel, cornelia.huck,
	alex.williamson, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or



On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>> Ideally, it is able to leave guest driver unmodified but it requires the
>>>>> hypervisor or qemu to aware the device which means we may need a driver in
>>>>> hypervisor or qemu to handle the device on behalf of guest driver.
>>> Can you answer the question of when do you use your code -
>>>     at the start of migration or
>>>     just before the end?
>>
>> Just before stopping VCPU in this version and inject VF mailbox irq to
>> notify the driver if the irq handler is installed.
>> Qemu side also will check this via the faked PCI migration capability
>> and driver will set the status during device open() or resume() callback.
>
> Right, this is the "good path" optimization. Whether this buys anything
> as compared to just sending reset to the device when VCPU is stopped
> needs to be measured. In any case, we probably do need a way to
> interrupt driver on destination to make it reconfigure the device -
> otherwise it might take seconds for it to notice.  And a way to make
> sure driver can handle this surprise reset so we can block migration if
> it can't.
>

Yes, we need such a way to notify driver about migration status and do
reset or restore operation on the destination machine. My original
design is to take advantage of device's irq to do that. Driver can tell
Qemu that which irq it prefers to handle such task and whether the irq
is enabled or bound with handler. We may discuss the detail in the other
thread.

>>>
>>>>>>> It would be great if we could avoid changing the guest; but at least your guest
>>>>>>> driver changes don't actually seem to be that hardware specific; could your
>>>>>>> changes actually be moved to generic PCI level so they could be made
>>>>>>> to work for lots of drivers?
>>>>>
>>>>> It is impossible to use one common solution for all devices unless the PCIE
>>>>> spec documents it clearly and i think one day it will be there. But before
>>>>> that, we need some workarounds on guest driver to make it work even it looks
>>>>> ugly.
>>
>> Yes, so far there is not hardware migration support
>
> VT-D supports setting dirty bit in the PTE in hardware.

Actually, this doesn't support in the current hardware.
VTD spec documents the dirty bit for first level translation which
requires devices to support DMA request with PASID(process
address space identifier). Most device don't support the feature.

>
>> and it's hard to modify
>> bus level code.
>
> Why is it hard?

As Yang said, the concern is that PCI Spec doesn't document about how to 
do migration.

>
>> It also will block implementation on the Windows.
>
> Implementation of what?  We are discussing motivation here, not
> implementation.  E.g. windows drivers typically support surprise
> removal, should you use that, you get some working code for free.  Just
> stop worrying about it.  Make it work, worry about closed source
> software later.
>
>>> Dave
>>>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-11  7:32                 ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-11  7:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yang Zhang, emil.s.tantilov, kvm, aik, qemu-devel, lcapitulino,
	blauwirbel, kraxel, mark.d.rustad, quintela, donald.c.skidmore,
	agraf, gerlitz.or, Dr. David Alan Gilbert, alex.williamson,
	anthony, cornelia.huck, ard.biesheuvel, eddie.dong, nrupal.jani,
	amit.shah, pbonzini



On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>> Ideally, it is able to leave guest driver unmodified but it requires the
>>>>> hypervisor or qemu to aware the device which means we may need a driver in
>>>>> hypervisor or qemu to handle the device on behalf of guest driver.
>>> Can you answer the question of when do you use your code -
>>>     at the start of migration or
>>>     just before the end?
>>
>> Just before stopping VCPU in this version and inject VF mailbox irq to
>> notify the driver if the irq handler is installed.
>> Qemu side also will check this via the faked PCI migration capability
>> and driver will set the status during device open() or resume() callback.
>
> Right, this is the "good path" optimization. Whether this buys anything
> as compared to just sending reset to the device when VCPU is stopped
> needs to be measured. In any case, we probably do need a way to
> interrupt driver on destination to make it reconfigure the device -
> otherwise it might take seconds for it to notice.  And a way to make
> sure driver can handle this surprise reset so we can block migration if
> it can't.
>

Yes, we need such a way to notify driver about migration status and do
reset or restore operation on the destination machine. My original
design is to take advantage of device's irq to do that. Driver can tell
Qemu that which irq it prefers to handle such task and whether the irq
is enabled or bound with handler. We may discuss the detail in the other
thread.

>>>
>>>>>>> It would be great if we could avoid changing the guest; but at least your guest
>>>>>>> driver changes don't actually seem to be that hardware specific; could your
>>>>>>> changes actually be moved to generic PCI level so they could be made
>>>>>>> to work for lots of drivers?
>>>>>
>>>>> It is impossible to use one common solution for all devices unless the PCIE
>>>>> spec documents it clearly and i think one day it will be there. But before
>>>>> that, we need some workarounds on guest driver to make it work even it looks
>>>>> ugly.
>>
>> Yes, so far there is not hardware migration support
>
> VT-D supports setting dirty bit in the PTE in hardware.

Actually, this doesn't support in the current hardware.
VTD spec documents the dirty bit for first level translation which
requires devices to support DMA request with PASID(process
address space identifier). Most device don't support the feature.

>
>> and it's hard to modify
>> bus level code.
>
> Why is it hard?

As Yang said, the concern is that PCI Spec doesn't document about how to 
do migration.

>
>> It also will block implementation on the Windows.
>
> Implementation of what?  We are discussing motivation here, not
> implementation.  E.g. windows drivers typically support surprise
> removal, should you use that, you get some working code for free.  Just
> stop worrying about it.  Make it work, worry about closed source
> software later.
>
>>> Dave
>>>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-10 17:16               ` Alexander Duyck
@ 2015-12-13 15:47                 ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-13 15:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Dr. David Alan Gilbert, Yang Zhang, Michael S. Tsirkin,
	qemu-devel, Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore,
	Donald C, quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf,
	Blue Swirl, cornelia.huck, Alex Williamson, kraxel,
	Anthony Liguori, amit.shah, Paolo Bonzini, Rustad, Mark D,
	lcapitulino, Or Gerlitz



On 12/11/2015 1:16 AM, Alexander Duyck wrote:
> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>>
>>
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>>
>>>> Ideally, it is able to leave guest driver unmodified but it requires the
>>>>> hypervisor or qemu to aware the device which means we may need a driver
>>>>> in
>>>>> hypervisor or qemu to handle the device on behalf of guest driver.
>>>
>>> Can you answer the question of when do you use your code -
>>>      at the start of migration or
>>>      just before the end?
>>
>>
>> Just before stopping VCPU in this version and inject VF mailbox irq to
>> notify the driver if the irq handler is installed.
>> Qemu side also will check this via the faked PCI migration capability
>> and driver will set the status during device open() or resume() callback.
>
> The VF mailbox interrupt is a very bad idea.  Really the device should
> be in a reset state on the other side of a migration.  It doesn't make
> sense to have the interrupt firing if the device is not configured.
> This is one of the things that is preventing you from being able to
> migrate the device while the interface is administratively down or the
> VF driver is not loaded.

 From my opinion, if VF driver is not loaded and hardware doesn't start
to work, the device state doesn't need to be migrated.

We may add a flag for driver to check whether migration happened during 
it's down and reinitialize the hardware and clear the flag when system 
try to put it up.

We may add migration core in the Linux kernel and provide some helps 
functions to facilitate to add migration support for drivers.
Migration core is in charge to sync status with Qemu.

Example.
migration_register()
Driver provides
- Callbacks to be called before and after migration or for bad path
- Its irq which it prefers to deal with migration event.

migration_event_check()
Driver calls it in the irq handler. Migration core code will check
migration status and call its callbacks when migration happens.


>
> My thought on all this is that it might make sense to move this
> functionality into a PCI-to-PCI bridge device and make it a
> requirement that all direct-assigned devices have to exist behind that
> device in order to support migration.  That way you would be working
> with a directly emulated device that would likely already be
> supporting hot-plug anyway.  Then it would just be a matter of coming
> up with a few Qemu specific extensions that you would need to add to
> the device itself.  The same approach would likely be portable enough
> that you could achieve it with PCIe as well via the same configuration
> space being present on the upstream side of a PCIe port or maybe a
> PCIe switch of some sort.
>
> It would then be possible to signal via your vendor-specific PCI
> capability on that device that all devices behind this bridge require
> DMA page dirtying, you could use the configuration in addition to the
> interrupt already provided for hot-plug to signal things like when you
> are starting migration, and possibly even just extend the shpc
> functionality so that if this capability is present you have the
> option to pause/resume instead of remove/probe the device in the case
> of certain hot-plug events.  The fact is there may be some use for a
> pause/resume type approach for PCIe hot-plug in the near future
> anyway.  From the sounds of it Apple has required it for all
> Thunderbolt device drivers so that they can halt the device in order
> to shuffle resources around, perhaps we should look at something
> similar for Linux.
>
> The other advantage behind grouping functions on one bridge is things
> like reset domains.  The PCI error handling logic will want to be able
> to reset any devices that experienced an error in the event of
> something such as a surprise removal.  By grouping all of the devices
> you could disable/reset/enable them as one logical group in the event
> of something such as the "bad path" approach Michael has mentioned.
>

These sounds we need to add a faked bridge for migration and adding a
driver in the guest for it. It also needs to extend PCI bus/hotplug
driver to do pause/resume other devices, right?

My concern is still that whether we can change PCI bus/hotplug like that
without spec change.

IRQ should be general for any devices and we may extend it for
migration. Device driver also can make decision to support migration
or not.



>>>
>>>>>>> It would be great if we could avoid changing the guest; but at least
>>>>>>> your guest
>>>>>>> driver changes don't actually seem to be that hardware specific;
>>>>>>> could your
>>>>>>> changes actually be moved to generic PCI level so they could be made
>>>>>>> to work for lots of drivers?
>>>>
>>>>>
>>>>> It is impossible to use one common solution for all devices unless the
>>>>> PCIE
>>>>> spec documents it clearly and i think one day it will be there. But
>>>>> before
>>>>> that, we need some workarounds on guest driver to make it work even it
>>>>> looks
>>>>> ugly.
>>
>>
>> Yes, so far there is not hardware migration support and it's hard to modify
>> bus level code. It also will block implementation on the Windows.
>
> Please don't assume things.  Unless you have hard data from Microsoft
> that says they want it this way lets just try to figure out what works
> best for us for now and then we can start worrying about third party
> implementations after we have figured out a solution that actually
> works.
>
> - Alex
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-13 15:47                 ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-13 15:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Zhang, Tantilov, Emil S, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini



On 12/11/2015 1:16 AM, Alexander Duyck wrote:
> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>>
>>
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>>
>>>> Ideally, it is able to leave guest driver unmodified but it requires the
>>>>> hypervisor or qemu to aware the device which means we may need a driver
>>>>> in
>>>>> hypervisor or qemu to handle the device on behalf of guest driver.
>>>
>>> Can you answer the question of when do you use your code -
>>>      at the start of migration or
>>>      just before the end?
>>
>>
>> Just before stopping VCPU in this version and inject VF mailbox irq to
>> notify the driver if the irq handler is installed.
>> Qemu side also will check this via the faked PCI migration capability
>> and driver will set the status during device open() or resume() callback.
>
> The VF mailbox interrupt is a very bad idea.  Really the device should
> be in a reset state on the other side of a migration.  It doesn't make
> sense to have the interrupt firing if the device is not configured.
> This is one of the things that is preventing you from being able to
> migrate the device while the interface is administratively down or the
> VF driver is not loaded.

 From my opinion, if VF driver is not loaded and hardware doesn't start
to work, the device state doesn't need to be migrated.

We may add a flag for driver to check whether migration happened during 
it's down and reinitialize the hardware and clear the flag when system 
try to put it up.

We may add migration core in the Linux kernel and provide some helps 
functions to facilitate to add migration support for drivers.
Migration core is in charge to sync status with Qemu.

Example.
migration_register()
Driver provides
- Callbacks to be called before and after migration or for bad path
- Its irq which it prefers to deal with migration event.

migration_event_check()
Driver calls it in the irq handler. Migration core code will check
migration status and call its callbacks when migration happens.


>
> My thought on all this is that it might make sense to move this
> functionality into a PCI-to-PCI bridge device and make it a
> requirement that all direct-assigned devices have to exist behind that
> device in order to support migration.  That way you would be working
> with a directly emulated device that would likely already be
> supporting hot-plug anyway.  Then it would just be a matter of coming
> up with a few Qemu specific extensions that you would need to add to
> the device itself.  The same approach would likely be portable enough
> that you could achieve it with PCIe as well via the same configuration
> space being present on the upstream side of a PCIe port or maybe a
> PCIe switch of some sort.
>
> It would then be possible to signal via your vendor-specific PCI
> capability on that device that all devices behind this bridge require
> DMA page dirtying, you could use the configuration in addition to the
> interrupt already provided for hot-plug to signal things like when you
> are starting migration, and possibly even just extend the shpc
> functionality so that if this capability is present you have the
> option to pause/resume instead of remove/probe the device in the case
> of certain hot-plug events.  The fact is there may be some use for a
> pause/resume type approach for PCIe hot-plug in the near future
> anyway.  From the sounds of it Apple has required it for all
> Thunderbolt device drivers so that they can halt the device in order
> to shuffle resources around, perhaps we should look at something
> similar for Linux.
>
> The other advantage behind grouping functions on one bridge is things
> like reset domains.  The PCI error handling logic will want to be able
> to reset any devices that experienced an error in the event of
> something such as a surprise removal.  By grouping all of the devices
> you could disable/reset/enable them as one logical group in the event
> of something such as the "bad path" approach Michael has mentioned.
>

These sounds we need to add a faked bridge for migration and adding a
driver in the guest for it. It also needs to extend PCI bus/hotplug
driver to do pause/resume other devices, right?

My concern is still that whether we can change PCI bus/hotplug like that
without spec change.

IRQ should be general for any devices and we may extend it for
migration. Device driver also can make decision to support migration
or not.



>>>
>>>>>>> It would be great if we could avoid changing the guest; but at least
>>>>>>> your guest
>>>>>>> driver changes don't actually seem to be that hardware specific;
>>>>>>> could your
>>>>>>> changes actually be moved to generic PCI level so they could be made
>>>>>>> to work for lots of drivers?
>>>>
>>>>>
>>>>> It is impossible to use one common solution for all devices unless the
>>>>> PCIE
>>>>> spec documents it clearly and i think one day it will be there. But
>>>>> before
>>>>> that, we need some workarounds on guest driver to make it work even it
>>>>> looks
>>>>> ugly.
>>
>>
>> Yes, so far there is not hardware migration support and it's hard to modify
>> bus level code. It also will block implementation on the Windows.
>
> Please don't assume things.  Unless you have hard data from Microsoft
> that says they want it this way lets just try to figure out what works
> best for us for now and then we can start worrying about third party
> implementations after we have figured out a solution that actually
> works.
>
> - Alex
>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-13 15:47                 ` Lan, Tianyu
@ 2015-12-13 19:30                   ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-13 19:30 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Dr. David Alan Gilbert, Yang Zhang, Michael S. Tsirkin,
	qemu-devel, Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore,
	Donald C, quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf,
	Blue Swirl, cornelia.huck, Alex Williamson, kraxel,
	Anthony Liguori, amit.shah, Paolo Bonzini, Rustad, Mark D,
	lcapitulino, Or Gerlitz

On Sun, Dec 13, 2015 at 7:47 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>
>
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
>>
>> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>>>
>>>
>>>
>>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>>>
>>>>>
>>>>> Ideally, it is able to leave guest driver unmodified but it requires
>>>>> the
>>>>>>
>>>>>> hypervisor or qemu to aware the device which means we may need a
>>>>>> driver
>>>>>> in
>>>>>> hypervisor or qemu to handle the device on behalf of guest driver.
>>>>
>>>>
>>>> Can you answer the question of when do you use your code -
>>>>      at the start of migration or
>>>>      just before the end?
>>>
>>>
>>>
>>> Just before stopping VCPU in this version and inject VF mailbox irq to
>>> notify the driver if the irq handler is installed.
>>> Qemu side also will check this via the faked PCI migration capability
>>> and driver will set the status during device open() or resume() callback.
>>
>>
>> The VF mailbox interrupt is a very bad idea.  Really the device should
>> be in a reset state on the other side of a migration.  It doesn't make
>> sense to have the interrupt firing if the device is not configured.
>> This is one of the things that is preventing you from being able to
>> migrate the device while the interface is administratively down or the
>> VF driver is not loaded.
>
>
> From my opinion, if VF driver is not loaded and hardware doesn't start
> to work, the device state doesn't need to be migrated.
>
> We may add a flag for driver to check whether migration happened during it's
> down and reinitialize the hardware and clear the flag when system try to put
> it up.
>
> We may add migration core in the Linux kernel and provide some helps
> functions to facilitate to add migration support for drivers.
> Migration core is in charge to sync status with Qemu.
>
> Example.
> migration_register()
> Driver provides
> - Callbacks to be called before and after migration or for bad path
> - Its irq which it prefers to deal with migration event.

You would be better off just using function pointers in the pci_driver
struct and let the PCI driver registration take care of all that.

> migration_event_check()
> Driver calls it in the irq handler. Migration core code will check
> migration status and call its callbacks when migration happens.

No, this is still a bad idea.  You haven't addressed what you do when
the device has had interrupts disabled such as being in the down
state.

This is the biggest issue I see with your whole patch set.  It
requires the driver containing certain changes and being in a certain
state.  You cannot put those expectations on the guest.  You really
need to try and move as much of this out to existing functionality as
possible.

>>
>> My thought on all this is that it might make sense to move this
>> functionality into a PCI-to-PCI bridge device and make it a
>> requirement that all direct-assigned devices have to exist behind that
>> device in order to support migration.  That way you would be working
>> with a directly emulated device that would likely already be
>> supporting hot-plug anyway.  Then it would just be a matter of coming
>> up with a few Qemu specific extensions that you would need to add to
>> the device itself.  The same approach would likely be portable enough
>> that you could achieve it with PCIe as well via the same configuration
>> space being present on the upstream side of a PCIe port or maybe a
>> PCIe switch of some sort.
>>
>> It would then be possible to signal via your vendor-specific PCI
>> capability on that device that all devices behind this bridge require
>> DMA page dirtying, you could use the configuration in addition to the
>> interrupt already provided for hot-plug to signal things like when you
>> are starting migration, and possibly even just extend the shpc
>> functionality so that if this capability is present you have the
>> option to pause/resume instead of remove/probe the device in the case
>> of certain hot-plug events.  The fact is there may be some use for a
>> pause/resume type approach for PCIe hot-plug in the near future
>> anyway.  From the sounds of it Apple has required it for all
>> Thunderbolt device drivers so that they can halt the device in order
>> to shuffle resources around, perhaps we should look at something
>> similar for Linux.
>>
>> The other advantage behind grouping functions on one bridge is things
>> like reset domains.  The PCI error handling logic will want to be able
>> to reset any devices that experienced an error in the event of
>> something such as a surprise removal.  By grouping all of the devices
>> you could disable/reset/enable them as one logical group in the event
>> of something such as the "bad path" approach Michael has mentioned.
>>
>
> These sounds we need to add a faked bridge for migration and adding a
> driver in the guest for it. It also needs to extend PCI bus/hotplug
> driver to do pause/resume other devices, right?
>
> My concern is still that whether we can change PCI bus/hotplug like that
> without spec change.
>
> IRQ should be general for any devices and we may extend it for
> migration. Device driver also can make decision to support migration
> or not.

The device should have no say in the matter.  Either we are going to
migrate or we will not.  This is why I have suggested my approach as
it allows for the least amount of driver intrusion while providing the
maximum number of ways to still perform migration even if the device
doesn't support it.

The solution I have proposed is simple:

1.  Extend swiotlb to allow for a page dirtying functionality.

     This part is pretty straight forward.  I'll submit a few patches
later today as RFC that can provided the minimal functionality needed
for this.

2.  Provide a vendor specific configuration space option on the QEMU
implementation of a PCI bridge to act as a bridge between direct
assigned devices and the host bridge.

     My thought was to add some vendor specific block that includes a
capabilities, status, and control register so you could go through and
synchronize things like the DMA page dirtying feature.  The bridge
itself could manage the migration capable bit inside QEMU for all
devices assigned to it.  So if you added a VF to the bridge it would
flag that you can support migration in QEMU, while the bridge would
indicate you cannot until the DMA page dirtying control bit is set by
the guest.

     We could also go through and optimize the DMA page dirtying after
this is added so that we can narrow down the scope of use, and as a
result improve the performance for other devices that don't need to
support migration.  It would then be a matter of adding an interrupt
in the device to handle an event such as the DMA page dirtying status
bit being set in the config space status register, while the bit is
not set in the control register.  If it doesn't get set then we would
have to evict the devices before the warm-up phase of the migration,
otherwise we can defer it until the end of the warm-up phase.

3.  Extend existing shpc driver to support the optional "pause"
functionality as called out in section 4.1.2 of the Revision 1.1 PCI
hot-plug specification.

     Note I call out "extend" here instead of saying to add this.
Basically what we should do is provide a means of quiescing the device
without unloading the driver.  This is called out as something the OS
vendor can optionally implement in the PCI hot-plug specification.  On
OSes that wouldn't support this it would just be treated as a standard
hot-plug event.   We could add a capability, status, and control bit
in the vendor specific configuration block for this as well and if we
set the status bit would indicate the host wants to pause instead of
remove and the control bit would indicate the guest supports "pause"
in the OS.  We then could optionally disable guest migration while the
VF is present and pause is not supported.

     To support this we would need to add a timer and if a new device
is not inserted in some period of time (60 seconds for example), or if
a different device is inserted, we need to unload the original driver
from the device.  In addition we would need to verify if drivers can
call the remove function after having called suspend without resume.
If not, we could look at adding a recovery function to remove the
driver from the device in the case of a suspend with either a failed
resume or no resume call.  Once again it would probably be useful to
have for those cases where power management suspend/resume runs into
an issue like somebody causing a surprise removal while a device was
suspended.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-13 19:30                   ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-13 19:30 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Yang Zhang, Tantilov, Emil S, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini

On Sun, Dec 13, 2015 at 7:47 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>
>
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
>>
>> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
>>>
>>>
>>>
>>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>>>
>>>>>
>>>>> Ideally, it is able to leave guest driver unmodified but it requires
>>>>> the
>>>>>>
>>>>>> hypervisor or qemu to aware the device which means we may need a
>>>>>> driver
>>>>>> in
>>>>>> hypervisor or qemu to handle the device on behalf of guest driver.
>>>>
>>>>
>>>> Can you answer the question of when do you use your code -
>>>>      at the start of migration or
>>>>      just before the end?
>>>
>>>
>>>
>>> Just before stopping VCPU in this version and inject VF mailbox irq to
>>> notify the driver if the irq handler is installed.
>>> Qemu side also will check this via the faked PCI migration capability
>>> and driver will set the status during device open() or resume() callback.
>>
>>
>> The VF mailbox interrupt is a very bad idea.  Really the device should
>> be in a reset state on the other side of a migration.  It doesn't make
>> sense to have the interrupt firing if the device is not configured.
>> This is one of the things that is preventing you from being able to
>> migrate the device while the interface is administratively down or the
>> VF driver is not loaded.
>
>
> From my opinion, if VF driver is not loaded and hardware doesn't start
> to work, the device state doesn't need to be migrated.
>
> We may add a flag for driver to check whether migration happened during it's
> down and reinitialize the hardware and clear the flag when system try to put
> it up.
>
> We may add migration core in the Linux kernel and provide some helps
> functions to facilitate to add migration support for drivers.
> Migration core is in charge to sync status with Qemu.
>
> Example.
> migration_register()
> Driver provides
> - Callbacks to be called before and after migration or for bad path
> - Its irq which it prefers to deal with migration event.

You would be better off just using function pointers in the pci_driver
struct and let the PCI driver registration take care of all that.

> migration_event_check()
> Driver calls it in the irq handler. Migration core code will check
> migration status and call its callbacks when migration happens.

No, this is still a bad idea.  You haven't addressed what you do when
the device has had interrupts disabled such as being in the down
state.

This is the biggest issue I see with your whole patch set.  It
requires the driver containing certain changes and being in a certain
state.  You cannot put those expectations on the guest.  You really
need to try and move as much of this out to existing functionality as
possible.

>>
>> My thought on all this is that it might make sense to move this
>> functionality into a PCI-to-PCI bridge device and make it a
>> requirement that all direct-assigned devices have to exist behind that
>> device in order to support migration.  That way you would be working
>> with a directly emulated device that would likely already be
>> supporting hot-plug anyway.  Then it would just be a matter of coming
>> up with a few Qemu specific extensions that you would need to add to
>> the device itself.  The same approach would likely be portable enough
>> that you could achieve it with PCIe as well via the same configuration
>> space being present on the upstream side of a PCIe port or maybe a
>> PCIe switch of some sort.
>>
>> It would then be possible to signal via your vendor-specific PCI
>> capability on that device that all devices behind this bridge require
>> DMA page dirtying, you could use the configuration in addition to the
>> interrupt already provided for hot-plug to signal things like when you
>> are starting migration, and possibly even just extend the shpc
>> functionality so that if this capability is present you have the
>> option to pause/resume instead of remove/probe the device in the case
>> of certain hot-plug events.  The fact is there may be some use for a
>> pause/resume type approach for PCIe hot-plug in the near future
>> anyway.  From the sounds of it Apple has required it for all
>> Thunderbolt device drivers so that they can halt the device in order
>> to shuffle resources around, perhaps we should look at something
>> similar for Linux.
>>
>> The other advantage behind grouping functions on one bridge is things
>> like reset domains.  The PCI error handling logic will want to be able
>> to reset any devices that experienced an error in the event of
>> something such as a surprise removal.  By grouping all of the devices
>> you could disable/reset/enable them as one logical group in the event
>> of something such as the "bad path" approach Michael has mentioned.
>>
>
> These sounds we need to add a faked bridge for migration and adding a
> driver in the guest for it. It also needs to extend PCI bus/hotplug
> driver to do pause/resume other devices, right?
>
> My concern is still that whether we can change PCI bus/hotplug like that
> without spec change.
>
> IRQ should be general for any devices and we may extend it for
> migration. Device driver also can make decision to support migration
> or not.

The device should have no say in the matter.  Either we are going to
migrate or we will not.  This is why I have suggested my approach as
it allows for the least amount of driver intrusion while providing the
maximum number of ways to still perform migration even if the device
doesn't support it.

The solution I have proposed is simple:

1.  Extend swiotlb to allow for a page dirtying functionality.

     This part is pretty straight forward.  I'll submit a few patches
later today as RFC that can provided the minimal functionality needed
for this.

2.  Provide a vendor specific configuration space option on the QEMU
implementation of a PCI bridge to act as a bridge between direct
assigned devices and the host bridge.

     My thought was to add some vendor specific block that includes a
capabilities, status, and control register so you could go through and
synchronize things like the DMA page dirtying feature.  The bridge
itself could manage the migration capable bit inside QEMU for all
devices assigned to it.  So if you added a VF to the bridge it would
flag that you can support migration in QEMU, while the bridge would
indicate you cannot until the DMA page dirtying control bit is set by
the guest.

     We could also go through and optimize the DMA page dirtying after
this is added so that we can narrow down the scope of use, and as a
result improve the performance for other devices that don't need to
support migration.  It would then be a matter of adding an interrupt
in the device to handle an event such as the DMA page dirtying status
bit being set in the config space status register, while the bit is
not set in the control register.  If it doesn't get set then we would
have to evict the devices before the warm-up phase of the migration,
otherwise we can defer it until the end of the warm-up phase.

3.  Extend existing shpc driver to support the optional "pause"
functionality as called out in section 4.1.2 of the Revision 1.1 PCI
hot-plug specification.

     Note I call out "extend" here instead of saying to add this.
Basically what we should do is provide a means of quiescing the device
without unloading the driver.  This is called out as something the OS
vendor can optionally implement in the PCI hot-plug specification.  On
OSes that wouldn't support this it would just be treated as a standard
hot-plug event.   We could add a capability, status, and control bit
in the vendor specific configuration block for this as well and if we
set the status bit would indicate the host wants to pause instead of
remove and the control bit would indicate the guest supports "pause"
in the OS.  We then could optionally disable guest migration while the
VF is present and pause is not supported.

     To support this we would need to add a timer and if a new device
is not inserted in some period of time (60 seconds for example), or if
a different device is inserted, we need to unload the original driver
from the device.  In addition we would need to verify if drivers can
call the remove function after having called suspend without resume.
If not, we could look at adding a recovery function to remove the
driver from the device in the case of a suspend with either a failed
resume or no resume call.  Once again it would probably be useful to
have for those cases where power management suspend/resume runs into
an issue like somebody causing a surprise removal while a device was
suspended.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-11  7:32                 ` Lan, Tianyu
@ 2015-12-14  9:12                   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-14  9:12 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Dr. David Alan Gilbert, Yang Zhang, qemu-devel, emil.s.tantilov,
	kvm, ard.biesheuvel, aik, donald.c.skidmore, quintela,
	eddie.dong, nrupal.jani, agraf, blauwirbel, cornelia.huck,
	alex.williamson, kraxel, anthony, amit.shah, pbonzini,
	mark.d.rustad, lcapitulino, gerlitz.or

On Fri, Dec 11, 2015 at 03:32:04PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:
> >On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>>>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>>>hypervisor or qemu to handle the device on behalf of guest driver.
> >>>Can you answer the question of when do you use your code -
> >>>    at the start of migration or
> >>>    just before the end?
> >>
> >>Just before stopping VCPU in this version and inject VF mailbox irq to
> >>notify the driver if the irq handler is installed.
> >>Qemu side also will check this via the faked PCI migration capability
> >>and driver will set the status during device open() or resume() callback.
> >
> >Right, this is the "good path" optimization. Whether this buys anything
> >as compared to just sending reset to the device when VCPU is stopped
> >needs to be measured. In any case, we probably do need a way to
> >interrupt driver on destination to make it reconfigure the device -
> >otherwise it might take seconds for it to notice.  And a way to make
> >sure driver can handle this surprise reset so we can block migration if
> >it can't.
> >
> 
> Yes, we need such a way to notify driver about migration status and do
> reset or restore operation on the destination machine. My original
> design is to take advantage of device's irq to do that. Driver can tell
> Qemu that which irq it prefers to handle such task and whether the irq
> is enabled or bound with handler. We may discuss the detail in the other
> thread.
> 
> >>>
> >>>>>>>It would be great if we could avoid changing the guest; but at least your guest
> >>>>>>>driver changes don't actually seem to be that hardware specific; could your
> >>>>>>>changes actually be moved to generic PCI level so they could be made
> >>>>>>>to work for lots of drivers?
> >>>>>
> >>>>>It is impossible to use one common solution for all devices unless the PCIE
> >>>>>spec documents it clearly and i think one day it will be there. But before
> >>>>>that, we need some workarounds on guest driver to make it work even it looks
> >>>>>ugly.
> >>
> >>Yes, so far there is not hardware migration support
> >
> >VT-D supports setting dirty bit in the PTE in hardware.
> 
> Actually, this doesn't support in the current hardware.
> VTD spec documents the dirty bit for first level translation which
> requires devices to support DMA request with PASID(process
> address space identifier). Most device don't support the feature.

True, I missed this.  It's generally unfortunate that first level
translation only applies to requests with PASID.  All other features
limited to requests with PASID like nested translation would be very
useful for all requests, not just requests with PASID.


> >
> >>and it's hard to modify
> >>bus level code.
> >
> >Why is it hard?
> 
> As Yang said, the concern is that PCI Spec doesn't document about how to do
> migration.

We can submit a PCI spec ECN documenting a new capability.

I think for existing devices which lack it, adding this capability to
the bridge to which the device is attached is preferable to trying to
add it to the device itself.

> >
> >>It also will block implementation on the Windows.
> >
> >Implementation of what?  We are discussing motivation here, not
> >implementation.  E.g. windows drivers typically support surprise
> >removal, should you use that, you get some working code for free.  Just
> >stop worrying about it.  Make it work, worry about closed source
> >software later.
> >
> >>>Dave
> >>>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-14  9:12                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-14  9:12 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Yang Zhang, emil.s.tantilov, kvm, aik, qemu-devel, lcapitulino,
	blauwirbel, kraxel, mark.d.rustad, quintela, donald.c.skidmore,
	agraf, gerlitz.or, Dr. David Alan Gilbert, alex.williamson,
	anthony, cornelia.huck, ard.biesheuvel, eddie.dong, nrupal.jani,
	amit.shah, pbonzini

On Fri, Dec 11, 2015 at 03:32:04PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:
> >On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>>>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>>>hypervisor or qemu to handle the device on behalf of guest driver.
> >>>Can you answer the question of when do you use your code -
> >>>    at the start of migration or
> >>>    just before the end?
> >>
> >>Just before stopping VCPU in this version and inject VF mailbox irq to
> >>notify the driver if the irq handler is installed.
> >>Qemu side also will check this via the faked PCI migration capability
> >>and driver will set the status during device open() or resume() callback.
> >
> >Right, this is the "good path" optimization. Whether this buys anything
> >as compared to just sending reset to the device when VCPU is stopped
> >needs to be measured. In any case, we probably do need a way to
> >interrupt driver on destination to make it reconfigure the device -
> >otherwise it might take seconds for it to notice.  And a way to make
> >sure driver can handle this surprise reset so we can block migration if
> >it can't.
> >
> 
> Yes, we need such a way to notify driver about migration status and do
> reset or restore operation on the destination machine. My original
> design is to take advantage of device's irq to do that. Driver can tell
> Qemu that which irq it prefers to handle such task and whether the irq
> is enabled or bound with handler. We may discuss the detail in the other
> thread.
> 
> >>>
> >>>>>>>It would be great if we could avoid changing the guest; but at least your guest
> >>>>>>>driver changes don't actually seem to be that hardware specific; could your
> >>>>>>>changes actually be moved to generic PCI level so they could be made
> >>>>>>>to work for lots of drivers?
> >>>>>
> >>>>>It is impossible to use one common solution for all devices unless the PCIE
> >>>>>spec documents it clearly and i think one day it will be there. But before
> >>>>>that, we need some workarounds on guest driver to make it work even it looks
> >>>>>ugly.
> >>
> >>Yes, so far there is not hardware migration support
> >
> >VT-D supports setting dirty bit in the PTE in hardware.
> 
> Actually, this doesn't support in the current hardware.
> VTD spec documents the dirty bit for first level translation which
> requires devices to support DMA request with PASID(process
> address space identifier). Most device don't support the feature.

True, I missed this.  It's generally unfortunate that first level
translation only applies to requests with PASID.  All other features
limited to requests with PASID like nested translation would be very
useful for all requests, not just requests with PASID.


> >
> >>and it's hard to modify
> >>bus level code.
> >
> >Why is it hard?
> 
> As Yang said, the concern is that PCI Spec doesn't document about how to do
> migration.

We can submit a PCI spec ECN documenting a new capability.

I think for existing devices which lack it, adding this capability to
the bridge to which the device is attached is preferable to trying to
add it to the device itself.

> >
> >>It also will block implementation on the Windows.
> >
> >Implementation of what?  We are discussing motivation here, not
> >implementation.  E.g. windows drivers typically support surprise
> >removal, should you use that, you get some working code for free.  Just
> >stop worrying about it.  Make it work, worry about closed source
> >software later.
> >
> >>>Dave
> >>>

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-13 15:47                 ` Lan, Tianyu
@ 2015-12-14  9:26                   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-14  9:26 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Alexander Duyck, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz

On Sun, Dec 13, 2015 at 11:47:44PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
> >On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>>>
> >>>>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>>>hypervisor or qemu to aware the device which means we may need a driver
> >>>>>in
> >>>>>hypervisor or qemu to handle the device on behalf of guest driver.
> >>>
> >>>Can you answer the question of when do you use your code -
> >>>     at the start of migration or
> >>>     just before the end?
> >>
> >>
> >>Just before stopping VCPU in this version and inject VF mailbox irq to
> >>notify the driver if the irq handler is installed.
> >>Qemu side also will check this via the faked PCI migration capability
> >>and driver will set the status during device open() or resume() callback.
> >
> >The VF mailbox interrupt is a very bad idea.  Really the device should
> >be in a reset state on the other side of a migration.  It doesn't make
> >sense to have the interrupt firing if the device is not configured.
> >This is one of the things that is preventing you from being able to
> >migrate the device while the interface is administratively down or the
> >VF driver is not loaded.
> 
> From my opinion, if VF driver is not loaded and hardware doesn't start
> to work, the device state doesn't need to be migrated.
> 
> We may add a flag for driver to check whether migration happened during it's
> down and reinitialize the hardware and clear the flag when system try to put
> it up.
> 
> We may add migration core in the Linux kernel and provide some helps
> functions to facilitate to add migration support for drivers.
> Migration core is in charge to sync status with Qemu.
> 
> Example.
> migration_register()
> Driver provides
> - Callbacks to be called before and after migration or for bad path
> - Its irq which it prefers to deal with migration event.
> 
> migration_event_check()
> Driver calls it in the irq handler. Migration core code will check
> migration status and call its callbacks when migration happens.
> 
> 
> >
> >My thought on all this is that it might make sense to move this
> >functionality into a PCI-to-PCI bridge device and make it a
> >requirement that all direct-assigned devices have to exist behind that
> >device in order to support migration.  That way you would be working
> >with a directly emulated device that would likely already be
> >supporting hot-plug anyway.  Then it would just be a matter of coming
> >up with a few Qemu specific extensions that you would need to add to
> >the device itself.  The same approach would likely be portable enough
> >that you could achieve it with PCIe as well via the same configuration
> >space being present on the upstream side of a PCIe port or maybe a
> >PCIe switch of some sort.
> >
> >It would then be possible to signal via your vendor-specific PCI
> >capability on that device that all devices behind this bridge require
> >DMA page dirtying, you could use the configuration in addition to the
> >interrupt already provided for hot-plug to signal things like when you
> >are starting migration, and possibly even just extend the shpc
> >functionality so that if this capability is present you have the
> >option to pause/resume instead of remove/probe the device in the case
> >of certain hot-plug events.  The fact is there may be some use for a
> >pause/resume type approach for PCIe hot-plug in the near future
> >anyway.  From the sounds of it Apple has required it for all
> >Thunderbolt device drivers so that they can halt the device in order
> >to shuffle resources around, perhaps we should look at something
> >similar for Linux.
> >
> >The other advantage behind grouping functions on one bridge is things
> >like reset domains.  The PCI error handling logic will want to be able
> >to reset any devices that experienced an error in the event of
> >something such as a surprise removal.  By grouping all of the devices
> >you could disable/reset/enable them as one logical group in the event
> >of something such as the "bad path" approach Michael has mentioned.
> >
> 
> These sounds we need to add a faked bridge for migration and adding a
> driver in the guest for it. It also needs to extend PCI bus/hotplug
> driver to do pause/resume other devices, right?
> 
> My concern is still that whether we can change PCI bus/hotplug like that
> without spec change.
> 
> IRQ should be general for any devices and we may extend it for
> migration. Device driver also can make decision to support migration
> or not.

A dedicated IRQ per device for something that is a system wide event
sounds like a waste.  I don't understand why a spec change is strictly
required, we only need to support this with the specific virtual bridge
used by QEMU, so I think that a vendor specific capability will do.
Once this works well in the field, a PCI spec ECN might make sense
to standardise the capability.

> 
> 
> >>>
> >>>>>>>It would be great if we could avoid changing the guest; but at least
> >>>>>>>your guest
> >>>>>>>driver changes don't actually seem to be that hardware specific;
> >>>>>>>could your
> >>>>>>>changes actually be moved to generic PCI level so they could be made
> >>>>>>>to work for lots of drivers?
> >>>>
> >>>>>
> >>>>>It is impossible to use one common solution for all devices unless the
> >>>>>PCIE
> >>>>>spec documents it clearly and i think one day it will be there. But
> >>>>>before
> >>>>>that, we need some workarounds on guest driver to make it work even it
> >>>>>looks
> >>>>>ugly.
> >>
> >>
> >>Yes, so far there is not hardware migration support and it's hard to modify
> >>bus level code. It also will block implementation on the Windows.
> >
> >Please don't assume things.  Unless you have hard data from Microsoft
> >that says they want it this way lets just try to figure out what works
> >best for us for now and then we can start worrying about third party
> >implementations after we have figured out a solution that actually
> >works.
> >
> >- Alex
> >

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-14  9:26                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-14  9:26 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel,
	Alexander Duyck, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini

On Sun, Dec 13, 2015 at 11:47:44PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
> >On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu.lan@intel.com> wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>>>
> >>>>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>>>hypervisor or qemu to aware the device which means we may need a driver
> >>>>>in
> >>>>>hypervisor or qemu to handle the device on behalf of guest driver.
> >>>
> >>>Can you answer the question of when do you use your code -
> >>>     at the start of migration or
> >>>     just before the end?
> >>
> >>
> >>Just before stopping VCPU in this version and inject VF mailbox irq to
> >>notify the driver if the irq handler is installed.
> >>Qemu side also will check this via the faked PCI migration capability
> >>and driver will set the status during device open() or resume() callback.
> >
> >The VF mailbox interrupt is a very bad idea.  Really the device should
> >be in a reset state on the other side of a migration.  It doesn't make
> >sense to have the interrupt firing if the device is not configured.
> >This is one of the things that is preventing you from being able to
> >migrate the device while the interface is administratively down or the
> >VF driver is not loaded.
> 
> From my opinion, if VF driver is not loaded and hardware doesn't start
> to work, the device state doesn't need to be migrated.
> 
> We may add a flag for driver to check whether migration happened during it's
> down and reinitialize the hardware and clear the flag when system try to put
> it up.
> 
> We may add migration core in the Linux kernel and provide some helps
> functions to facilitate to add migration support for drivers.
> Migration core is in charge to sync status with Qemu.
> 
> Example.
> migration_register()
> Driver provides
> - Callbacks to be called before and after migration or for bad path
> - Its irq which it prefers to deal with migration event.
> 
> migration_event_check()
> Driver calls it in the irq handler. Migration core code will check
> migration status and call its callbacks when migration happens.
> 
> 
> >
> >My thought on all this is that it might make sense to move this
> >functionality into a PCI-to-PCI bridge device and make it a
> >requirement that all direct-assigned devices have to exist behind that
> >device in order to support migration.  That way you would be working
> >with a directly emulated device that would likely already be
> >supporting hot-plug anyway.  Then it would just be a matter of coming
> >up with a few Qemu specific extensions that you would need to add to
> >the device itself.  The same approach would likely be portable enough
> >that you could achieve it with PCIe as well via the same configuration
> >space being present on the upstream side of a PCIe port or maybe a
> >PCIe switch of some sort.
> >
> >It would then be possible to signal via your vendor-specific PCI
> >capability on that device that all devices behind this bridge require
> >DMA page dirtying, you could use the configuration in addition to the
> >interrupt already provided for hot-plug to signal things like when you
> >are starting migration, and possibly even just extend the shpc
> >functionality so that if this capability is present you have the
> >option to pause/resume instead of remove/probe the device in the case
> >of certain hot-plug events.  The fact is there may be some use for a
> >pause/resume type approach for PCIe hot-plug in the near future
> >anyway.  From the sounds of it Apple has required it for all
> >Thunderbolt device drivers so that they can halt the device in order
> >to shuffle resources around, perhaps we should look at something
> >similar for Linux.
> >
> >The other advantage behind grouping functions on one bridge is things
> >like reset domains.  The PCI error handling logic will want to be able
> >to reset any devices that experienced an error in the event of
> >something such as a surprise removal.  By grouping all of the devices
> >you could disable/reset/enable them as one logical group in the event
> >of something such as the "bad path" approach Michael has mentioned.
> >
> 
> These sounds we need to add a faked bridge for migration and adding a
> driver in the guest for it. It also needs to extend PCI bus/hotplug
> driver to do pause/resume other devices, right?
> 
> My concern is still that whether we can change PCI bus/hotplug like that
> without spec change.
> 
> IRQ should be general for any devices and we may extend it for
> migration. Device driver also can make decision to support migration
> or not.

A dedicated IRQ per device for something that is a system wide event
sounds like a waste.  I don't understand why a spec change is strictly
required, we only need to support this with the specific virtual bridge
used by QEMU, so I think that a vendor specific capability will do.
Once this works well in the field, a PCI spec ECN might make sense
to standardise the capability.

> 
> 
> >>>
> >>>>>>>It would be great if we could avoid changing the guest; but at least
> >>>>>>>your guest
> >>>>>>>driver changes don't actually seem to be that hardware specific;
> >>>>>>>could your
> >>>>>>>changes actually be moved to generic PCI level so they could be made
> >>>>>>>to work for lots of drivers?
> >>>>
> >>>>>
> >>>>>It is impossible to use one common solution for all devices unless the
> >>>>>PCIE
> >>>>>spec documents it clearly and i think one day it will be there. But
> >>>>>before
> >>>>>that, we need some workarounds on guest driver to make it work even it
> >>>>>looks
> >>>>>ugly.
> >>
> >>
> >>Yes, so far there is not hardware migration support and it's hard to modify
> >>bus level code. It also will block implementation on the Windows.
> >
> >Please don't assume things.  Unless you have hard data from Microsoft
> >that says they want it this way lets just try to figure out what works
> >best for us for now and then we can start worrying about third party
> >implementations after we have figured out a solution that actually
> >works.
> >
> >- Alex
> >

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-13 19:30                   ` Alexander Duyck
@ 2015-12-25  7:03                     ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-12-25  7:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Zhang, Tantilov, Emil S, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini

Merry Christmas.
Sorry for later response due to personal affair.

On 2015年12月14日 03:30, Alexander Duyck wrote:
>> > These sounds we need to add a faked bridge for migration and adding a
>> > driver in the guest for it. It also needs to extend PCI bus/hotplug
>> > driver to do pause/resume other devices, right?
>> >
>> > My concern is still that whether we can change PCI bus/hotplug like that
>> > without spec change.
>> >
>> > IRQ should be general for any devices and we may extend it for
>> > migration. Device driver also can make decision to support migration
>> > or not.
> The device should have no say in the matter.  Either we are going to
> migrate or we will not.  This is why I have suggested my approach as
> it allows for the least amount of driver intrusion while providing the
> maximum number of ways to still perform migration even if the device
> doesn't support it.

Even if the device driver doesn't support migration, you still want to
migrate VM? That maybe risk and we should add the "bad path" for the
driver at least.

> 
> The solution I have proposed is simple:
> 
> 1.  Extend swiotlb to allow for a page dirtying functionality.
> 
>      This part is pretty straight forward.  I'll submit a few patches
> later today as RFC that can provided the minimal functionality needed
> for this.

Very appreciate to do that.

> 
> 2.  Provide a vendor specific configuration space option on the QEMU
> implementation of a PCI bridge to act as a bridge between direct
> assigned devices and the host bridge.
> 
>      My thought was to add some vendor specific block that includes a
> capabilities, status, and control register so you could go through and
> synchronize things like the DMA page dirtying feature.  The bridge
> itself could manage the migration capable bit inside QEMU for all
> devices assigned to it.  So if you added a VF to the bridge it would
> flag that you can support migration in QEMU, while the bridge would
> indicate you cannot until the DMA page dirtying control bit is set by
> the guest.
> 
>      We could also go through and optimize the DMA page dirtying after
> this is added so that we can narrow down the scope of use, and as a
> result improve the performance for other devices that don't need to
> support migration.  It would then be a matter of adding an interrupt
> in the device to handle an event such as the DMA page dirtying status
> bit being set in the config space status register, while the bit is
> not set in the control register.  If it doesn't get set then we would
> have to evict the devices before the warm-up phase of the migration,
> otherwise we can defer it until the end of the warm-up phase.
> 
> 3.  Extend existing shpc driver to support the optional "pause"
> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> hot-plug specification.

Since your solution has added a faked PCI bridge. Why not notify the
bridge directly during migration via irq and call device driver's
callback in the new bridge driver?

Otherwise, the new bridge driver also can check whether the device
driver provides migration callback or not and call them to improve the
passthough device's performance during migration.

> 
>      Note I call out "extend" here instead of saying to add this.
> Basically what we should do is provide a means of quiescing the device
> without unloading the driver.  This is called out as something the OS
> vendor can optionally implement in the PCI hot-plug specification.  On
> OSes that wouldn't support this it would just be treated as a standard
> hot-plug event.   We could add a capability, status, and control bit
> in the vendor specific configuration block for this as well and if we
> set the status bit would indicate the host wants to pause instead of
> remove and the control bit would indicate the guest supports "pause"
> in the OS.  We then could optionally disable guest migration while the
> VF is present and pause is not supported.
> 
>      To support this we would need to add a timer and if a new device
> is not inserted in some period of time (60 seconds for example), or if
> a different device is inserted,
> we need to unload the original driver
> from the device.  In addition we would need to verify if drivers can
> call the remove function after having called suspend without resume.
> If not, we could look at adding a recovery function to remove the
> driver from the device in the case of a suspend with either a failed
> resume or no resume call.  Once again it would probably be useful to
> have for those cases where power management suspend/resume runs into
> an issue like somebody causing a surprise removal while a device was
> suspended.


-- 
Best regards
Tianyu Lan

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-25  7:03                     ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2015-12-25  7:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Zhang, Tantilov, Emil S, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini

Merry Christmas.
Sorry for later response due to personal affair.

On 2015年12月14日 03:30, Alexander Duyck wrote:
>> > These sounds we need to add a faked bridge for migration and adding a
>> > driver in the guest for it. It also needs to extend PCI bus/hotplug
>> > driver to do pause/resume other devices, right?
>> >
>> > My concern is still that whether we can change PCI bus/hotplug like that
>> > without spec change.
>> >
>> > IRQ should be general for any devices and we may extend it for
>> > migration. Device driver also can make decision to support migration
>> > or not.
> The device should have no say in the matter.  Either we are going to
> migrate or we will not.  This is why I have suggested my approach as
> it allows for the least amount of driver intrusion while providing the
> maximum number of ways to still perform migration even if the device
> doesn't support it.

Even if the device driver doesn't support migration, you still want to
migrate VM? That maybe risk and we should add the "bad path" for the
driver at least.

> 
> The solution I have proposed is simple:
> 
> 1.  Extend swiotlb to allow for a page dirtying functionality.
> 
>      This part is pretty straight forward.  I'll submit a few patches
> later today as RFC that can provided the minimal functionality needed
> for this.

Very appreciate to do that.

> 
> 2.  Provide a vendor specific configuration space option on the QEMU
> implementation of a PCI bridge to act as a bridge between direct
> assigned devices and the host bridge.
> 
>      My thought was to add some vendor specific block that includes a
> capabilities, status, and control register so you could go through and
> synchronize things like the DMA page dirtying feature.  The bridge
> itself could manage the migration capable bit inside QEMU for all
> devices assigned to it.  So if you added a VF to the bridge it would
> flag that you can support migration in QEMU, while the bridge would
> indicate you cannot until the DMA page dirtying control bit is set by
> the guest.
> 
>      We could also go through and optimize the DMA page dirtying after
> this is added so that we can narrow down the scope of use, and as a
> result improve the performance for other devices that don't need to
> support migration.  It would then be a matter of adding an interrupt
> in the device to handle an event such as the DMA page dirtying status
> bit being set in the config space status register, while the bit is
> not set in the control register.  If it doesn't get set then we would
> have to evict the devices before the warm-up phase of the migration,
> otherwise we can defer it until the end of the warm-up phase.
> 
> 3.  Extend existing shpc driver to support the optional "pause"
> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> hot-plug specification.

Since your solution has added a faked PCI bridge. Why not notify the
bridge directly during migration via irq and call device driver's
callback in the new bridge driver?

Otherwise, the new bridge driver also can check whether the device
driver provides migration callback or not and call them to improve the
passthough device's performance during migration.

> 
>      Note I call out "extend" here instead of saying to add this.
> Basically what we should do is provide a means of quiescing the device
> without unloading the driver.  This is called out as something the OS
> vendor can optionally implement in the PCI hot-plug specification.  On
> OSes that wouldn't support this it would just be treated as a standard
> hot-plug event.   We could add a capability, status, and control bit
> in the vendor specific configuration block for this as well and if we
> set the status bit would indicate the host wants to pause instead of
> remove and the control bit would indicate the guest supports "pause"
> in the OS.  We then could optionally disable guest migration while the
> VF is present and pause is not supported.
> 
>      To support this we would need to add a timer and if a new device
> is not inserted in some period of time (60 seconds for example), or if
> a different device is inserted,
> we need to unload the original driver
> from the device.  In addition we would need to verify if drivers can
> call the remove function after having called suspend without resume.
> If not, we could look at adding a recovery function to remove the
> driver from the device in the case of a suspend with either a failed
> resume or no resume call.  Once again it would probably be useful to
> have for those cases where power management suspend/resume runs into
> an issue like somebody causing a surprise removal while a device was
> suspended.


-- 
Best regards
Tianyu Lan

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-25  7:03                     ` [Qemu-devel] " Lan Tianyu
@ 2015-12-25 12:11                       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-25 12:11 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: Alexander Duyck, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz

On Fri, Dec 25, 2015 at 03:03:47PM +0800, Lan Tianyu wrote:
> Merry Christmas.
> Sorry for later response due to personal affair.
> 
> On 2015年12月14日 03:30, Alexander Duyck wrote:
> >> > These sounds we need to add a faked bridge for migration and adding a
> >> > driver in the guest for it. It also needs to extend PCI bus/hotplug
> >> > driver to do pause/resume other devices, right?
> >> >
> >> > My concern is still that whether we can change PCI bus/hotplug like that
> >> > without spec change.
> >> >
> >> > IRQ should be general for any devices and we may extend it for
> >> > migration. Device driver also can make decision to support migration
> >> > or not.
> > The device should have no say in the matter.  Either we are going to
> > migrate or we will not.  This is why I have suggested my approach as
> > it allows for the least amount of driver intrusion while providing the
> > maximum number of ways to still perform migration even if the device
> > doesn't support it.
> 
> Even if the device driver doesn't support migration, you still want to
> migrate VM? That maybe risk and we should add the "bad path" for the
> driver at least.
> 
> > 
> > The solution I have proposed is simple:
> > 
> > 1.  Extend swiotlb to allow for a page dirtying functionality.
> > 
> >      This part is pretty straight forward.  I'll submit a few patches
> > later today as RFC that can provided the minimal functionality needed
> > for this.
> 
> Very appreciate to do that.
> 
> > 
> > 2.  Provide a vendor specific configuration space option on the QEMU
> > implementation of a PCI bridge to act as a bridge between direct
> > assigned devices and the host bridge.
> > 
> >      My thought was to add some vendor specific block that includes a
> > capabilities, status, and control register so you could go through and
> > synchronize things like the DMA page dirtying feature.  The bridge
> > itself could manage the migration capable bit inside QEMU for all
> > devices assigned to it.  So if you added a VF to the bridge it would
> > flag that you can support migration in QEMU, while the bridge would
> > indicate you cannot until the DMA page dirtying control bit is set by
> > the guest.
> > 
> >      We could also go through and optimize the DMA page dirtying after
> > this is added so that we can narrow down the scope of use, and as a
> > result improve the performance for other devices that don't need to
> > support migration.  It would then be a matter of adding an interrupt
> > in the device to handle an event such as the DMA page dirtying status
> > bit being set in the config space status register, while the bit is
> > not set in the control register.  If it doesn't get set then we would
> > have to evict the devices before the warm-up phase of the migration,
> > otherwise we can defer it until the end of the warm-up phase.
> > 
> > 3.  Extend existing shpc driver to support the optional "pause"
> > functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> > hot-plug specification.
> 
> Since your solution has added a faked PCI bridge. Why not notify the
> bridge directly during migration via irq and call device driver's
> callback in the new bridge driver?
> 
> Otherwise, the new bridge driver also can check whether the device
> driver provides migration callback or not and call them to improve the
> passthough device's performance during migration.

As long as you keep up this vague talk about performance during
migration, without even bothering with any measurements, this patchset
will keep going nowhere.




There's Alex's patch that tracks memory changes during migration.  It
needs some simple enhancements to be useful in production (e.g. add a
host/guest handshake to both enable tracking in guest and to detect the
support in host), then it can allow starting migration with an assigned
device, by invoking hot-unplug after most of memory have been migrated.

Please implement this in qemu and measure the speed.
I will not be surprised if destroying/creating netdev in linux
turns out to take too long, but before anyone bothered
checking, it does not make sense to discuss further enhancements.



> > 
> >      Note I call out "extend" here instead of saying to add this.
> > Basically what we should do is provide a means of quiescing the device
> > without unloading the driver.  This is called out as something the OS
> > vendor can optionally implement in the PCI hot-plug specification.  On
> > OSes that wouldn't support this it would just be treated as a standard
> > hot-plug event.   We could add a capability, status, and control bit
> > in the vendor specific configuration block for this as well and if we
> > set the status bit would indicate the host wants to pause instead of
> > remove and the control bit would indicate the guest supports "pause"
> > in the OS.  We then could optionally disable guest migration while the
> > VF is present and pause is not supported.
> > 
> >      To support this we would need to add a timer and if a new device
> > is not inserted in some period of time (60 seconds for example), or if
> > a different device is inserted,
> > we need to unload the original driver
> > from the device.  In addition we would need to verify if drivers can
> > call the remove function after having called suspend without resume.
> > If not, we could look at adding a recovery function to remove the
> > driver from the device in the case of a suspend with either a failed
> > resume or no resume call.  Once again it would probably be useful to
> > have for those cases where power management suspend/resume runs into
> > an issue like somebody causing a surprise removal while a device was
> > suspended.
> 
> 
> -- 
> Best regards
> Tianyu Lan

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-25 12:11                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-25 12:11 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel,
	Alexander Duyck, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini

On Fri, Dec 25, 2015 at 03:03:47PM +0800, Lan Tianyu wrote:
> Merry Christmas.
> Sorry for later response due to personal affair.
> 
> On 2015年12月14日 03:30, Alexander Duyck wrote:
> >> > These sounds we need to add a faked bridge for migration and adding a
> >> > driver in the guest for it. It also needs to extend PCI bus/hotplug
> >> > driver to do pause/resume other devices, right?
> >> >
> >> > My concern is still that whether we can change PCI bus/hotplug like that
> >> > without spec change.
> >> >
> >> > IRQ should be general for any devices and we may extend it for
> >> > migration. Device driver also can make decision to support migration
> >> > or not.
> > The device should have no say in the matter.  Either we are going to
> > migrate or we will not.  This is why I have suggested my approach as
> > it allows for the least amount of driver intrusion while providing the
> > maximum number of ways to still perform migration even if the device
> > doesn't support it.
> 
> Even if the device driver doesn't support migration, you still want to
> migrate VM? That maybe risk and we should add the "bad path" for the
> driver at least.
> 
> > 
> > The solution I have proposed is simple:
> > 
> > 1.  Extend swiotlb to allow for a page dirtying functionality.
> > 
> >      This part is pretty straight forward.  I'll submit a few patches
> > later today as RFC that can provided the minimal functionality needed
> > for this.
> 
> Very appreciate to do that.
> 
> > 
> > 2.  Provide a vendor specific configuration space option on the QEMU
> > implementation of a PCI bridge to act as a bridge between direct
> > assigned devices and the host bridge.
> > 
> >      My thought was to add some vendor specific block that includes a
> > capabilities, status, and control register so you could go through and
> > synchronize things like the DMA page dirtying feature.  The bridge
> > itself could manage the migration capable bit inside QEMU for all
> > devices assigned to it.  So if you added a VF to the bridge it would
> > flag that you can support migration in QEMU, while the bridge would
> > indicate you cannot until the DMA page dirtying control bit is set by
> > the guest.
> > 
> >      We could also go through and optimize the DMA page dirtying after
> > this is added so that we can narrow down the scope of use, and as a
> > result improve the performance for other devices that don't need to
> > support migration.  It would then be a matter of adding an interrupt
> > in the device to handle an event such as the DMA page dirtying status
> > bit being set in the config space status register, while the bit is
> > not set in the control register.  If it doesn't get set then we would
> > have to evict the devices before the warm-up phase of the migration,
> > otherwise we can defer it until the end of the warm-up phase.
> > 
> > 3.  Extend existing shpc driver to support the optional "pause"
> > functionality as called out in section 4.1.2 of the Revision 1.1 PCI
> > hot-plug specification.
> 
> Since your solution has added a faked PCI bridge. Why not notify the
> bridge directly during migration via irq and call device driver's
> callback in the new bridge driver?
> 
> Otherwise, the new bridge driver also can check whether the device
> driver provides migration callback or not and call them to improve the
> passthough device's performance during migration.

As long as you keep up this vague talk about performance during
migration, without even bothering with any measurements, this patchset
will keep going nowhere.




There's Alex's patch that tracks memory changes during migration.  It
needs some simple enhancements to be useful in production (e.g. add a
host/guest handshake to both enable tracking in guest and to detect the
support in host), then it can allow starting migration with an assigned
device, by invoking hot-unplug after most of memory have been migrated.

Please implement this in qemu and measure the speed.
I will not be surprised if destroying/creating netdev in linux
turns out to take too long, but before anyone bothered
checking, it does not make sense to discuss further enhancements.



> > 
> >      Note I call out "extend" here instead of saying to add this.
> > Basically what we should do is provide a means of quiescing the device
> > without unloading the driver.  This is called out as something the OS
> > vendor can optionally implement in the PCI hot-plug specification.  On
> > OSes that wouldn't support this it would just be treated as a standard
> > hot-plug event.   We could add a capability, status, and control bit
> > in the vendor specific configuration block for this as well and if we
> > set the status bit would indicate the host wants to pause instead of
> > remove and the control bit would indicate the guest supports "pause"
> > in the OS.  We then could optionally disable guest migration while the
> > VF is present and pause is not supported.
> > 
> >      To support this we would need to add a timer and if a new device
> > is not inserted in some period of time (60 seconds for example), or if
> > a different device is inserted,
> > we need to unload the original driver
> > from the device.  In addition we would need to verify if drivers can
> > call the remove function after having called suspend without resume.
> > If not, we could look at adding a recovery function to remove the
> > driver from the device in the case of a suspend with either a failed
> > resume or no resume call.  Once again it would probably be useful to
> > have for those cases where power management suspend/resume runs into
> > an issue like somebody causing a surprise removal while a device was
> > suspended.
> 
> 
> -- 
> Best regards
> Tianyu Lan

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-25  7:03                     ` [Qemu-devel] " Lan Tianyu
@ 2015-12-25 22:31                       ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-25 22:31 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: Dr. David Alan Gilbert, Yang Zhang, Michael S. Tsirkin,
	qemu-devel, Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore,
	Donald C, quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf,
	Blue Swirl, cornelia.huck, Alex Williamson, kraxel,
	Anthony Liguori, amit.shah, Paolo Bonzini, Rustad, Mark D,
	lcapitulino, Or Gerlitz

On Thu, Dec 24, 2015 at 11:03 PM, Lan Tianyu <tianyu.lan@intel.com> wrote:
> Merry Christmas.
> Sorry for later response due to personal affair.
>
> On 2015年12月14日 03:30, Alexander Duyck wrote:
>>> > These sounds we need to add a faked bridge for migration and adding a
>>> > driver in the guest for it. It also needs to extend PCI bus/hotplug
>>> > driver to do pause/resume other devices, right?
>>> >
>>> > My concern is still that whether we can change PCI bus/hotplug like that
>>> > without spec change.
>>> >
>>> > IRQ should be general for any devices and we may extend it for
>>> > migration. Device driver also can make decision to support migration
>>> > or not.
>> The device should have no say in the matter.  Either we are going to
>> migrate or we will not.  This is why I have suggested my approach as
>> it allows for the least amount of driver intrusion while providing the
>> maximum number of ways to still perform migration even if the device
>> doesn't support it.
>
> Even if the device driver doesn't support migration, you still want to
> migrate VM? That maybe risk and we should add the "bad path" for the
> driver at least.

At a minimum we should have support for hot-plug if we are expecting
to support migration.  You would simply have to hot-plug the device
before you start migration and then return it after.  That is how the
current bonding approach for this works if I am not mistaken.

The advantage we are looking to gain is to avoid removing/disabling
the device for as long as possible.  Ideally we want to keep the
device active through the warm-up period, but if the guest doesn't do
that we should still be able to fall back on the older approaches if
needed.

>>
>> The solution I have proposed is simple:
>>
>> 1.  Extend swiotlb to allow for a page dirtying functionality.
>>
>>      This part is pretty straight forward.  I'll submit a few patches
>> later today as RFC that can provided the minimal functionality needed
>> for this.
>
> Very appreciate to do that.
>
>>
>> 2.  Provide a vendor specific configuration space option on the QEMU
>> implementation of a PCI bridge to act as a bridge between direct
>> assigned devices and the host bridge.
>>
>>      My thought was to add some vendor specific block that includes a
>> capabilities, status, and control register so you could go through and
>> synchronize things like the DMA page dirtying feature.  The bridge
>> itself could manage the migration capable bit inside QEMU for all
>> devices assigned to it.  So if you added a VF to the bridge it would
>> flag that you can support migration in QEMU, while the bridge would
>> indicate you cannot until the DMA page dirtying control bit is set by
>> the guest.
>>
>>      We could also go through and optimize the DMA page dirtying after
>> this is added so that we can narrow down the scope of use, and as a
>> result improve the performance for other devices that don't need to
>> support migration.  It would then be a matter of adding an interrupt
>> in the device to handle an event such as the DMA page dirtying status
>> bit being set in the config space status register, while the bit is
>> not set in the control register.  If it doesn't get set then we would
>> have to evict the devices before the warm-up phase of the migration,
>> otherwise we can defer it until the end of the warm-up phase.
>>
>> 3.  Extend existing shpc driver to support the optional "pause"
>> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
>> hot-plug specification.
>
> Since your solution has added a faked PCI bridge. Why not notify the
> bridge directly during migration via irq and call device driver's
> callback in the new bridge driver?
>
> Otherwise, the new bridge driver also can check whether the device
> driver provides migration callback or not and call them to improve the
> passthough device's performance during migration.

This is basically what I had in mind.  Though I would take things one
step further.  You don't need to add any new call-backs if you make
use of the existing suspend/resume logic.  For a VF this does exactly
what you would need since the VFs don't support wake on LAN so it will
simply clear the bus master enable and put the netdev in a suspended
state until the resume can be called.

The PCI hot-plug specification calls out that the OS can optionally
implement a "pause" mechanism which is meant to be used for high
availability type environments.  What I am proposing is basically
extending the standard SHPC capable PCI bridge so that we can support
the DMA page dirtying for everything hosted on it, add a vendor
specific block to the config space so that the guest can notify the
host that it will do page dirtying, and add a mechanism to indicate
that all hot-plug events during the warm-up phase of the migration are
pause events instead of full removals.

I've been poking around in the kernel and QEMU code and the part I
have been trying to sort out is how to get QEMU based pci-bridge to
use the SHPC driver because from what I can tell the driver never
actually gets loaded on the device as it is left in the control of
ACPI hot-plug.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-25 22:31                       ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-25 22:31 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: Yang Zhang, Tantilov, Emil S, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini

On Thu, Dec 24, 2015 at 11:03 PM, Lan Tianyu <tianyu.lan@intel.com> wrote:
> Merry Christmas.
> Sorry for later response due to personal affair.
>
> On 2015年12月14日 03:30, Alexander Duyck wrote:
>>> > These sounds we need to add a faked bridge for migration and adding a
>>> > driver in the guest for it. It also needs to extend PCI bus/hotplug
>>> > driver to do pause/resume other devices, right?
>>> >
>>> > My concern is still that whether we can change PCI bus/hotplug like that
>>> > without spec change.
>>> >
>>> > IRQ should be general for any devices and we may extend it for
>>> > migration. Device driver also can make decision to support migration
>>> > or not.
>> The device should have no say in the matter.  Either we are going to
>> migrate or we will not.  This is why I have suggested my approach as
>> it allows for the least amount of driver intrusion while providing the
>> maximum number of ways to still perform migration even if the device
>> doesn't support it.
>
> Even if the device driver doesn't support migration, you still want to
> migrate VM? That maybe risk and we should add the "bad path" for the
> driver at least.

At a minimum we should have support for hot-plug if we are expecting
to support migration.  You would simply have to hot-plug the device
before you start migration and then return it after.  That is how the
current bonding approach for this works if I am not mistaken.

The advantage we are looking to gain is to avoid removing/disabling
the device for as long as possible.  Ideally we want to keep the
device active through the warm-up period, but if the guest doesn't do
that we should still be able to fall back on the older approaches if
needed.

>>
>> The solution I have proposed is simple:
>>
>> 1.  Extend swiotlb to allow for a page dirtying functionality.
>>
>>      This part is pretty straight forward.  I'll submit a few patches
>> later today as RFC that can provided the minimal functionality needed
>> for this.
>
> Very appreciate to do that.
>
>>
>> 2.  Provide a vendor specific configuration space option on the QEMU
>> implementation of a PCI bridge to act as a bridge between direct
>> assigned devices and the host bridge.
>>
>>      My thought was to add some vendor specific block that includes a
>> capabilities, status, and control register so you could go through and
>> synchronize things like the DMA page dirtying feature.  The bridge
>> itself could manage the migration capable bit inside QEMU for all
>> devices assigned to it.  So if you added a VF to the bridge it would
>> flag that you can support migration in QEMU, while the bridge would
>> indicate you cannot until the DMA page dirtying control bit is set by
>> the guest.
>>
>>      We could also go through and optimize the DMA page dirtying after
>> this is added so that we can narrow down the scope of use, and as a
>> result improve the performance for other devices that don't need to
>> support migration.  It would then be a matter of adding an interrupt
>> in the device to handle an event such as the DMA page dirtying status
>> bit being set in the config space status register, while the bit is
>> not set in the control register.  If it doesn't get set then we would
>> have to evict the devices before the warm-up phase of the migration,
>> otherwise we can defer it until the end of the warm-up phase.
>>
>> 3.  Extend existing shpc driver to support the optional "pause"
>> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
>> hot-plug specification.
>
> Since your solution has added a faked PCI bridge. Why not notify the
> bridge directly during migration via irq and call device driver's
> callback in the new bridge driver?
>
> Otherwise, the new bridge driver also can check whether the device
> driver provides migration callback or not and call them to improve the
> passthough device's performance during migration.

This is basically what I had in mind.  Though I would take things one
step further.  You don't need to add any new call-backs if you make
use of the existing suspend/resume logic.  For a VF this does exactly
what you would need since the VFs don't support wake on LAN so it will
simply clear the bus master enable and put the netdev in a suspended
state until the resume can be called.

The PCI hot-plug specification calls out that the OS can optionally
implement a "pause" mechanism which is meant to be used for high
availability type environments.  What I am proposing is basically
extending the standard SHPC capable PCI bridge so that we can support
the DMA page dirtying for everything hosted on it, add a vendor
specific block to the config space so that the guest can notify the
host that it will do page dirtying, and add a mechanism to indicate
that all hot-plug events during the warm-up phase of the migration are
pause events instead of full removals.

I've been poking around in the kernel and QEMU code and the part I
have been trying to sort out is how to get QEMU based pci-bridge to
use the SHPC driver because from what I can tell the driver never
actually gets loaded on the device as it is left in the control of
ACPI hot-plug.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-25 22:31                       ` Alexander Duyck
@ 2015-12-27  9:21                         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-27  9:21 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel, lcapitulino,
	Blue Swirl, kraxel, Rustad, Mark D, quintela, Skidmore, Donald C,
	Alexander Graf, Or Gerlitz, Dr. David Alan Gilbert,
	Alex Williamson, Anthony Liguori, cornelia.huck, Lan Tianyu,
	Ard Biesheuvel, Dong, Eddie, Jani, Nrupal, amit.shah,
	Paolo Bonzini

On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
> The PCI hot-plug specification calls out that the OS can optionally
> implement a "pause" mechanism which is meant to be used for high
> availability type environments.  What I am proposing is basically
> extending the standard SHPC capable PCI bridge so that we can support
> the DMA page dirtying for everything hosted on it, add a vendor
> specific block to the config space so that the guest can notify the
> host that it will do page dirtying, and add a mechanism to indicate
> that all hot-plug events during the warm-up phase of the migration are
> pause events instead of full removals.

Two comments:

1. A vendor specific capability will always be problematic.
Better to register a capability id with pci sig.

2. There are actually several capabilities:

A. support for memory dirtying
	if not supported, we must stop device before migration

	This is supported by core guest OS code,
	using patches similar to posted by you.


B. support for device replacement
	This is a faster form of hotplug, where device is removed and
	later another device using same driver is inserted in the same slot.

	This is a possible optimization, but I am convinced
	(A) should be implemented independently of (B).
	
	


> I've been poking around in the kernel and QEMU code and the part I
> have been trying to sort out is how to get QEMU based pci-bridge to
> use the SHPC driver because from what I can tell the driver never
> actually gets loaded on the device as it is left in the control of
> ACPI hot-plug.
> 
> - Alex

There are ways, but you can just use pci express, it's easier.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-27  9:21                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-27  9:21 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel, lcapitulino,
	Blue Swirl, kraxel, Rustad, Mark D, quintela, Skidmore, Donald C,
	Alexander Graf, Or Gerlitz, Dr. David Alan Gilbert,
	Alex Williamson, Anthony Liguori, cornelia.huck, Lan Tianyu,
	Ard Biesheuvel, Dong, Eddie, Jani, Nrupal, amit.shah,
	Paolo Bonzini

On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
> The PCI hot-plug specification calls out that the OS can optionally
> implement a "pause" mechanism which is meant to be used for high
> availability type environments.  What I am proposing is basically
> extending the standard SHPC capable PCI bridge so that we can support
> the DMA page dirtying for everything hosted on it, add a vendor
> specific block to the config space so that the guest can notify the
> host that it will do page dirtying, and add a mechanism to indicate
> that all hot-plug events during the warm-up phase of the migration are
> pause events instead of full removals.

Two comments:

1. A vendor specific capability will always be problematic.
Better to register a capability id with pci sig.

2. There are actually several capabilities:

A. support for memory dirtying
	if not supported, we must stop device before migration

	This is supported by core guest OS code,
	using patches similar to posted by you.


B. support for device replacement
	This is a faster form of hotplug, where device is removed and
	later another device using same driver is inserted in the same slot.

	This is a possible optimization, but I am convinced
	(A) should be implemented independently of (B).
	
	


> I've been poking around in the kernel and QEMU code and the part I
> have been trying to sort out is how to get QEMU based pci-bridge to
> use the SHPC driver because from what I can tell the driver never
> actually gets loaded on the device as it is left in the control of
> ACPI hot-plug.
> 
> - Alex

There are ways, but you can just use pci express, it's easier.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-27  9:21                         ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-27 21:45                           ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-27 21:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lan Tianyu, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz

On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
>> The PCI hot-plug specification calls out that the OS can optionally
>> implement a "pause" mechanism which is meant to be used for high
>> availability type environments.  What I am proposing is basically
>> extending the standard SHPC capable PCI bridge so that we can support
>> the DMA page dirtying for everything hosted on it, add a vendor
>> specific block to the config space so that the guest can notify the
>> host that it will do page dirtying, and add a mechanism to indicate
>> that all hot-plug events during the warm-up phase of the migration are
>> pause events instead of full removals.
>
> Two comments:
>
> 1. A vendor specific capability will always be problematic.
> Better to register a capability id with pci sig.
>
> 2. There are actually several capabilities:
>
> A. support for memory dirtying
>         if not supported, we must stop device before migration
>
>         This is supported by core guest OS code,
>         using patches similar to posted by you.
>
>
> B. support for device replacement
>         This is a faster form of hotplug, where device is removed and
>         later another device using same driver is inserted in the same slot.
>
>         This is a possible optimization, but I am convinced
>         (A) should be implemented independently of (B).
>

My thought on this was that we don't need much to really implement
either feature.  Really only a bit or two for either one.  I had
thought about extending the PCI Advanced Features, but for now it
might make more sense to just implement it as a vendor capability for
the QEMU based bridges instead of trying to make this a true PCI
capability since I am not sure if this in any way would apply to
physical hardware.  The fact is the PCI Advanced Features capability
is essentially just a vendor specific capability with a different ID
so if we were to use 2 bits that are currently reserved in the
capability we could later merge the functionality without much
overhead.

I fully agree that the two implementations should be separate but
nothing says we have to implement them completely different.  If we
are just using 3 bits for capability, status, and control of each
feature there is no reason for them to need to be stored in separate
locations.

>> I've been poking around in the kernel and QEMU code and the part I
>> have been trying to sort out is how to get QEMU based pci-bridge to
>> use the SHPC driver because from what I can tell the driver never
>> actually gets loaded on the device as it is left in the control of
>> ACPI hot-plug.
>
> There are ways, but you can just use pci express, it's easier.

That's true.  I should probably just give up on trying to do an
implementation that works with the i440fx implementation.  I could
probably move over to the q35 and once that is done then we could look
at something like the PCI Advanced Features solution for something
like the PCI-bridge drivers.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-27 21:45                           ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-27 21:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel, lcapitulino,
	Blue Swirl, kraxel, Rustad, Mark D, quintela, Skidmore, Donald C,
	Alexander Graf, Or Gerlitz, Dr. David Alan Gilbert,
	Alex Williamson, Anthony Liguori, cornelia.huck, Lan Tianyu,
	Ard Biesheuvel, Dong, Eddie, Jani, Nrupal, amit.shah,
	Paolo Bonzini

On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
>> The PCI hot-plug specification calls out that the OS can optionally
>> implement a "pause" mechanism which is meant to be used for high
>> availability type environments.  What I am proposing is basically
>> extending the standard SHPC capable PCI bridge so that we can support
>> the DMA page dirtying for everything hosted on it, add a vendor
>> specific block to the config space so that the guest can notify the
>> host that it will do page dirtying, and add a mechanism to indicate
>> that all hot-plug events during the warm-up phase of the migration are
>> pause events instead of full removals.
>
> Two comments:
>
> 1. A vendor specific capability will always be problematic.
> Better to register a capability id with pci sig.
>
> 2. There are actually several capabilities:
>
> A. support for memory dirtying
>         if not supported, we must stop device before migration
>
>         This is supported by core guest OS code,
>         using patches similar to posted by you.
>
>
> B. support for device replacement
>         This is a faster form of hotplug, where device is removed and
>         later another device using same driver is inserted in the same slot.
>
>         This is a possible optimization, but I am convinced
>         (A) should be implemented independently of (B).
>

My thought on this was that we don't need much to really implement
either feature.  Really only a bit or two for either one.  I had
thought about extending the PCI Advanced Features, but for now it
might make more sense to just implement it as a vendor capability for
the QEMU based bridges instead of trying to make this a true PCI
capability since I am not sure if this in any way would apply to
physical hardware.  The fact is the PCI Advanced Features capability
is essentially just a vendor specific capability with a different ID
so if we were to use 2 bits that are currently reserved in the
capability we could later merge the functionality without much
overhead.

I fully agree that the two implementations should be separate but
nothing says we have to implement them completely different.  If we
are just using 3 bits for capability, status, and control of each
feature there is no reason for them to need to be stored in separate
locations.

>> I've been poking around in the kernel and QEMU code and the part I
>> have been trying to sort out is how to get QEMU based pci-bridge to
>> use the SHPC driver because from what I can tell the driver never
>> actually gets loaded on the device as it is left in the control of
>> ACPI hot-plug.
>
> There are ways, but you can just use pci express, it's easier.

That's true.  I should probably just give up on trying to do an
implementation that works with the i440fx implementation.  I could
probably move over to the q35 and once that is done then we could look
at something like the PCI Advanced Features solution for something
like the PCI-bridge drivers.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* RE: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-25 22:31                       ` Alexander Duyck
@ 2015-12-28  3:20                         ` Dong, Eddie
  -1 siblings, 0 replies; 142+ messages in thread
From: Dong, Eddie @ 2015-12-28  3:20 UTC (permalink / raw)
  To: Alexander Duyck, Lan, Tianyu
  Cc: Dr. David Alan Gilbert, Yang Zhang, Michael S. Tsirkin,
	qemu-devel, Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore,
	Donald C, quintela, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D,

> >
> > Even if the device driver doesn't support migration, you still want to
> > migrate VM? That maybe risk and we should add the "bad path" for the
> > driver at least.
> 
> At a minimum we should have support for hot-plug if we are expecting to
> support migration.  You would simply have to hot-plug the device before you
> start migration and then return it after.  That is how the current bonding
> approach for this works if I am not mistaken.

Hotplug is good to eliminate the device spefic state clone, but bonding approach is very network specific, it doesn’t work for other devices such as FPGA device, QaT devices & GPU devices, which we plan to support gradually :)

> 
> The advantage we are looking to gain is to avoid removing/disabling the
> device for as long as possible.  Ideally we want to keep the device active
> through the warm-up period, but if the guest doesn't do that we should still
> be able to fall back on the older approaches if needed.
> 

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-28  3:20                         ` Dong, Eddie
  0 siblings, 0 replies; 142+ messages in thread
From: Dong, Eddie @ 2015-12-28  3:20 UTC (permalink / raw)
  To: Alexander Duyck, Lan, Tianyu
  Cc: Yang Zhang, Tantilov, Emil S, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Jani, Nrupal, amit.shah,
	Paolo Bonzini

> >
> > Even if the device driver doesn't support migration, you still want to
> > migrate VM? That maybe risk and we should add the "bad path" for the
> > driver at least.
> 
> At a minimum we should have support for hot-plug if we are expecting to
> support migration.  You would simply have to hot-plug the device before you
> start migration and then return it after.  That is how the current bonding
> approach for this works if I am not mistaken.

Hotplug is good to eliminate the device spefic state clone, but bonding approach is very network specific, it doesn’t work for other devices such as FPGA device, QaT devices & GPU devices, which we plan to support gradually :)

> 
> The advantage we are looking to gain is to avoid removing/disabling the
> device for as long as possible.  Ideally we want to keep the device active
> through the warm-up period, but if the guest doesn't do that we should still
> be able to fall back on the older approaches if needed.
> 

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-28  3:20                         ` Dong, Eddie
@ 2015-12-28  4:26                           ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-28  4:26 UTC (permalink / raw)
  To: Dong, Eddie
  Cc: Yang Zhang, Tantilov, Emil S, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Lan, Tianyu, Ard Biesheuvel, Jani, Nrupal

On Sun, Dec 27, 2015 at 7:20 PM, Dong, Eddie <eddie.dong@intel.com> wrote:
>> >
>> > Even if the device driver doesn't support migration, you still want to
>> > migrate VM? That maybe risk and we should add the "bad path" for the
>> > driver at least.
>>
>> At a minimum we should have support for hot-plug if we are expecting to
>> support migration.  You would simply have to hot-plug the device before you
>> start migration and then return it after.  That is how the current bonding
>> approach for this works if I am not mistaken.
>
> Hotplug is good to eliminate the device spefic state clone, but bonding approach is very network specific, it doesn’t work for other devices such as FPGA device, QaT devices & GPU devices, which we plan to support gradually :)

Hotplug would be usable for that assuming the guest supports the
optional "pause" implementation as called out in the PCI hotplug spec.
With that the device can maintain state for some period of time after
the hotplug remove event has occurred.

The problem is that you have to get the device to quiesce at some
point as you cannot complete the migration with the device still
active.  The way you were doing it was using the per-device
configuration space mechanism.  That doesn't scale when you have to
implement it for each and every driver for each and every OS you have
to support.  Using the "pause" implementation for hot-plug would have
a much greater likelihood of scaling as you could either take the fast
path approach of "pausing" the device to resume it when migration has
completed, or you could just remove the device and restart the driver
on the other side if the pause support is not yet implemented.  You
would lose the state under such a migration but it is much more
practical than having to implement a per device solution.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-28  4:26                           ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-28  4:26 UTC (permalink / raw)
  To: Dong, Eddie
  Cc: Yang Zhang, Tantilov, Emil S, kvm, Michael S. Tsirkin, aik,
	qemu-devel, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Lan, Tianyu, Ard Biesheuvel, Jani, Nrupal,
	amit.shah, Paolo Bonzini

On Sun, Dec 27, 2015 at 7:20 PM, Dong, Eddie <eddie.dong@intel.com> wrote:
>> >
>> > Even if the device driver doesn't support migration, you still want to
>> > migrate VM? That maybe risk and we should add the "bad path" for the
>> > driver at least.
>>
>> At a minimum we should have support for hot-plug if we are expecting to
>> support migration.  You would simply have to hot-plug the device before you
>> start migration and then return it after.  That is how the current bonding
>> approach for this works if I am not mistaken.
>
> Hotplug is good to eliminate the device spefic state clone, but bonding approach is very network specific, it doesn’t work for other devices such as FPGA device, QaT devices & GPU devices, which we plan to support gradually :)

Hotplug would be usable for that assuming the guest supports the
optional "pause" implementation as called out in the PCI hotplug spec.
With that the device can maintain state for some period of time after
the hotplug remove event has occurred.

The problem is that you have to get the device to quiesce at some
point as you cannot complete the migration with the device still
active.  The way you were doing it was using the per-device
configuration space mechanism.  That doesn't scale when you have to
implement it for each and every driver for each and every OS you have
to support.  Using the "pause" implementation for hot-plug would have
a much greater likelihood of scaling as you could either take the fast
path approach of "pausing" the device to resume it when migration has
completed, or you could just remove the device and restart the driver
on the other side if the pause support is not yet implemented.  You
would lose the state under such a migration but it is much more
practical than having to implement a per device solution.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-27 21:45                           ` Alexander Duyck
@ 2015-12-28  8:51                             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-28  8:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Lan Tianyu, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz

On Sun, Dec 27, 2015 at 01:45:15PM -0800, Alexander Duyck wrote:
> On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
> >> The PCI hot-plug specification calls out that the OS can optionally
> >> implement a "pause" mechanism which is meant to be used for high
> >> availability type environments.  What I am proposing is basically
> >> extending the standard SHPC capable PCI bridge so that we can support
> >> the DMA page dirtying for everything hosted on it, add a vendor
> >> specific block to the config space so that the guest can notify the
> >> host that it will do page dirtying, and add a mechanism to indicate
> >> that all hot-plug events during the warm-up phase of the migration are
> >> pause events instead of full removals.
> >
> > Two comments:
> >
> > 1. A vendor specific capability will always be problematic.
> > Better to register a capability id with pci sig.
> >
> > 2. There are actually several capabilities:
> >
> > A. support for memory dirtying
> >         if not supported, we must stop device before migration
> >
> >         This is supported by core guest OS code,
> >         using patches similar to posted by you.
> >
> >
> > B. support for device replacement
> >         This is a faster form of hotplug, where device is removed and
> >         later another device using same driver is inserted in the same slot.
> >
> >         This is a possible optimization, but I am convinced
> >         (A) should be implemented independently of (B).
> >
> 
> My thought on this was that we don't need much to really implement
> either feature.  Really only a bit or two for either one.  I had
> thought about extending the PCI Advanced Features, but for now it
> might make more sense to just implement it as a vendor capability for
> the QEMU based bridges instead of trying to make this a true PCI
> capability since I am not sure if this in any way would apply to
> physical hardware.  The fact is the PCI Advanced Features capability
> is essentially just a vendor specific capability with a different ID

Interesting. I see it more as a backport of pci express
features to pci.

> so if we were to use 2 bits that are currently reserved in the
> capability we could later merge the functionality without much
> overhead.

Don't do this. You must not touch reserved bits.

> I fully agree that the two implementations should be separate but
> nothing says we have to implement them completely different.  If we
> are just using 3 bits for capability, status, and control of each
> feature there is no reason for them to need to be stored in separate
> locations.

True.

> >> I've been poking around in the kernel and QEMU code and the part I
> >> have been trying to sort out is how to get QEMU based pci-bridge to
> >> use the SHPC driver because from what I can tell the driver never
> >> actually gets loaded on the device as it is left in the control of
> >> ACPI hot-plug.
> >
> > There are ways, but you can just use pci express, it's easier.
> 
> That's true.  I should probably just give up on trying to do an
> implementation that works with the i440fx implementation.  I could
> probably move over to the q35 and once that is done then we could look
> at something like the PCI Advanced Features solution for something
> like the PCI-bridge drivers.
> 
> - Alex

Once we have a decent idea of what's required, I can write
an ECN for pci code and id assignment specification.
That's cleaner than vendor specific stuff that's tied to
a specific device/vendor ID.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-28  8:51                             ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-28  8:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel, lcapitulino,
	Blue Swirl, kraxel, Rustad, Mark D, quintela, Skidmore, Donald C,
	Alexander Graf, Or Gerlitz, Dr. David Alan Gilbert,
	Alex Williamson, Anthony Liguori, cornelia.huck, Lan Tianyu,
	Ard Biesheuvel, Dong, Eddie, Jani, Nrupal, amit.shah,
	Paolo Bonzini

On Sun, Dec 27, 2015 at 01:45:15PM -0800, Alexander Duyck wrote:
> On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
> >> The PCI hot-plug specification calls out that the OS can optionally
> >> implement a "pause" mechanism which is meant to be used for high
> >> availability type environments.  What I am proposing is basically
> >> extending the standard SHPC capable PCI bridge so that we can support
> >> the DMA page dirtying for everything hosted on it, add a vendor
> >> specific block to the config space so that the guest can notify the
> >> host that it will do page dirtying, and add a mechanism to indicate
> >> that all hot-plug events during the warm-up phase of the migration are
> >> pause events instead of full removals.
> >
> > Two comments:
> >
> > 1. A vendor specific capability will always be problematic.
> > Better to register a capability id with pci sig.
> >
> > 2. There are actually several capabilities:
> >
> > A. support for memory dirtying
> >         if not supported, we must stop device before migration
> >
> >         This is supported by core guest OS code,
> >         using patches similar to posted by you.
> >
> >
> > B. support for device replacement
> >         This is a faster form of hotplug, where device is removed and
> >         later another device using same driver is inserted in the same slot.
> >
> >         This is a possible optimization, but I am convinced
> >         (A) should be implemented independently of (B).
> >
> 
> My thought on this was that we don't need much to really implement
> either feature.  Really only a bit or two for either one.  I had
> thought about extending the PCI Advanced Features, but for now it
> might make more sense to just implement it as a vendor capability for
> the QEMU based bridges instead of trying to make this a true PCI
> capability since I am not sure if this in any way would apply to
> physical hardware.  The fact is the PCI Advanced Features capability
> is essentially just a vendor specific capability with a different ID

Interesting. I see it more as a backport of pci express
features to pci.

> so if we were to use 2 bits that are currently reserved in the
> capability we could later merge the functionality without much
> overhead.

Don't do this. You must not touch reserved bits.

> I fully agree that the two implementations should be separate but
> nothing says we have to implement them completely different.  If we
> are just using 3 bits for capability, status, and control of each
> feature there is no reason for them to need to be stored in separate
> locations.

True.

> >> I've been poking around in the kernel and QEMU code and the part I
> >> have been trying to sort out is how to get QEMU based pci-bridge to
> >> use the SHPC driver because from what I can tell the driver never
> >> actually gets loaded on the device as it is left in the control of
> >> ACPI hot-plug.
> >
> > There are ways, but you can just use pci express, it's easier.
> 
> That's true.  I should probably just give up on trying to do an
> implementation that works with the i440fx implementation.  I could
> probably move over to the q35 and once that is done then we could look
> at something like the PCI Advanced Features solution for something
> like the PCI-bridge drivers.
> 
> - Alex

Once we have a decent idea of what's required, I can write
an ECN for pci code and id assignment specification.
That's cleaner than vendor specific stuff that's tied to
a specific device/vendor ID.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* RE: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-14  9:26                   ` Michael S. Tsirkin
@ 2015-12-28  8:52                     ` Pavel Fedin
  -1 siblings, 0 replies; 142+ messages in thread
From: Pavel Fedin @ 2015-12-28  8:52 UTC (permalink / raw)
  To: 'Michael S. Tsirkin', 'Lan, Tianyu'
  Cc: 'Yang Zhang', 'Tantilov, Emil S',
	kvm, aik, qemu-devel, 'Alexander Duyck',
	lcapitulino, 'Blue Swirl',
	kraxel, 'Rustad, Mark D',
	quintela, 'Skidmore, Donald C', 'Alexander Graf',
	'Or Gerlitz', 'Dr. David Alan Gilbert',
	'Alex Williamson', 'Anthony Liguori',
	cornelia.huck, 'Ard Biesheuvel', 'Dong, Eddie',
	'Jani, Nrupal', amit.shah, 'Paolo Bonzini'

 Hello!

> A dedicated IRQ per device for something that is a system wide event
> sounds like a waste.  I don't understand why a spec change is strictly
> required, we only need to support this with the specific virtual bridge
> used by QEMU, so I think that a vendor specific capability will do.
> Once this works well in the field, a PCI spec ECN might make sense
> to standardise the capability.

 Keeping track of your discussion for some time, decided to jump in...
 So far, we want to have some kind of mailbox to notify the quest about migration. So what about some dedicated "pci device" for
this purpose? Some kind of "migration controller". This is:
a) perhaps easier to implement than capability, we don't need to push anything to PCI spec.
b) could easily make friendship with Windows, because this means that no bus code has to be touched at all. It would rely only on
drivers' ability to communicate with each other (i guess it should be possible in Windows, isn't it?)
c) does not need to steal resources (BARs, IRQs, etc) from the actual devices.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-28  8:52                     ` Pavel Fedin
  0 siblings, 0 replies; 142+ messages in thread
From: Pavel Fedin @ 2015-12-28  8:52 UTC (permalink / raw)
  To: 'Michael S. Tsirkin', 'Lan, Tianyu'
  Cc: 'Yang Zhang', 'Tantilov, Emil S',
	kvm, aik, qemu-devel, 'Alexander Duyck',
	lcapitulino, 'Blue Swirl',
	kraxel, 'Rustad, Mark D',
	quintela, 'Skidmore, Donald C', 'Alexander Graf',
	'Or Gerlitz', 'Dr. David Alan Gilbert',
	'Alex Williamson', 'Anthony Liguori',
	cornelia.huck, 'Ard Biesheuvel', 'Dong, Eddie',
	'Jani, Nrupal', amit.shah, 'Paolo Bonzini'

 Hello!

> A dedicated IRQ per device for something that is a system wide event
> sounds like a waste.  I don't understand why a spec change is strictly
> required, we only need to support this with the specific virtual bridge
> used by QEMU, so I think that a vendor specific capability will do.
> Once this works well in the field, a PCI spec ECN might make sense
> to standardise the capability.

 Keeping track of your discussion for some time, decided to jump in...
 So far, we want to have some kind of mailbox to notify the quest about migration. So what about some dedicated "pci device" for
this purpose? Some kind of "migration controller". This is:
a) perhaps easier to implement than capability, we don't need to push anything to PCI spec.
b) could easily make friendship with Windows, because this means that no bus code has to be touched at all. It would rely only on
drivers' ability to communicate with each other (i guess it should be possible in Windows, isn't it?)
c) does not need to steal resources (BARs, IRQs, etc) from the actual devices.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-28  3:20                         ` Dong, Eddie
@ 2015-12-28 11:50                           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-28 11:50 UTC (permalink / raw)
  To: Dong, Eddie
  Cc: Alexander Duyck, Lan, Tianyu, Dr. David Alan Gilbert, Yang Zhang,
	qemu-devel, Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore,
	Donald C, quintela, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D

On Mon, Dec 28, 2015 at 03:20:10AM +0000, Dong, Eddie wrote:
> > >
> > > Even if the device driver doesn't support migration, you still want to
> > > migrate VM? That maybe risk and we should add the "bad path" for the
> > > driver at least.
> > 
> > At a minimum we should have support for hot-plug if we are expecting to
> > support migration.  You would simply have to hot-plug the device before you
> > start migration and then return it after.  That is how the current bonding
> > approach for this works if I am not mistaken.
> 
> Hotplug is good to eliminate the device spefic state clone, but
> bonding approach is very network specific, it doesn’t work for other
> devices such as FPGA device, QaT devices & GPU devices, which we plan
> to support gradually :)

Alexander didn't say do bonding. He just said bonding uses hot-unplug.

Gradual and generic is the correct approach. So focus on splitting the
work into manageable pieces which are also useful by themselves, and
generally reusable by different devices.

So live the pausing alone for a moment.

Start from Alexander's patchset for tracking dirty memory, add a way to
control and detect it from userspace (and maybe from host), and a way to
start migration while device is attached, removing it at the last
possible moment.

That will be a nice first step.


> > 
> > The advantage we are looking to gain is to avoid removing/disabling the
> > device for as long as possible.  Ideally we want to keep the device active
> > through the warm-up period, but if the guest doesn't do that we should still
> > be able to fall back on the older approaches if needed.
> > 

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-28 11:50                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-28 11:50 UTC (permalink / raw)
  To: Dong, Eddie
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel,
	Alexander Duyck, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Lan, Tianyu, Ard Biesheuvel, Jani, Nrupal,
	amit.shah, Paolo Bonzini

On Mon, Dec 28, 2015 at 03:20:10AM +0000, Dong, Eddie wrote:
> > >
> > > Even if the device driver doesn't support migration, you still want to
> > > migrate VM? That maybe risk and we should add the "bad path" for the
> > > driver at least.
> > 
> > At a minimum we should have support for hot-plug if we are expecting to
> > support migration.  You would simply have to hot-plug the device before you
> > start migration and then return it after.  That is how the current bonding
> > approach for this works if I am not mistaken.
> 
> Hotplug is good to eliminate the device spefic state clone, but
> bonding approach is very network specific, it doesn’t work for other
> devices such as FPGA device, QaT devices & GPU devices, which we plan
> to support gradually :)

Alexander didn't say do bonding. He just said bonding uses hot-unplug.

Gradual and generic is the correct approach. So focus on splitting the
work into manageable pieces which are also useful by themselves, and
generally reusable by different devices.

So live the pausing alone for a moment.

Start from Alexander's patchset for tracking dirty memory, add a way to
control and detect it from userspace (and maybe from host), and a way to
start migration while device is attached, removing it at the last
possible moment.

That will be a nice first step.


> > 
> > The advantage we are looking to gain is to avoid removing/disabling the
> > device for as long as possible.  Ideally we want to keep the device active
> > through the warm-up period, but if the guest doesn't do that we should still
> > be able to fall back on the older approaches if needed.
> > 

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-28  8:52                     ` Pavel Fedin
@ 2015-12-28 11:51                       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-28 11:51 UTC (permalink / raw)
  To: Pavel Fedin
  Cc: 'Lan, Tianyu', 'Yang Zhang',
	'Tantilov, Emil S',
	kvm, aik, qemu-devel, 'Alexander Duyck',
	lcapitulino, 'Blue Swirl',
	kraxel, 'Rustad, Mark D',
	quintela, 'Skidmore, Donald C', 'Alexander Graf',
	'Or Gerlitz', 'Dr. David Alan Gilbert',
	'Alex Williamson', 'Anthony Liguori',
	cornelia.huck, 'Ard Biesheuvel', 'Dong, Eddie',
	'Jani, Nrupal', amit.shah, 'Paolo Bonzini'

On Mon, Dec 28, 2015 at 11:52:43AM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > A dedicated IRQ per device for something that is a system wide event
> > sounds like a waste.  I don't understand why a spec change is strictly
> > required, we only need to support this with the specific virtual bridge
> > used by QEMU, so I think that a vendor specific capability will do.
> > Once this works well in the field, a PCI spec ECN might make sense
> > to standardise the capability.
> 
>  Keeping track of your discussion for some time, decided to jump in...
>  So far, we want to have some kind of mailbox to notify the quest about migration. So what about some dedicated "pci device" for
> this purpose? Some kind of "migration controller". This is:
> a) perhaps easier to implement than capability, we don't need to push anything to PCI spec.
> b) could easily make friendship with Windows, because this means that no bus code has to be touched at all. It would rely only on
> drivers' ability to communicate with each other (i guess it should be possible in Windows, isn't it?)
> c) does not need to steal resources (BARs, IRQs, etc) from the actual devices.
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 

Sure, or we can use an ACPI device.  It doesn't really matter what we do
for the mailbox. Whoever writes this first will get to select a
mechanism.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-28 11:51                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-28 11:51 UTC (permalink / raw)
  To: Pavel Fedin
  Cc: 'Yang Zhang', 'Tantilov, Emil S',
	kvm, aik, qemu-devel, 'Alexander Duyck',
	lcapitulino, 'Blue Swirl',
	kraxel, 'Rustad, Mark D',
	quintela, 'Skidmore, Donald C', 'Alexander Graf',
	'Or Gerlitz', 'Dr. David Alan Gilbert',
	'Alex Williamson', 'Anthony Liguori',
	cornelia.huck, 'Lan, Tianyu', 'Ard Biesheuvel',
	'Dong, Eddie', 'Jani, Nrupal',
	amit.shah, 'Paolo Bonzini'

On Mon, Dec 28, 2015 at 11:52:43AM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > A dedicated IRQ per device for something that is a system wide event
> > sounds like a waste.  I don't understand why a spec change is strictly
> > required, we only need to support this with the specific virtual bridge
> > used by QEMU, so I think that a vendor specific capability will do.
> > Once this works well in the field, a PCI spec ECN might make sense
> > to standardise the capability.
> 
>  Keeping track of your discussion for some time, decided to jump in...
>  So far, we want to have some kind of mailbox to notify the quest about migration. So what about some dedicated "pci device" for
> this purpose? Some kind of "migration controller". This is:
> a) perhaps easier to implement than capability, we don't need to push anything to PCI spec.
> b) could easily make friendship with Windows, because this means that no bus code has to be touched at all. It would rely only on
> drivers' ability to communicate with each other (i guess it should be possible in Windows, isn't it?)
> c) does not need to steal resources (BARs, IRQs, etc) from the actual devices.
> 
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia
> 

Sure, or we can use an ACPI device.  It doesn't really matter what we do
for the mailbox. Whoever writes this first will get to select a
mechanism.

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-25 12:11                       ` Michael S. Tsirkin
@ 2015-12-28 17:42                         ` Lan, Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-28 17:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz



On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> As long as you keep up this vague talk about performance during
> migration, without even bothering with any measurements, this patchset
> will keep going nowhere.
>

I measured network service downtime for "keep device alive"(RFC patch V1 
presented) and "put down and up network interface"(RFC patch V2 
presented) during migration with some optimizations.

The former is around 140ms and the later is around 240ms.

My patchset relies on the maibox irq which doesn't work in the suspend 
state and so can't get downtime for suspend/resume cases. Will try to 
get the result later.

>
>
>
> There's Alex's patch that tracks memory changes during migration.  It
> needs some simple enhancements to be useful in production (e.g. add a
> host/guest handshake to both enable tracking in guest and to detect the
> support in host), then it can allow starting migration with an assigned
> device, by invoking hot-unplug after most of memory have been migrated.
>
> Please implement this in qemu and measure the speed.

Sure. Will do that.

> I will not be surprised if destroying/creating netdev in linux
> turns out to take too long, but before anyone bothered
> checking, it does not make sense to discuss further enhancements.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-28 17:42                         ` Lan, Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan, Tianyu @ 2015-12-28 17:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel,
	Alexander Duyck, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini



On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> As long as you keep up this vague talk about performance during
> migration, without even bothering with any measurements, this patchset
> will keep going nowhere.
>

I measured network service downtime for "keep device alive"(RFC patch V1 
presented) and "put down and up network interface"(RFC patch V2 
presented) during migration with some optimizations.

The former is around 140ms and the later is around 240ms.

My patchset relies on the maibox irq which doesn't work in the suspend 
state and so can't get downtime for suspend/resume cases. Will try to 
get the result later.

>
>
>
> There's Alex's patch that tracks memory changes during migration.  It
> needs some simple enhancements to be useful in production (e.g. add a
> host/guest handshake to both enable tracking in guest and to detect the
> support in host), then it can allow starting migration with an assigned
> device, by invoking hot-unplug after most of memory have been migrated.
>
> Please implement this in qemu and measure the speed.

Sure. Will do that.

> I will not be surprised if destroying/creating netdev in linux
> turns out to take too long, but before anyone bothered
> checking, it does not make sense to discuss further enhancements.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-28 17:42                         ` Lan, Tianyu
@ 2015-12-29 16:46                           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-29 16:46 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Alexander Duyck, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz

On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> >As long as you keep up this vague talk about performance during
> >migration, without even bothering with any measurements, this patchset
> >will keep going nowhere.
> >
> 
> I measured network service downtime for "keep device alive"(RFC patch V1
> presented) and "put down and up network interface"(RFC patch V2 presented)
> during migration with some optimizations.
> 
> The former is around 140ms and the later is around 240ms.
> 
> My patchset relies on the maibox irq which doesn't work in the suspend state
> and so can't get downtime for suspend/resume cases. Will try to get the
> result later.


Interesting. So you sare saying merely ifdown/ifup is 100ms?
This does not sound reasonable.
Is there a chance you are e.g. getting IP from dhcp?

If so that is wrong - clearly should reconfigure the old IP
back without playing with dhcp. For testing, just set up
a static IP.

> >
> >
> >
> >There's Alex's patch that tracks memory changes during migration.  It
> >needs some simple enhancements to be useful in production (e.g. add a
> >host/guest handshake to both enable tracking in guest and to detect the
> >support in host), then it can allow starting migration with an assigned
> >device, by invoking hot-unplug after most of memory have been migrated.
> >
> >Please implement this in qemu and measure the speed.
> 
> Sure. Will do that.
> 
> >I will not be surprised if destroying/creating netdev in linux
> >turns out to take too long, but before anyone bothered
> >checking, it does not make sense to discuss further enhancements.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-29 16:46                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-29 16:46 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel,
	Alexander Duyck, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini

On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> >As long as you keep up this vague talk about performance during
> >migration, without even bothering with any measurements, this patchset
> >will keep going nowhere.
> >
> 
> I measured network service downtime for "keep device alive"(RFC patch V1
> presented) and "put down and up network interface"(RFC patch V2 presented)
> during migration with some optimizations.
> 
> The former is around 140ms and the later is around 240ms.
> 
> My patchset relies on the maibox irq which doesn't work in the suspend state
> and so can't get downtime for suspend/resume cases. Will try to get the
> result later.


Interesting. So you sare saying merely ifdown/ifup is 100ms?
This does not sound reasonable.
Is there a chance you are e.g. getting IP from dhcp?

If so that is wrong - clearly should reconfigure the old IP
back without playing with dhcp. For testing, just set up
a static IP.

> >
> >
> >
> >There's Alex's patch that tracks memory changes during migration.  It
> >needs some simple enhancements to be useful in production (e.g. add a
> >host/guest handshake to both enable tracking in guest and to detect the
> >support in host), then it can allow starting migration with an assigned
> >device, by invoking hot-unplug after most of memory have been migrated.
> >
> >Please implement this in qemu and measure the speed.
> 
> Sure. Will do that.
> 
> >I will not be surprised if destroying/creating netdev in linux
> >turns out to take too long, but before anyone bothered
> >checking, it does not make sense to discuss further enhancements.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-29 16:46                           ` Michael S. Tsirkin
@ 2015-12-29 17:04                             ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-29 17:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lan, Tianyu, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz

On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>> >As long as you keep up this vague talk about performance during
>> >migration, without even bothering with any measurements, this patchset
>> >will keep going nowhere.
>> >
>>
>> I measured network service downtime for "keep device alive"(RFC patch V1
>> presented) and "put down and up network interface"(RFC patch V2 presented)
>> during migration with some optimizations.
>>
>> The former is around 140ms and the later is around 240ms.
>>
>> My patchset relies on the maibox irq which doesn't work in the suspend state
>> and so can't get downtime for suspend/resume cases. Will try to get the
>> result later.
>
>
> Interesting. So you sare saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?

Actually it wouldn't surprise me if that is due to a reset logic in
the driver.  For starters there is a 10 msec delay in the call
ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
clear registers after the VF has requested a reset.  There is also a
10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
were disabled.  That is in addition to the fact that the function that
disables the queues does so serially and polls each queue until the
hardware acknowledges that the queues are actually disabled.  The
driver also does the serial enable with poll logic on re-enabling the
queues which likely doesn't help things.

Really this driver is probably in need of a refactor to clean the
cruft out of the reset and initialization logic.  I suspect we have
far more delays than we really need and that is the source of much of
the slow down.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-29 17:04                             ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-29 17:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel, lcapitulino,
	Blue Swirl, kraxel, Rustad, Mark D, quintela, Skidmore, Donald C,
	Alexander Graf, Or Gerlitz, Dr. David Alan Gilbert,
	Alex Williamson, Anthony Liguori, cornelia.huck, Lan, Tianyu,
	Ard Biesheuvel, Dong, Eddie, Jani, Nrupal, amit.shah,
	Paolo Bonzini

On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>> >As long as you keep up this vague talk about performance during
>> >migration, without even bothering with any measurements, this patchset
>> >will keep going nowhere.
>> >
>>
>> I measured network service downtime for "keep device alive"(RFC patch V1
>> presented) and "put down and up network interface"(RFC patch V2 presented)
>> during migration with some optimizations.
>>
>> The former is around 140ms and the later is around 240ms.
>>
>> My patchset relies on the maibox irq which doesn't work in the suspend state
>> and so can't get downtime for suspend/resume cases. Will try to get the
>> result later.
>
>
> Interesting. So you sare saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?

Actually it wouldn't surprise me if that is due to a reset logic in
the driver.  For starters there is a 10 msec delay in the call
ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
clear registers after the VF has requested a reset.  There is also a
10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
were disabled.  That is in addition to the fact that the function that
disables the queues does so serially and polls each queue until the
hardware acknowledges that the queues are actually disabled.  The
driver also does the serial enable with poll logic on re-enabling the
queues which likely doesn't help things.

Really this driver is probably in need of a refactor to clean the
cruft out of the reset and initialization logic.  I suspect we have
far more delays than we really need and that is the source of much of
the slow down.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: live migration vs device assignment (motivation)
  2015-12-29 17:04                             ` Alexander Duyck
@ 2015-12-29 17:15                               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-29 17:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel, lcapitulino,
	Blue Swirl, kraxel, Rustad, Mark D, quintela, Skidmore, Donald C,
	Alexander Graf, Or Gerlitz, Dr. David Alan Gilbert,
	Alex Williamson, Anthony Liguori, cornelia.huck, Lan, Tianyu,
	Ard Biesheuvel, Dong, Eddie, Jani, Nrupal, amit.shah,
	Paolo Bonzini

On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
> >>
> >>
> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> >> >As long as you keep up this vague talk about performance during
> >> >migration, without even bothering with any measurements, this patchset
> >> >will keep going nowhere.
> >> >
> >>
> >> I measured network service downtime for "keep device alive"(RFC patch V1
> >> presented) and "put down and up network interface"(RFC patch V2 presented)
> >> during migration with some optimizations.
> >>
> >> The former is around 140ms and the later is around 240ms.
> >>
> >> My patchset relies on the maibox irq which doesn't work in the suspend state
> >> and so can't get downtime for suspend/resume cases. Will try to get the
> >> result later.
> >
> >
> > Interesting. So you sare saying merely ifdown/ifup is 100ms?
> > This does not sound reasonable.
> > Is there a chance you are e.g. getting IP from dhcp?
> 
> 
> Actually it wouldn't surprise me if that is due to a reset logic in
> the driver.  For starters there is a 10 msec delay in the call
> ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
> clear registers after the VF has requested a reset.  There is also a
> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
> were disabled.  That is in addition to the fact that the function that
> disables the queues does so serially and polls each queue until the
> hardware acknowledges that the queues are actually disabled.  The
> driver also does the serial enable with poll logic on re-enabling the
> queues which likely doesn't help things.
> 
> Really this driver is probably in need of a refactor to clean the
> cruft out of the reset and initialization logic.  I suspect we have
> far more delays than we really need and that is the source of much of
> the slow down.
> 
> - Alex

For ifdown, why is there any need to reset the device at all?
Is it so buffers can be reclaimed?

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-29 17:15                               ` Michael S. Tsirkin
  0 siblings, 0 replies; 142+ messages in thread
From: Michael S. Tsirkin @ 2015-12-29 17:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel, lcapitulino,
	Blue Swirl, kraxel, Rustad, Mark D, quintela, Skidmore, Donald C,
	Alexander Graf, Or Gerlitz, Dr. David Alan Gilbert,
	Alex Williamson, Anthony Liguori, cornelia.huck, Lan, Tianyu,
	Ard Biesheuvel, Dong, Eddie, Jani, Nrupal, amit.shah,
	Paolo Bonzini

On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
> >>
> >>
> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> >> >As long as you keep up this vague talk about performance during
> >> >migration, without even bothering with any measurements, this patchset
> >> >will keep going nowhere.
> >> >
> >>
> >> I measured network service downtime for "keep device alive"(RFC patch V1
> >> presented) and "put down and up network interface"(RFC patch V2 presented)
> >> during migration with some optimizations.
> >>
> >> The former is around 140ms and the later is around 240ms.
> >>
> >> My patchset relies on the maibox irq which doesn't work in the suspend state
> >> and so can't get downtime for suspend/resume cases. Will try to get the
> >> result later.
> >
> >
> > Interesting. So you sare saying merely ifdown/ifup is 100ms?
> > This does not sound reasonable.
> > Is there a chance you are e.g. getting IP from dhcp?
> 
> 
> Actually it wouldn't surprise me if that is due to a reset logic in
> the driver.  For starters there is a 10 msec delay in the call
> ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
> clear registers after the VF has requested a reset.  There is also a
> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
> were disabled.  That is in addition to the fact that the function that
> disables the queues does so serially and polls each queue until the
> hardware acknowledges that the queues are actually disabled.  The
> driver also does the serial enable with poll logic on re-enabling the
> queues which likely doesn't help things.
> 
> Really this driver is probably in need of a refactor to clean the
> cruft out of the reset and initialization logic.  I suspect we have
> far more delays than we really need and that is the source of much of
> the slow down.
> 
> - Alex

For ifdown, why is there any need to reset the device at all?
Is it so buffers can be reclaimed?

-- 
MST

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-29 17:15                               ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-29 18:04                                 ` Alexander Duyck
  -1 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-29 18:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lan, Tianyu, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz

On Tue, Dec 29, 2015 at 9:15 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
>> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>> >>
>> >>
>> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>> >> >As long as you keep up this vague talk about performance during
>> >> >migration, without even bothering with any measurements, this patchset
>> >> >will keep going nowhere.
>> >> >
>> >>
>> >> I measured network service downtime for "keep device alive"(RFC patch V1
>> >> presented) and "put down and up network interface"(RFC patch V2 presented)
>> >> during migration with some optimizations.
>> >>
>> >> The former is around 140ms and the later is around 240ms.
>> >>
>> >> My patchset relies on the maibox irq which doesn't work in the suspend state
>> >> and so can't get downtime for suspend/resume cases. Will try to get the
>> >> result later.
>> >
>> >
>> > Interesting. So you sare saying merely ifdown/ifup is 100ms?
>> > This does not sound reasonable.
>> > Is there a chance you are e.g. getting IP from dhcp?
>>
>>
>> Actually it wouldn't surprise me if that is due to a reset logic in
>> the driver.  For starters there is a 10 msec delay in the call
>> ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
>> clear registers after the VF has requested a reset.  There is also a
>> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
>> were disabled.  That is in addition to the fact that the function that
>> disables the queues does so serially and polls each queue until the
>> hardware acknowledges that the queues are actually disabled.  The
>> driver also does the serial enable with poll logic on re-enabling the
>> queues which likely doesn't help things.
>>
>> Really this driver is probably in need of a refactor to clean the
>> cruft out of the reset and initialization logic.  I suspect we have
>> far more delays than we really need and that is the source of much of
>> the slow down.
>
> For ifdown, why is there any need to reset the device at all?
> Is it so buffers can be reclaimed?
>

I believe it is mostly historical.  All the Intel drivers are derived
from e1000.  The e1000 has a 10ms sleep to allow outstanding PCI
transactions to complete before resetting and it looks like they ended
up inheriting that in the ixgbevf driver.  I suppose it does allow for
the buffers to be reclaimed which is something we may need, though the
VF driver should have already verified that it disabled the queues
when it was doing the polling on the bits being cleared in the
individual queue control registers.  Likely the 10ms sleep is
redundant as a result.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2015-12-29 18:04                                 ` Alexander Duyck
  0 siblings, 0 replies; 142+ messages in thread
From: Alexander Duyck @ 2015-12-29 18:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel, lcapitulino,
	Blue Swirl, kraxel, Rustad, Mark D, quintela, Skidmore, Donald C,
	Alexander Graf, Or Gerlitz, Dr. David Alan Gilbert,
	Alex Williamson, Anthony Liguori, cornelia.huck, Lan, Tianyu,
	Ard Biesheuvel, Dong, Eddie, Jani, Nrupal, amit.shah,
	Paolo Bonzini

On Tue, Dec 29, 2015 at 9:15 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
>> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>> >>
>> >>
>> >> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>> >> >As long as you keep up this vague talk about performance during
>> >> >migration, without even bothering with any measurements, this patchset
>> >> >will keep going nowhere.
>> >> >
>> >>
>> >> I measured network service downtime for "keep device alive"(RFC patch V1
>> >> presented) and "put down and up network interface"(RFC patch V2 presented)
>> >> during migration with some optimizations.
>> >>
>> >> The former is around 140ms and the later is around 240ms.
>> >>
>> >> My patchset relies on the maibox irq which doesn't work in the suspend state
>> >> and so can't get downtime for suspend/resume cases. Will try to get the
>> >> result later.
>> >
>> >
>> > Interesting. So you sare saying merely ifdown/ifup is 100ms?
>> > This does not sound reasonable.
>> > Is there a chance you are e.g. getting IP from dhcp?
>>
>>
>> Actually it wouldn't surprise me if that is due to a reset logic in
>> the driver.  For starters there is a 10 msec delay in the call
>> ixgbevf_reset_hw_vf which I believe is present to allow the PF time to
>> clear registers after the VF has requested a reset.  There is also a
>> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
>> were disabled.  That is in addition to the fact that the function that
>> disables the queues does so serially and polls each queue until the
>> hardware acknowledges that the queues are actually disabled.  The
>> driver also does the serial enable with poll logic on re-enabling the
>> queues which likely doesn't help things.
>>
>> Really this driver is probably in need of a refactor to clean the
>> cruft out of the reset and initialization logic.  I suspect we have
>> far more delays than we really need and that is the source of much of
>> the slow down.
>
> For ifdown, why is there any need to reset the device at all?
> Is it so buffers can be reclaimed?
>

I believe it is mostly historical.  All the Intel drivers are derived
from e1000.  The e1000 has a 10ms sleep to allow outstanding PCI
transactions to complete before resetting and it looks like they ended
up inheriting that in the ixgbevf driver.  I suppose it does allow for
the buffers to be reclaimed which is something we may need, though the
VF driver should have already verified that it disabled the queues
when it was doing the polling on the bits being cleared in the
individual queue control registers.  Likely the 10ms sleep is
redundant as a result.

- Alex

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
  2015-12-29 16:46                           ` Michael S. Tsirkin
@ 2016-01-04  2:15                             ` Lan Tianyu
  -1 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2016-01-04  2:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Dr. David Alan Gilbert, Yang Zhang, qemu-devel,
	Tantilov, Emil S, kvm, Ard Biesheuvel, aik, Skidmore, Donald C,
	quintela, Dong, Eddie, Jani, Nrupal, Alexander Graf, Blue Swirl,
	cornelia.huck, Alex Williamson, kraxel, Anthony Liguori,
	amit.shah, Paolo Bonzini, Rustad, Mark D, lcapitulino,
	Or Gerlitz

On 2015年12月30日 00:46, Michael S. Tsirkin wrote:
> Interesting. So you sare saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?
> 
> If so that is wrong - clearly should reconfigure the old IP
> back without playing with dhcp. For testing, just set up
> a static IP.

MAC and IP are migrated with VM to target machine and not need to
reconfigure IP after migration.

From my test result, ixgbevf_down() consumes 35ms and ixgbevf_up()
consumes 55ms during migration.

-- 
Best regards
Tianyu Lan

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] live migration vs device assignment (motivation)
@ 2016-01-04  2:15                             ` Lan Tianyu
  0 siblings, 0 replies; 142+ messages in thread
From: Lan Tianyu @ 2016-01-04  2:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yang Zhang, Tantilov, Emil S, kvm, aik, qemu-devel,
	Alexander Duyck, lcapitulino, Blue Swirl, kraxel, Rustad, Mark D,
	quintela, Skidmore, Donald C, Alexander Graf, Or Gerlitz,
	Dr. David Alan Gilbert, Alex Williamson, Anthony Liguori,
	cornelia.huck, Ard Biesheuvel, Dong, Eddie, Jani, Nrupal,
	amit.shah, Paolo Bonzini

On 2015年12月30日 00:46, Michael S. Tsirkin wrote:
> Interesting. So you sare saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?
> 
> If so that is wrong - clearly should reconfigure the old IP
> back without playing with dhcp. For testing, just set up
> a static IP.

MAC and IP are migrated with VM to target machine and not need to
reconfigure IP after migration.

>From my test result, ixgbevf_down() consumes 35ms and ixgbevf_up()
consumes 55ms during migration.

-- 
Best regards
Tianyu Lan

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
  2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
@ 2016-03-17  9:15   ` Wei Yang
  -1 siblings, 0 replies; 142+ messages in thread
From: Wei Yang @ 2016-03-17  9:15 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: aik, alex.williamson, amit.shah, anthony, ard.biesheuvel,
	blauwirbel, cornelia.huck, eddie.dong, nrupal.jani, agraf, kvm,
	pbonzini, qemu-devel, emil.s.tantilov, gerlitz.or,
	donald.c.skidmore, mark.d.rustad, mst, kraxel, lcapitulino,
	quintela

Hi, Tianyu,

I am testing your V2 patch set in our environment, while facing two issues
now. Have a workaround for the first one and hope you could share some light
on the second one :-)

1. Mismatch for ram_block (Have a workaround)
-------------------------------------------------------------------------------
Below is the error message on the destination:

    qemu-system-x86_64: Length mismatch: : 0x3000 in != 0x4000: Invalid argument
    qemu-system-x86_64: error while loading state for instance 0x0 of device 'ram'
    qemu-system-x86_64: load of migration failed: Invalid argument

With the following command line on source and destination respectively:

    git/qemu/x86_64-softmmu/qemu-system-x86_64 --enable-kvm -m 4096 -smp 4 --nographic -drive file=/root/nfs/rhel.img,format=raw,cache=none -device vfio-sriov,host=0000:03:10.0
    git/qemu/x86_64-softmmu/qemu-system-x86_64 --enable-kvm -m 4096 -smp 4 --nographic -drive file=/root/nfs/rhel.img,format=raw,cache=none -device vfio-sriov,host=0000:03:10.0 --incoming tcp:0:4444

By some debugging, the reason for this error is the ram_block->idstr of
pass-through MMIO region is not set.

My workaround is to add vmstate_register_ram() in vfio_mmap_region() after
memory_region_init_ram_ptr() returns.

I think this is not a good solution, since the ram_block->idstr is coded
with the VF's BDF. So I guess this will not work when the VF has different BDF
from source to destination respectively.

Maybe my test step is not correct?

2. Failed to migrate the MAC address
-------------------------------------------------------------------------------

By adding some code in VF's driver in destination guest, I found the MAC
information has been migrated to destination in adapter->hw.mac. While this is
"reset" by VF's driver, when ixgbevf_migration_task is invoked at the end of
the migration process.

Below is what I have printed:

The ifconfig output from destination:

    eth8      Link encap:Ethernet  HWaddr 52:54:00:81:39:F2  
              inet addr:9.31.210.106  Bcast:9.31.255.255  Mask:255.255.0.0
              inet6 addr: fe80::5054:ff:fe81:39f2/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:66 errors:0 dropped:0 overruns:0 frame:0
              TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:21840 (21.3 KiB)  TX bytes:920 (920.0 b)

The log message I printed in destination's VF driver:
    
    ixgbevf: migration end -- 
    ixgbevf: original mac:52:54:00:81:39:f2
    ixgbevf: after reset mac:52:54:00:92:04:a3
    ixgbevf: migration end  ==
    
I didn't take a close look in the "reset" function, while seems it retrieves
the mac from VF hardware. Hmm... is there some possible way to have the same
mac on both source and destination?

At last, I appreciated all your work and help, learned much from your side.

On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
>This patchset is to propose a solution of adding live migration
>support for SRIOV NIC.
>
>During migration, Qemu needs to let VF driver in the VM to know
>migration start and end. Qemu adds faked PCI migration capability
>to help to sync status between two sides during migration.
>
>Qemu triggers VF's mailbox irq via sending MSIX msg when migration
>status is changed. VF driver tells Qemu its mailbox vector index
>via the new PCI capability. In some cases(NIC is suspended or closed),
>VF mailbox irq is freed and VF driver can disable irq injecting via
>new capability.   
>
>VF driver will put down nic before migration and put up again on
>the target machine.
>
>Lan Tianyu (10):
>  Qemu/VFIO: Create head file pci.h to share data struct.
>  Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
>  Qemu/VFIO: Rework vfio_std_cap_max_size() function
>  Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space
>    regs
>  Qemu/VFIO: Expose PCI config space read/write and msix functions
>  Qemu/PCI: Add macros for faked PCI migration capability
>  Qemu: Add post_load_state() to run after restoring CPU state
>  Qemu: Add save_before_stop callback to run just before stopping VCPU
>    during migration
>  Qemu/VFIO: Add SRIOV VF migration support
>  Qemu/VFIO: Misc change for enable migration with VFIO
>
> hw/vfio/Makefile.objs       |   2 +-
> hw/vfio/pci.c               | 196 +++++++++-----------------------------------
> hw/vfio/pci.h               | 168 +++++++++++++++++++++++++++++++++++++
> hw/vfio/sriov.c             | 178 ++++++++++++++++++++++++++++++++++++++++
> include/hw/pci/pci_regs.h   |  19 +++++
> include/migration/vmstate.h |   5 ++
> include/sysemu/sysemu.h     |   1 +
> linux-headers/linux/vfio.h  |  16 ++++
> migration/migration.c       |   3 +-
> migration/savevm.c          |  28 +++++++
> 10 files changed, 459 insertions(+), 157 deletions(-)
> create mode 100644 hw/vfio/pci.h
> create mode 100644 hw/vfio/sriov.c
>
>-- 
>1.9.3
>
>

-- 
Richard Yang\nHelp you, Help me

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC
@ 2016-03-17  9:15   ` Wei Yang
  0 siblings, 0 replies; 142+ messages in thread
From: Wei Yang @ 2016-03-17  9:15 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: qemu-devel, emil.s.tantilov, kvm, ard.biesheuvel, aik,
	donald.c.skidmore, mst, eddie.dong, nrupal.jani, quintela, agraf,
	blauwirbel, cornelia.huck, alex.williamson, kraxel, anthony,
	amit.shah, pbonzini, mark.d.rustad, lcapitulino, gerlitz.or

Hi, Tianyu,

I am testing your V2 patch set in our environment, while facing two issues
now. Have a workaround for the first one and hope you could share some light
on the second one :-)

1. Mismatch for ram_block (Have a workaround)
-------------------------------------------------------------------------------
Below is the error message on the destination:

    qemu-system-x86_64: Length mismatch: : 0x3000 in != 0x4000: Invalid argument
    qemu-system-x86_64: error while loading state for instance 0x0 of device 'ram'
    qemu-system-x86_64: load of migration failed: Invalid argument

With the following command line on source and destination respectively:

    git/qemu/x86_64-softmmu/qemu-system-x86_64 --enable-kvm -m 4096 -smp 4 --nographic -drive file=/root/nfs/rhel.img,format=raw,cache=none -device vfio-sriov,host=0000:03:10.0
    git/qemu/x86_64-softmmu/qemu-system-x86_64 --enable-kvm -m 4096 -smp 4 --nographic -drive file=/root/nfs/rhel.img,format=raw,cache=none -device vfio-sriov,host=0000:03:10.0 --incoming tcp:0:4444

By some debugging, the reason for this error is the ram_block->idstr of
pass-through MMIO region is not set.

My workaround is to add vmstate_register_ram() in vfio_mmap_region() after
memory_region_init_ram_ptr() returns.

I think this is not a good solution, since the ram_block->idstr is coded
with the VF's BDF. So I guess this will not work when the VF has different BDF
from source to destination respectively.

Maybe my test step is not correct?

2. Failed to migrate the MAC address
-------------------------------------------------------------------------------

By adding some code in VF's driver in destination guest, I found the MAC
information has been migrated to destination in adapter->hw.mac. While this is
"reset" by VF's driver, when ixgbevf_migration_task is invoked at the end of
the migration process.

Below is what I have printed:

The ifconfig output from destination:

    eth8      Link encap:Ethernet  HWaddr 52:54:00:81:39:F2  
              inet addr:9.31.210.106  Bcast:9.31.255.255  Mask:255.255.0.0
              inet6 addr: fe80::5054:ff:fe81:39f2/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:66 errors:0 dropped:0 overruns:0 frame:0
              TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:21840 (21.3 KiB)  TX bytes:920 (920.0 b)

The log message I printed in destination's VF driver:
    
    ixgbevf: migration end -- 
    ixgbevf: original mac:52:54:00:81:39:f2
    ixgbevf: after reset mac:52:54:00:92:04:a3
    ixgbevf: migration end  ==
    
I didn't take a close look in the "reset" function, while seems it retrieves
the mac from VF hardware. Hmm... is there some possible way to have the same
mac on both source and destination?

At last, I appreciated all your work and help, learned much from your side.

On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
>This patchset is to propose a solution of adding live migration
>support for SRIOV NIC.
>
>During migration, Qemu needs to let VF driver in the VM to know
>migration start and end. Qemu adds faked PCI migration capability
>to help to sync status between two sides during migration.
>
>Qemu triggers VF's mailbox irq via sending MSIX msg when migration
>status is changed. VF driver tells Qemu its mailbox vector index
>via the new PCI capability. In some cases(NIC is suspended or closed),
>VF mailbox irq is freed and VF driver can disable irq injecting via
>new capability.   
>
>VF driver will put down nic before migration and put up again on
>the target machine.
>
>Lan Tianyu (10):
>  Qemu/VFIO: Create head file pci.h to share data struct.
>  Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition
>  Qemu/VFIO: Rework vfio_std_cap_max_size() function
>  Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space
>    regs
>  Qemu/VFIO: Expose PCI config space read/write and msix functions
>  Qemu/PCI: Add macros for faked PCI migration capability
>  Qemu: Add post_load_state() to run after restoring CPU state
>  Qemu: Add save_before_stop callback to run just before stopping VCPU
>    during migration
>  Qemu/VFIO: Add SRIOV VF migration support
>  Qemu/VFIO: Misc change for enable migration with VFIO
>
> hw/vfio/Makefile.objs       |   2 +-
> hw/vfio/pci.c               | 196 +++++++++-----------------------------------
> hw/vfio/pci.h               | 168 +++++++++++++++++++++++++++++++++++++
> hw/vfio/sriov.c             | 178 ++++++++++++++++++++++++++++++++++++++++
> include/hw/pci/pci_regs.h   |  19 +++++
> include/migration/vmstate.h |   5 ++
> include/sysemu/sysemu.h     |   1 +
> linux-headers/linux/vfio.h  |  16 ++++
> migration/migration.c       |   3 +-
> migration/savevm.c          |  28 +++++++
> 10 files changed, 459 insertions(+), 157 deletions(-)
> create mode 100644 hw/vfio/pci.h
> create mode 100644 hw/vfio/sriov.c
>
>-- 
>1.9.3
>
>

-- 
Richard Yang\nHelp you, Help me

^ permalink raw reply	[flat|nested] 142+ messages in thread

end of thread, other threads:[~2016-03-17  9:40 UTC | newest]

Thread overview: 142+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-24 13:35 [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC Lan Tianyu
2015-11-24 13:35 ` [Qemu-devel] " Lan Tianyu
2015-11-24 13:35 ` [RFC PATCH V2 01/10] Qemu/VFIO: Create head file pci.h to share data struct Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-11-24 13:35 ` [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-12-02 22:25   ` Alex Williamson
2015-12-02 22:25     ` [Qemu-devel] " Alex Williamson
2015-12-03  8:40     ` Lan, Tianyu
2015-12-03  8:40       ` [Qemu-devel] " Lan, Tianyu
2015-12-03 15:26       ` Alex Williamson
2015-12-03 15:26         ` [Qemu-devel] " Alex Williamson
2015-11-24 13:35 ` [RFC PATCH V2 03/10] Qemu/VFIO: Rework vfio_std_cap_max_size() function Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-11-24 13:35 ` [RFC PATCH V2 04/10] Qemu/VFIO: Add vfio_find_free_cfg_reg() to find free PCI config space regs Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-11-24 13:35 ` [RFC PATCH V2 05/10] Qemu/VFIO: Expose PCI config space read/write and msix functions Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-11-24 13:35 ` [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-12-02 22:25   ` Alex Williamson
2015-12-02 22:25     ` [Qemu-devel] " Alex Williamson
2015-12-03  8:57     ` Lan, Tianyu
2015-12-03  8:57       ` [Qemu-devel] " Lan, Tianyu
2015-11-24 13:35 ` [RFC PATCH V2 07/10] Qemu: Add post_load_state() to run after restoring CPU state Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-11-24 13:35 ` [RFC PATCH V2 08/10] Qemu: Add save_before_stop callback to run just before stopping VCPU during migration Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-11-24 13:35 ` [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-11-24 21:03   ` Michael S. Tsirkin
2015-11-24 21:03     ` [Qemu-devel] " Michael S. Tsirkin
2015-11-25 15:32     ` Lan, Tianyu
2015-11-25 15:32       ` [Qemu-devel] " Lan, Tianyu
2015-11-25 15:44       ` Michael S. Tsirkin
2015-11-25 15:44         ` [Qemu-devel] " Michael S. Tsirkin
2015-12-02 22:25   ` Alex Williamson
2015-12-02 22:25     ` [Qemu-devel] " Alex Williamson
2015-12-03  8:56     ` Lan, Tianyu
2015-12-03  8:56       ` [Qemu-devel] " Lan, Tianyu
2015-11-24 13:35 ` [RFC PATCH V2 10/10] Qemu/VFIO: Misc change for enable migration with VFIO Lan Tianyu
2015-11-24 13:35   ` [Qemu-devel] " Lan Tianyu
2015-11-30  8:01 ` [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC Michael S. Tsirkin
2015-11-30  8:01   ` [Qemu-devel] " Michael S. Tsirkin
2015-12-01  6:26   ` Lan, Tianyu
2015-12-01  6:26     ` [Qemu-devel] " Lan, Tianyu
2015-12-01 15:02     ` Michael S. Tsirkin
2015-12-01 15:02       ` [Qemu-devel] " Michael S. Tsirkin
2015-12-02 14:08       ` Lan, Tianyu
2015-12-02 14:08         ` [Qemu-devel] " Lan, Tianyu
2015-12-02 14:31         ` Michael S. Tsirkin
2015-12-02 14:31           ` [Qemu-devel] " Michael S. Tsirkin
2015-12-03 14:53           ` Lan, Tianyu
2015-12-03 14:53             ` [Qemu-devel] " Lan, Tianyu
2015-12-04  6:42           ` Lan, Tianyu
2015-12-04  6:42             ` [Qemu-devel] " Lan, Tianyu
2015-12-04  8:05             ` Michael S. Tsirkin
2015-12-04  8:05               ` [Qemu-devel] " Michael S. Tsirkin
2015-12-04 12:11               ` Lan, Tianyu
2015-12-04 12:11                 ` [Qemu-devel] " Lan, Tianyu
2015-12-03 18:32         ` Alexander Duyck
2015-12-03 18:32           ` [Qemu-devel] " Alexander Duyck
2015-12-07 16:50 ` live migration vs device assignment (was Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC) Michael S. Tsirkin
2015-12-07 16:50   ` [Qemu-devel] " Michael S. Tsirkin
2015-12-09 16:26   ` live migration vs device assignment (motivation) Lan, Tianyu
2015-12-09 16:26     ` [Qemu-devel] " Lan, Tianyu
2015-12-09 17:14     ` Alexander Duyck
2015-12-09 17:14       ` [Qemu-devel] " Alexander Duyck
2015-12-10  3:15       ` Lan, Tianyu
2015-12-10  3:15         ` [Qemu-devel] " Lan, Tianyu
2015-12-09 20:07     ` Michael S. Tsirkin
2015-12-09 20:07       ` [Qemu-devel] " Michael S. Tsirkin
2015-12-10  3:04       ` Lan, Tianyu
2015-12-10  3:04         ` [Qemu-devel] " Lan, Tianyu
2015-12-10  8:38         ` Michael S. Tsirkin
2015-12-10  8:38           ` [Qemu-devel] " Michael S. Tsirkin
2015-12-10 14:23           ` Lan, Tianyu
2015-12-10 14:23             ` [Qemu-devel] " Lan, Tianyu
2015-12-10 10:18     ` Dr. David Alan Gilbert
2015-12-10 10:18       ` Dr. David Alan Gilbert
2015-12-10 11:28       ` Yang Zhang
2015-12-10 11:28         ` Yang Zhang
2015-12-10 11:41         ` Dr. David Alan Gilbert
2015-12-10 11:41           ` Dr. David Alan Gilbert
2015-12-10 13:07           ` Yang Zhang
2015-12-10 13:07             ` Yang Zhang
2015-12-10 14:38           ` Lan, Tianyu
2015-12-10 14:38             ` [Qemu-devel] " Lan, Tianyu
2015-12-10 16:11             ` Michael S. Tsirkin
2015-12-10 16:11               ` Michael S. Tsirkin
2015-12-10 19:17               ` Alexander Duyck
2015-12-10 19:17                 ` Alexander Duyck
2015-12-11  7:32               ` Lan, Tianyu
2015-12-11  7:32                 ` Lan, Tianyu
2015-12-14  9:12                 ` Michael S. Tsirkin
2015-12-14  9:12                   ` Michael S. Tsirkin
2015-12-10 16:23             ` Dr. David Alan Gilbert
2015-12-10 16:23               ` Dr. David Alan Gilbert
2015-12-10 17:16             ` Alexander Duyck
2015-12-10 17:16               ` Alexander Duyck
2015-12-13 15:47               ` Lan, Tianyu
2015-12-13 15:47                 ` Lan, Tianyu
2015-12-13 19:30                 ` Alexander Duyck
2015-12-13 19:30                   ` Alexander Duyck
2015-12-25  7:03                   ` Lan Tianyu
2015-12-25  7:03                     ` [Qemu-devel] " Lan Tianyu
2015-12-25 12:11                     ` Michael S. Tsirkin
2015-12-25 12:11                       ` Michael S. Tsirkin
2015-12-28 17:42                       ` Lan, Tianyu
2015-12-28 17:42                         ` Lan, Tianyu
2015-12-29 16:46                         ` Michael S. Tsirkin
2015-12-29 16:46                           ` Michael S. Tsirkin
2015-12-29 17:04                           ` Alexander Duyck
2015-12-29 17:04                             ` Alexander Duyck
2015-12-29 17:15                             ` Michael S. Tsirkin
2015-12-29 17:15                               ` [Qemu-devel] " Michael S. Tsirkin
2015-12-29 18:04                               ` Alexander Duyck
2015-12-29 18:04                                 ` Alexander Duyck
2016-01-04  2:15                           ` Lan Tianyu
2016-01-04  2:15                             ` Lan Tianyu
2015-12-25 22:31                     ` Alexander Duyck
2015-12-25 22:31                       ` Alexander Duyck
2015-12-27  9:21                       ` Michael S. Tsirkin
2015-12-27  9:21                         ` [Qemu-devel] " Michael S. Tsirkin
2015-12-27 21:45                         ` Alexander Duyck
2015-12-27 21:45                           ` Alexander Duyck
2015-12-28  8:51                           ` Michael S. Tsirkin
2015-12-28  8:51                             ` Michael S. Tsirkin
2015-12-28  3:20                       ` Dong, Eddie
2015-12-28  3:20                         ` Dong, Eddie
2015-12-28  4:26                         ` Alexander Duyck
2015-12-28  4:26                           ` [Qemu-devel] " Alexander Duyck
2015-12-28 11:50                         ` Michael S. Tsirkin
2015-12-28 11:50                           ` Michael S. Tsirkin
2015-12-14  9:26                 ` Michael S. Tsirkin
2015-12-14  9:26                   ` Michael S. Tsirkin
2015-12-28  8:52                   ` Pavel Fedin
2015-12-28  8:52                     ` Pavel Fedin
2015-12-28 11:51                     ` Michael S. Tsirkin
2015-12-28 11:51                       ` Michael S. Tsirkin
2016-03-17  9:15 ` [Qemu-devel] [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC Wei Yang
2016-03-17  9:15   ` Wei Yang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.