netdev.vger.kernel.org archive mirror
* [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
@ 2015-10-21 16:37 Lan Tianyu
  2015-10-21 16:37 ` [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device Lan Tianyu
                   ` (14 more replies)
  0 siblings, 15 replies; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

This patchset proposes a new solution for adding live migration support
for the 82599 SR-IOV network card.

In our solution, we prefer to put all device-specific operations into
the VF and PF drivers and keep the code in Qemu more general.


VF status migration
=================================================================
VF status can be divided into 4 parts:
1) PCI configuration regs
2) MSI-X configuration
3) VF status in the PF driver
4) VF MMIO regs

The first three parts are handled by Qemu.
The PCI configuration space regs and MSI-X configuration are already
stored in Qemu. To let Qemu save and restore the "VF status in the
PF driver" during migration, a new sysfs node "state_in_pf" is added
under the VF sysfs directory.
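As a rough sketch of the userspace side (the helper names, the path
handling, and the single-shot read/write are my assumptions for
illustration, not code from this series), Qemu could treat the node as
one opaque blob, read on the source and written back on the destination:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helpers for the management process (Qemu). The real
 * node lives under the VF's PCI sysfs directory; the blob layout is
 * struct state_in_pf and is opaque to userspace. */
static ssize_t save_state_in_pf(const char *path, void *buf, size_t len)
{
	int fd = open(path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, len);		/* whole struct in one read */
	close(fd);
	return n;
}

static ssize_t restore_state_in_pf(const char *path, const void *buf,
				   size_t len)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, buf, len);	/* size must match the struct */
	close(fd);
	return n;
}
```

On the destination the blob would be written back before the VF is
notified to resume.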

For the VF MMIO regs, we introduce a self-emulation layer in the VF
driver that records MMIO reg values as they are read or written and
keeps the data in guest memory, so it is migrated to the new machine
along with the rest of guest memory.
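A minimal userspace model of the recording idea (the BAR size and
word-granularity indexing are illustrative; the real layer wraps the
driver's readl()/writel() accessors, as patch 06 does):

```c
#include <stdint.h>

#define BAR_LEN 0x4000			/* illustrative VF BAR size */

static uint32_t mmio[BAR_LEN / 4];	/* stands in for the MMIO BAR */
static uint32_t shadow[BAR_LEN / 4];	/* record kept in guest memory */

/* Every register read goes through the layer and is recorded. */
static uint32_t self_emul_readl(uint32_t addr)
{
	uint32_t val = mmio[addr / 4];	/* readl(base + addr) in the driver */

	shadow[addr / 4] = val;
	return val;
}

/* Every register write is recorded before being forwarded. */
static void self_emul_writel(uint32_t val, uint32_t addr)
{
	shadow[addr / 4] = val;
	mmio[addr / 4] = val;		/* writel(val, base + addr) */
}
```

Because the shadow copy lives in ordinary guest memory, the normal
dirty-page mechanism migrates it for free.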


VF function restoration
================================================================
Restoring VF function is done in the VF and PF drivers.
 
In order to let the VF driver know the migration status, Qemu fakes
the VF PCI configuration regs to indicate the migration status and
adds a new sysfs node "notify_vf" that triggers the VF mailbox irq
to notify the VF of migration status changes.

The transmit/receive descriptor head regs are read-only, so they
can't be restored by writing back the recorded reg values, and they
are reset to 0 during VF reset. To reuse the original tx/rx rings,
the descriptor rings are shifted so that the descriptor pointed to
by the original head reg becomes the first entry of the ring, and
the tx/rx rings are then enabled. The VF resumes receiving and
transmitting from the original head descriptor.
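The ring shift can be sketched in userspace like this (int entries
stand in for the real tx/rx descriptors; count and head values are
illustrative):

```c
#include <string.h>

/* Rotate a descriptor ring in place so that the entry the old
 * (read-only) head register pointed to becomes entry 0.  After VF
 * reset the head register reads 0 again, so enabling the ring makes
 * the VF resume exactly at the old head descriptor. */
static void shift_ring(int *ring, int count, int old_head)
{
	int tmp[count];
	int i;

	for (i = 0; i < count; i++)
		tmp[i] = ring[(old_head + i) % count];
	memcpy(ring, tmp, count * sizeof(*ring));
}
```

The same rotation applies to both tx and rx rings before they are
re-enabled on the destination.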


Tracking DMA accessed memory
=================================================================
Migration relies on dirty page tracking to migrate memory, but
hardware can't automatically mark a page as dirty after a DMA
access, and the VF descriptor rings and data buffers are modified
by hardware when receiving and transmitting data. To track such
dirty memory manually, we do dummy writes (read a byte and write
it back) when receiving and transmitting data.
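A userspace model of the dummy-write trick (the dirty bitmap here
stands in for the hypervisor's page logging; the page size, buffer
layout, and helper names are illustrative):

```c
#include <stddef.h>

#define PAGE_SHIFT	12
#define NPAGES		2

static unsigned char dma_buf[NPAGES << PAGE_SHIFT];
static int dirty[NPAGES];	/* stand-in for the hypervisor bitmap */

/* CPU stores are what migration's dirty logging sees; model that by
 * marking the page on every CPU write. */
static void cpu_write(unsigned char *p, unsigned char v)
{
	*p = v;
	dirty[(p - dma_buf) >> PAGE_SHIFT] = 1;
}

/* Device DMA bypasses the CPU and leaves no dirty mark, so after
 * receiving/transmitting, read a byte of each touched page and write
 * it back: the page gets logged without changing its contents. */
static void mark_dma_range_dirty(unsigned char *p, size_t len)
{
	size_t off;

	for (off = 0; off < len; off += (size_t)1 << PAGE_SHIFT)
		cpu_write(p + off, p[off]);	/* dummy write */
	cpu_write(p + len - 1, p[len - 1]);	/* cover the last page */
}
```

In the driver this runs over the descriptor rings and packet buffers
in the rx/tx paths while migration is in progress.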


Service down time test
=================================================================
So far, we have tested migration between two laptops with 82599 NICs
connected to a gigabit switch, pinging the VF at a 0.001s interval
from the source host during migration. The service downtime is about
180ms.

[983769928.053604] 64 bytes from 10.239.48.100: icmp_seq=4131 ttl=64 time=2.79 ms
[983769928.056422] 64 bytes from 10.239.48.100: icmp_seq=4132 ttl=64 time=2.79 ms
[983769928.059241] 64 bytes from 10.239.48.100: icmp_seq=4133 ttl=64 time=2.79 ms
[983769928.062071] 64 bytes from 10.239.48.100: icmp_seq=4134 ttl=64 time=2.80 ms
[983769928.064890] 64 bytes from 10.239.48.100: icmp_seq=4135 ttl=64 time=2.79 ms
[983769928.067716] 64 bytes from 10.239.48.100: icmp_seq=4136 ttl=64 time=2.79 ms
[983769928.070538] 64 bytes from 10.239.48.100: icmp_seq=4137 ttl=64 time=2.79 ms
[983769928.073360] 64 bytes from 10.239.48.100: icmp_seq=4138 ttl=64 time=2.79 ms
[983769928.083444] no answer yet for icmp_seq=4139
[983769928.093524] no answer yet for icmp_seq=4140
[983769928.103602] no answer yet for icmp_seq=4141
[983769928.113684] no answer yet for icmp_seq=4142
[983769928.123763] no answer yet for icmp_seq=4143
[983769928.133854] no answer yet for icmp_seq=4144
[983769928.143931] no answer yet for icmp_seq=4145
[983769928.154008] no answer yet for icmp_seq=4146
[983769928.164084] no answer yet for icmp_seq=4147
[983769928.174160] no answer yet for icmp_seq=4148
[983769928.184236] no answer yet for icmp_seq=4149
[983769928.194313] no answer yet for icmp_seq=4150
[983769928.204390] no answer yet for icmp_seq=4151
[983769928.214468] no answer yet for icmp_seq=4152
[983769928.224556] no answer yet for icmp_seq=4153
[983769928.234632] no answer yet for icmp_seq=4154
[983769928.244709] no answer yet for icmp_seq=4155
[983769928.254783] no answer yet for icmp_seq=4156
[983769928.256094] 64 bytes from 10.239.48.100: icmp_seq=4139 ttl=64 time=182 ms
[983769928.256107] 64 bytes from 10.239.48.100: icmp_seq=4140 ttl=64 time=172 ms
[983769928.256114] no answer yet for icmp_seq=4157
[983769928.256236] 64 bytes from 10.239.48.100: icmp_seq=4141 ttl=64 time=162 ms
[983769928.256245] 64 bytes from 10.239.48.100: icmp_seq=4142 ttl=64 time=152 ms
[983769928.256272] 64 bytes from 10.239.48.100: icmp_seq=4143 ttl=64 time=142 ms
[983769928.256310] 64 bytes from 10.239.48.100: icmp_seq=4144 ttl=64 time=132 ms
[983769928.256325] 64 bytes from 10.239.48.100: icmp_seq=4145 ttl=64 time=122 ms
[983769928.256332] 64 bytes from 10.239.48.100: icmp_seq=4146 ttl=64 time=112 ms
[983769928.256440] 64 bytes from 10.239.48.100: icmp_seq=4147 ttl=64 time=102 ms
[983769928.256455] 64 bytes from 10.239.48.100: icmp_seq=4148 ttl=64 time=92.3 ms
[983769928.256494] 64 bytes from 10.239.48.100: icmp_seq=4149 ttl=64 time=82.3 ms
[983769928.256503] 64 bytes from 10.239.48.100: icmp_seq=4150 ttl=64 time=72.2 ms
[983769928.256631] 64 bytes from 10.239.48.100: icmp_seq=4158 ttl=64 time=0.500 ms
[983769928.257284] 64 bytes from 10.239.48.100: icmp_seq=4159 ttl=64 time=0.154 ms
[983769928.258297] 64 bytes from 10.239.48.100: icmp_seq=4160 ttl=64 time=0.165 ms

Todo
=======================================================
So far, the patchset isn't complete. The VF net interface can't be
opened, closed, or brought down and up during migration. Preventing
such operations during migration is future work.

Your comments are very much appreciated.


Lan Tianyu (12):
  PCI: Add virtfn_index for struct pci_device
  IXGBE: Add new mail box event to restore VF status in the PF driver
  IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF
    driver
  IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.
  IXGBE: Add new sysfs interface of "notify_vf"
  IXGBEVF: Add self emulation layer
  IXGBEVF: Add new mail box event for migration
  IXGBEVF: Rework code of finding the end transmit desc of package
  IXGBEVF: Add live migration support for VF driver
  IXGBEVF: Add lock to protect tx/rx ring operation
  IXGBEVF: Migrate VF statistic data
  IXGBEVF: Track dma dirty pages

 drivers/net/ethernet/intel/ixgbe/ixgbe.h           |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h       |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c     | 245 ++++++++++++++++++++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h     |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h      |   4 +
 drivers/net/ethernet/intel/ixgbevf/Makefile        |   3 +-
 drivers/net/ethernet/intel/ixgbevf/defines.h       |   6 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h       |  10 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 179 ++++++++++++++-
 drivers/net/ethernet/intel/ixgbevf/mbx.h           |   3 +
 .../net/ethernet/intel/ixgbevf/self-emulation.c    | 133 +++++++++++
 drivers/net/ethernet/intel/ixgbevf/vf.c            |  10 +
 drivers/net/ethernet/intel/ixgbevf/vf.h            |   6 +-
 drivers/pci/iov.c                                  |   1 +
 include/linux/pci.h                                |   1 +
 15 files changed, 582 insertions(+), 22 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

-- 
1.8.4.rc0.1.g8f6a3e5.dirty

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 18:07   ` Alexander Duyck
  2015-10-21 16:37 ` [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver Lan Tianyu
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

Add a "virtfn_index" member to struct pci_dev to record the VF's
sequence number within its PF. This will be used by the VF sysfs
node handlers.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/pci/iov.c   | 1 +
 include/linux/pci.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	virtfn->physfn = pci_dev_get(dev);
 	virtfn->is_virtfn = 1;
 	virtfn->multifunction = 0;
+	virtfn->virtfn_index = id;
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = &dev->resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
 	unsigned int	io_window_1k:1;	/* Intel P2P bridge 1K I/O windows */
 	unsigned int	irq_managed:1;
 	pci_dev_flags_t dev_flags;
+	unsigned int	virtfn_index;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
 
 	u32		saved_config_space[16]; /* config space saved at suspend time */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


* [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
  2015-10-21 16:37 ` [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 20:34   ` Alexander Duyck
  2015-10-21 16:37 ` [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate " Lan Tianyu
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

This patch restores the VF status in the PF driver when the
corresponding event is received from the VF.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h       |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++++++++++++++++++++++++++
 3 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 636f9e3..9d5669a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -148,6 +148,7 @@ struct vf_data_storage {
 	bool pf_set_mac;
 	u16 pf_vlan; /* When set, guest VLAN config not allowed. */
 	u16 pf_qos;
+	u32 vf_lpe;
 	u16 tx_rate;
 	u16 vlan_count;
 	u8 spoofchk_enabled;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
index b1e4703..8fdb38d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
@@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev {
 
 /* mailbox API, version 1.1 VF requests */
 #define IXGBE_VF_GET_QUEUES	0x09 /* get queue configuration */
+#define IXGBE_VF_NOTIFY_RESUME    0x0c /* VF notify PF migration finishing */
 
 /* GET_QUEUES return data indices within the mailbox */
 #define IXGBE_VF_TX_QUEUES	1	/* number of Tx queues supported */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 1d17b58..ab2a2e2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter *adapter, u32 vf,
 	}
 }
 
+/**
+ * ixgbe_restore_setting - restore VF settings via mailbox after migration
+ **/
+void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	u32 reg, reg_offset, vf_shift;
+	int rar_entry = hw->mac.num_rar_entries - (vf + 1);
+
+	vf_shift = vf % 32;
+	reg_offset = vf / 32;
+
+	/* enable transmit and receive for vf */
+	reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+	reg |= (1 << vf_shift);
+	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+	reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+	reg |= (1 << vf_shift);
+	IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+	reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+	reg |= (1 << vf_shift);
+	IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+	ixgbe_vf_reset_event(adapter, vf);
+
+	hw->mac.ops.set_rar(hw, rar_entry,
+			    adapter->vfinfo[vf].vf_mac_addresses,
+			    vf, IXGBE_RAH_AV);
+
+
+	if (adapter->vfinfo[vf].vf_lpe)
+		ixgbe_set_vf_lpe(adapter, &adapter->vfinfo[vf].vf_lpe, vf);
+}
+
 static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf)
 {
 	struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
@@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
 		break;
 	case IXGBE_VF_SET_LPE:
 		retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf);
+		adapter->vfinfo[vf].vf_lpe = *msgbuf;
 		break;
 	case IXGBE_VF_SET_MACVLAN:
 		retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf);
@@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
 	case IXGBE_VF_GET_RSS_KEY:
 		retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf);
 		break;
+	case IXGBE_VF_NOTIFY_RESUME:
+		ixgbe_restore_setting(adapter, vf);
+		break;
 	default:
 		e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]);
 		retval = IXGBE_ERR_MBX;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


* [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
  2015-10-21 16:37 ` [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device Lan Tianyu
  2015-10-21 16:37 ` [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 20:45   ` Alexander Duyck
  2015-10-21 16:37 ` [RFC Patch 04/12] IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg Lan Tianyu
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

This patch adds a sysfs interface "state_in_pf" under the sysfs
directory of the VF PCI device, for Qemu to get and put the VF
status held in the PF driver during migration.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 ++++++++++++++++++++++++-
 1 file changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index ab2a2e2..89671eb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
 	return -ENOMEM;
 }
 
+#define IXGBE_PCI_VFCOMMAND   0x4
+#define IXGBE_PCI_VFMSIXMC    0x72
+#define IXGBE_SRIOV_VF_OFFSET 0x180
+#define IXGBE_SRIOV_VF_STRIDE 0x2
+
+#define to_adapter(dev) ((struct ixgbe_adapter *)(pci_get_drvdata(to_pci_dev(dev)->physfn)))
+
+struct state_in_pf {
+	u16 command;
+	u16 msix_message_control;
+	struct vf_data_storage vf_data;
+};
+
+static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn)
+{
+	u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * vfn;
+	return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff);
+}
+
+static ssize_t ixgbe_show_state_in_pf(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	struct ixgbe_adapter *adapter = to_adapter(dev);
+	struct pci_dev *pdev = adapter->pdev, *vdev;
+	struct pci_dev *vf_pdev = to_pci_dev(dev);
+	struct ixgbe_hw *hw = &adapter->hw;
+	struct state_in_pf *state = (struct state_in_pf *)buf;
+	int vfn = vf_pdev->virtfn_index;
+	u32 reg, reg_offset, vf_shift;
+
+	/* Clear VF mac and disable VF */
+	ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, vfn);
+
+	/* Record PCI configurations */
+	vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+	if (vdev) {
+		pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, &state->command);
+		pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, &state->msix_message_control);
+	}
+	else
+		printk(KERN_WARNING "Unable to find VF device.\n");
+
+	/* Record states hold by PF */
+	memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct vf_data_storage));
+
+	vf_shift = vfn % 32;
+	reg_offset = vfn / 32;
+
+	reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+	reg &= ~(1 << vf_shift);
+	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+	reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+	reg &= ~(1 << vf_shift);
+	IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+	reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+	reg &= ~(1 << vf_shift);
+	IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+	return sizeof(struct state_in_pf);
+}
+
+static ssize_t ixgbe_store_state_in_pf(struct device *dev,
+				       struct device_attribute *attr,
+				       const char *buf, size_t count)
+{
+	struct ixgbe_adapter *adapter = to_adapter(dev);
+	struct pci_dev *pdev = adapter->pdev, *vdev;
+	struct pci_dev *vf_pdev = to_pci_dev(dev);
+	struct state_in_pf *state = (struct state_in_pf *)buf;
+	int vfn = vf_pdev->virtfn_index;
+
+	/* Check struct size */
+	if (count != sizeof(struct state_in_pf)) {
+		printk(KERN_ERR "State in PF size does not fit.\n");
+		goto out;
+	}
+
+	/* Restore PCI configurations */
+	vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+	if (vdev) {
+		pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, state->command);
+		pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, state->msix_message_control);
+	}
+
+	/* Restore states hold by PF */
+	memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct vf_data_storage));
+
+  out:
+	return count;
+}
+
+static struct device_attribute ixgbe_per_state_in_pf_attribute =
+	__ATTR(state_in_pf, S_IRUGO | S_IWUSR,
+		ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
+
+void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
+{
+	struct pci_dev *pdev = adapter->pdev;
+	struct pci_dev *vfdev;
+	unsigned short vf_id;
+	int pos, ret;
+
+	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+	if (!pos)
+		return;
+
+	/* get the device ID for the VF */
+	pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
+
+	vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
+
+	while (vfdev) {
+		if (vfdev->is_virtfn) {
+			ret = device_create_file(&vfdev->dev,
+					&ixgbe_per_state_in_pf_attribute);
+			if (ret)
+				pr_warn("Unable to add VF attribute for dev %s,\n",
+					dev_name(&vfdev->dev));
+		}
+
+		vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);
+	}
+}
+
+void ixgbe_remove_vf_attrib(struct ixgbe_adapter *adapter)
+{
+	struct pci_dev *pdev = adapter->pdev;
+	struct pci_dev *vfdev;
+	unsigned short vf_id;
+	int pos;
+
+	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+	if (!pos)
+		return;
+
+	/* get the device ID for the VF */
+	pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
+
+	vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
+
+	while (vfdev) {
+		if (vfdev->is_virtfn) {
+			device_remove_file(&vfdev->dev, &ixgbe_per_state_in_pf_attribute);
+		}
+
+		vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);
+	}
+}
+
 /* Note this function is called when the user wants to enable SR-IOV
  * VFs using the now deprecated module parameter
  */
@@ -198,6 +349,9 @@ int ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
 	if (!(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED))
 		return 0;
 
+
+	ixgbe_remove_vf_attrib(adapter);
+
 #ifdef CONFIG_PCI_IOV
 	/*
 	 * If our VFs are assigned we cannot shut down SR-IOV
@@ -284,7 +438,7 @@ static int ixgbe_pci_sriov_enable(struct pci_dev *dev, int num_vfs)
 		return err;
 	}
 	ixgbe_sriov_reinit(adapter);
-
+	ixgbe_add_vf_attrib(adapter);
 	return num_vfs;
 #else
 	return 0;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


* [RFC Patch 04/12] IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (2 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate " Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 16:37 ` [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf" Lan Tianyu
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

This patch adds ixgbe_ping_vf() to notify a specified VF via a
mailbox msg. When the migration status changes, the VF must be
notified; the VF driver checks the migration status when it
receives the mailbox msg.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 19 ++++++++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |  1 +
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 89671eb..e247d67 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -1318,18 +1318,23 @@ void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter)
 	IXGBE_WRITE_REG(hw, IXGBE_VFRE(1), 0);
 }
 
-void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
 	u32 ping;
+
+	ping = IXGBE_PF_CONTROL_MSG;
+	if (adapter->vfinfo[vfn].clear_to_send)
+		ping |= IXGBE_VT_MSGTYPE_CTS;
+	ixgbe_write_mbx(hw, &ping, 1, vfn);
+}
+
+void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+{
 	int i;
 
-	for (i = 0 ; i < adapter->num_vfs; i++) {
-		ping = IXGBE_PF_CONTROL_MSG;
-		if (adapter->vfinfo[i].clear_to_send)
-			ping |= IXGBE_VT_MSGTYPE_CTS;
-		ixgbe_write_mbx(hw, &ping, 1, i);
-	}
+	for (i = 0 ; i < adapter->num_vfs; i++)
+		ixgbe_ping_vf(adapter, i);
 }
 
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
index 2c197e6..143e2fd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
@@ -41,6 +41,7 @@ void ixgbe_msg_task(struct ixgbe_adapter *adapter);
 int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask);
 void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter);
 void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter);
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn);
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int queue, u8 *mac);
 int ixgbe_ndo_set_vf_vlan(struct net_device *netdev, int queue, u16 vlan,
 			   u8 qos);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


* [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (3 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 04/12] IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 20:52   ` Alexander Duyck
  2015-10-21 16:37 ` [RFC Patch 06/12] IXGBEVF: Add self emulation layer Lan Tianyu
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

This patch adds a new sysfs interface "notify_vf" under the sysfs
directory of the VF PCI device, for Qemu to notify the VF when the
migration status changes.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 30 ++++++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  4 ++++
 2 files changed, 34 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index e247d67..5cc7817 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -217,10 +217,37 @@ static ssize_t ixgbe_store_state_in_pf(struct device *dev,
 	return count;
 }
 
+static ssize_t ixgbe_store_notify_vf(struct device *dev,
+				       struct device_attribute *attr,
+				       const char *buf, size_t count)
+{
+	struct ixgbe_adapter *adapter = to_adapter(dev);
+	struct ixgbe_hw *hw = &adapter->hw;
+	struct pci_dev *vf_pdev = to_pci_dev(dev);
+	int vfn = vf_pdev->virtfn_index;
+	u32 ivar;
+
+	/* Enable VF mailbox irq first */
+	IXGBE_WRITE_REG(hw, IXGBE_PVTEIMS(vfn), 0x4);
+	IXGBE_WRITE_REG(hw, IXGBE_PVTEIAM(vfn), 0x4);
+	IXGBE_WRITE_REG(hw, IXGBE_PVTEIAC(vfn), 0x4);
+
+	ivar = IXGBE_READ_REG(hw, IXGBE_PVTIVAR_MISC(vfn));
+	ivar &= ~0xFF;
+	ivar |= 0x2 | IXGBE_IVAR_ALLOC_VAL;
+	IXGBE_WRITE_REG(hw, IXGBE_PVTIVAR_MISC(vfn), ivar);
+
+	ixgbe_ping_vf(adapter, vfn);
+	return count;
+}
+
 static struct device_attribute ixgbe_per_state_in_pf_attribute =
 	__ATTR(state_in_pf, S_IRUGO | S_IWUSR,
 		ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
 
+static struct device_attribute ixgbe_per_notify_vf_attribute =
+	__ATTR(notify_vf, S_IWUSR, NULL, ixgbe_store_notify_vf);
+
 void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
 {
 	struct pci_dev *pdev = adapter->pdev;
@@ -241,6 +268,8 @@ void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
 		if (vfdev->is_virtfn) {
 			ret = device_create_file(&vfdev->dev,
 					&ixgbe_per_state_in_pf_attribute);
+			ret |= device_create_file(&vfdev->dev,
+					&ixgbe_per_notify_vf_attribute);
 			if (ret)
 				pr_warn("Unable to add VF attribute for dev %s,\n",
 					dev_name(&vfdev->dev));
@@ -269,6 +298,7 @@ void ixgbe_remove_vf_attrib(struct ixgbe_adapter *adapter)
 	while (vfdev) {
 		if (vfdev->is_virtfn) {
 			device_remove_file(&vfdev->dev, &ixgbe_per_state_in_pf_attribute);
+			device_remove_file(&vfdev->dev, &ixgbe_per_notify_vf_attribute);
 		}
 
 		vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index dd6ba59..c6ddb66 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -2302,6 +2302,10 @@ enum {
 #define IXGBE_PVFTDT(P)		(0x06018 + (0x40 * (P)))
 #define IXGBE_PVFTDWBAL(P)	(0x06038 + (0x40 * (P)))
 #define IXGBE_PVFTDWBAH(P)	(0x0603C + (0x40 * (P)))
+#define IXGBE_PVTEIMS(P)	(0x00D00 + (4 * (P)))
+#define IXGBE_PVTIVAR_MISC(P)	(0x04E00 + (4 * (P)))
+#define IXGBE_PVTEIAC(P)       (0x00F00 + (4 * P))
+#define IXGBE_PVTEIAM(P)       (0x04D00 + (4 * P))
 
 #define IXGBE_PVFTDWBALn(q_per_pool, vf_number, vf_q_index) \
 		(IXGBE_PVFTDWBAL((q_per_pool)*(vf_number) + (vf_q_index)))
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


* [RFC Patch 06/12] IXGBEVF: Add self emulation layer
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (4 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf" Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 20:58   ` Alexander Duyck
  2015-10-21 16:37 ` [RFC Patch 07/12] IXGBEVF: Add new mail box event for migration Lan Tianyu
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

In order to restore VF function after migration, add a self-emulation
layer that records register values as the registers are accessed.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/Makefile        |  3 ++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
 .../net/ethernet/intel/ixgbevf/self-emulation.c    | 26 ++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbevf/vf.h            |  5 ++++-
 4 files changed, 33 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile
index 4ce4c97..841c884 100644
--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
@@ -31,7 +31,8 @@
 
 obj-$(CONFIG_IXGBEVF) += ixgbevf.o
 
-ixgbevf-objs := vf.o \
+ixgbevf-objs := self-emulation.o \
+		vf.o \
                 mbx.o \
                 ethtool.o \
                 ixgbevf_main.o
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..4446916 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
 
 	if (IXGBE_REMOVED(reg_addr))
 		return IXGBE_FAILED_READ_REG;
-	value = readl(reg_addr + reg);
+	value = ixgbe_self_emul_readl(reg_addr, reg);
 	if (unlikely(value == IXGBE_FAILED_READ_REG))
 		ixgbevf_check_remove(hw, reg);
 	return value;
diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
new file mode 100644
index 0000000..d74b2da
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
@@ -0,0 +1,26 @@
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/interrupt.h>
+#include <net/arp.h>
+
+#include "vf.h"
+#include "ixgbevf.h"
+
+static u32 hw_regs[0x4000];
+
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
+{
+	u32 tmp;
+
+	tmp = readl(base + addr);
+	hw_regs[(unsigned long)addr] = tmp;
+
+	return tmp;
+}
+
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
+{
+	hw_regs[(unsigned long)addr] = val;
+	writel(val, (volatile void __iomem *)(base + addr));
+}
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h
index d40f036..6a3f4eb 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -39,6 +39,9 @@
 
 struct ixgbe_hw;
 
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr);
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr);
+
 /* iterator type for walking multicast address lists */
 typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr,
 				  u32 *vmdq);
@@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 reg, u32 value)
 
 	if (IXGBE_REMOVED(reg_addr))
 		return;
-	writel(value, reg_addr + reg);
+	ixgbe_self_emul_writel(value, reg_addr, reg);
 }
 
 #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC Patch 07/12] IXGBEVF: Add new mail box event for migration
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (5 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 06/12] IXGBEVF: Add self emulation layer Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 16:37 ` [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package Lan Tianyu
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

VF status in the PF driver needs to be restored after migrating and
resetting the VF hardware. This patch adds a new mailbox event that the
VF driver uses to notify the PF driver to restore that status.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/mbx.h |  3 +++
 drivers/net/ethernet/intel/ixgbevf/vf.c  | 10 ++++++++++
 drivers/net/ethernet/intel/ixgbevf/vf.h  |  1 +
 3 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/mbx.h b/drivers/net/ethernet/intel/ixgbevf/mbx.h
index 82f44e0..22761d8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/mbx.h
+++ b/drivers/net/ethernet/intel/ixgbevf/mbx.h
@@ -112,6 +112,9 @@ enum ixgbe_pfvf_api_rev {
 #define IXGBE_VF_GET_RETA	0x0a	/* VF request for RETA */
 #define IXGBE_VF_GET_RSS_KEY	0x0b	/* get RSS hash key */
 
+/* mail box event for live migration  */
+#define IXGBE_VF_NOTIFY_RESUME  0x0c /* VF notify PF migration to restore status */
+
 /* length of permanent address message returned from PF */
 #define IXGBE_VF_PERMADDR_MSG_LEN	4
 /* word in permanent address message with the current multicast type */
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c b/drivers/net/ethernet/intel/ixgbevf/vf.c
index d1339b0..1e4e5e6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.c
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.c
@@ -717,6 +717,15 @@ int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int *num_tcs,
 	return err;
 }
 
+static void ixgbevf_notify_resume_vf(struct ixgbe_hw *hw)
+{
+	struct ixgbe_mbx_info *mbx = &hw->mbx;
+	u32 msgbuf[1];
+
+	msgbuf[0] = IXGBE_VF_NOTIFY_RESUME;
+	mbx->ops.write_posted(hw, msgbuf, 1);
+}
+
 static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
 	.init_hw		= ixgbevf_init_hw_vf,
 	.reset_hw		= ixgbevf_reset_hw_vf,
@@ -729,6 +738,7 @@ static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
 	.update_mc_addr_list	= ixgbevf_update_mc_addr_list_vf,
 	.set_uc_addr		= ixgbevf_set_uc_addr_vf,
 	.set_vfta		= ixgbevf_set_vfta_vf,
+	.notify_resume		= ixgbevf_notify_resume_vf,
 };
 
 const struct ixgbevf_info ixgbevf_82599_vf_info = {
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h
index 6a3f4eb..a25fe81 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -70,6 +70,7 @@ struct ixgbe_mac_operations {
 	s32 (*disable_mc)(struct ixgbe_hw *);
 	s32 (*clear_vfta)(struct ixgbe_hw *);
 	s32 (*set_vfta)(struct ixgbe_hw *, u32, u32, bool);
+	void (*notify_resume)(struct ixgbe_hw *); 
 };
 
 enum ixgbe_mac_type {
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


* [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (6 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 07/12] IXGBEVF: Add new mail box event for migration Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 21:14   ` Alexander Duyck
  2015-10-22 12:58   ` Michael S. Tsirkin
  2015-10-21 16:37 ` [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver Lan Tianyu
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

When transmitting a packet, the end transmit desc of the packet
indicates whether the packet has been sent. The current code records
a pointer to that end desc in the next_to_watch field of the tx
buffer struct. This breaks if the desc ring is shifted after
migration, because the pointer becomes invalid. This patch replaces
the recorded pointer with the desc count of the packet and finds the
end desc from the first desc and that count.
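The replacement bookkeeping is plain modular arithmetic over the ring, sketched below with illustrative helper names (not the driver's). Storing the distance instead of a pointer keeps the end-of-packet descriptor recoverable even if the whole ring is rotated after migration.

```c
/* number of descriptors from the packet's first slot to its last,
 * accounting for wraparound at the end of the ring */
static unsigned int desc_num(unsigned int first, unsigned int last,
			     unsigned int count)
{
	return (last >= first) ? last - first : last + count - first;
}

/* end-of-packet index recovered from the first slot and desc_num */
static unsigned int eop_index(unsigned int first, unsigned int num,
			      unsigned int count)
{
	return (first + num) % count;
}
```

Because both helpers work modulo the ring size, shifting every index by the same amount (as the migration path does) leaves the relationship between first descriptor and end descriptor intact.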

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |  1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ++++++++++++++++---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..c823616 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -54,6 +54,7 @@
  */
 struct ixgbevf_tx_buffer {
 	union ixgbe_adv_tx_desc *next_to_watch;
+	u16 desc_num;
 	unsigned long time_stamp;
 	struct sk_buff *skb;
 	unsigned int bytecount;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 4446916..056841c 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct ixgbevf_ring *tx_ring,
 			       DMA_TO_DEVICE);
 	}
 	tx_buffer->next_to_watch = NULL;
+	tx_buffer->desc_num = 0;
 	tx_buffer->skb = NULL;
 	dma_unmap_len_set(tx_buffer, len, 0);
 	/* tx_buffer must be completely set up in the transmit path */
@@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 	union ixgbe_adv_tx_desc *tx_desc;
 	unsigned int total_bytes = 0, total_packets = 0;
 	unsigned int budget = tx_ring->count / 2;
-	unsigned int i = tx_ring->next_to_clean;
+	int i, watch_index;
 
 	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
 		return true;
@@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 	i -= tx_ring->count;
 
 	do {
-		union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
+		union ixgbe_adv_tx_desc *eop_desc;
+
+		if (!tx_buffer->desc_num)
+			break;
+
+		if (i + tx_buffer->desc_num >= 0)
+			watch_index = i + tx_buffer->desc_num;
+		else
+			watch_index = i + tx_ring->count + tx_buffer->desc_num;
 
-		/* if next_to_watch is not set then there is no work pending */
+		eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
 		if (!eop_desc)
 			break;
 
@@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 
 		/* clear next_to_watch to prevent false hangs */
 		tx_buffer->next_to_watch = NULL;
+		tx_buffer->desc_num = 0;
 
 		/* update the statistics for this packet */
 		total_bytes += tx_buffer->bytecount;
@@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 	u32 tx_flags = first->tx_flags;
 	__le32 cmd_type;
 	u16 i = tx_ring->next_to_use;
+	u16 start;
 
 	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
 
@@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 
 	/* set next_to_watch value indicating a packet is present */
 	first->next_to_watch = tx_desc;
+	start = first - tx_ring->tx_buffer_info;
+	first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - start;
 
 	i++;
 	if (i == tx_ring->count)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


* [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (7 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 21:48   ` Alexander Duyck
  2015-10-22 12:46   ` Michael S. Tsirkin
  2015-10-21 16:37 ` [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation Lan Tianyu
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

To let the VF driver in the guest know the migration status, Qemu
fakes PCI config regs 0xF0 and 0xF1 to expose the migration status
and to collect an ack from the VF driver.

When migration starts, Qemu sets reg "0xF0" to 1, notifies the
VF driver by triggering a mailbox msg and waits for the VF driver to
report that it's ready for migration (by setting reg "0xF1" to 1).
After migration, Qemu sets reg "0xF0" to 0 and notifies the VF driver
by mailbox irq. The VF driver begins to restore tx/rx function after
detecting the status change.

When the VF receives the mailbox irq, it checks reg "0xF0" in the
service task function to get the migration status and performs the
related operations according to its value.
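The handshake described above amounts to a small two-state machine. The sketch below is a userspace model with illustrative names: reg_f0 is what Qemu wrote to config reg 0xF0, *reg_f1 is the ack byte at 0xF1, and the function returns 1 while migration is in progress (so the caller can skip its normal work, as the patch's service task does).

```c
#include <stdint.h>

enum mig_state { MIG_COMPLETED, MIG_IN_PROGRESS };
static enum mig_state state = MIG_COMPLETED;

static int service_migration(uint8_t reg_f0, uint8_t *reg_f1)
{
	if (state == MIG_COMPLETED) {
		if (!reg_f0)
			return 0;        /* no migration pending */
		state = MIG_IN_PROGRESS;
		*reg_f1 = 1;             /* ack: VF is ready for migration */
		return 1;
	}
	if (reg_f0)
		return 1;                /* migration still in progress */
	/* 0xF0 went back to 0: restore tx/rx state would run here */
	state = MIG_COMPLETED;
	return 0;
}
```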

Steps of restarting receive and transmit function
1) Restore VF status in the PF driver by sending a mailbox event to the PF driver
2) Write back reg values recorded by the self-emulation layer
3) Restart rx/tx rings
4) Re-enable interrupts

Transmit/receive descriptor head regs are read-only, so they can't
be restored by writing back the recorded reg values, and they are
reset to 0 during VF reset. To reuse the original tx/rx rings, shift
the desc ring so that the desc pointed to by the original head reg
becomes the first entry of the ring, then enable the tx/rx rings. The
VF resumes receiving and transmitting from the original head desc.
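The ring shift described above is an array rotation plus a matching shift of the software indexes. A minimal userspace sketch (illustrative names, uint32_t entries standing in for descriptors):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* rotate the ring so the entry at `head` (the old hardware head)
 * becomes entry 0, since the head register reads back as 0 after a
 * VF reset; returns 0 on success, -1 on allocation failure */
static int ring_shift(uint32_t *ring, unsigned int count, unsigned int head)
{
	uint32_t *tmp = malloc(count * sizeof(*ring));

	if (!tmp)
		return -1;
	memcpy(tmp, ring, count * sizeof(*ring));
	memcpy(ring, tmp + head, (count - head) * sizeof(*ring));
	memcpy(ring + (count - head), tmp, head * sizeof(*ring));
	free(tmp);
	return 0;
}

/* a software index (next_to_use/next_to_clean) after the rotation */
static unsigned int shift_index(unsigned int idx, unsigned int head,
				unsigned int count)
{
	return (idx >= head) ? idx - head : idx + count - head;
}
```

Every index shifts by the same amount as the data, so producer/consumer relationships inside the ring are preserved while the hardware restarts from entry 0.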

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/defines.h       |   6 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h       |   7 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 ++++++++++++++++++++-
 .../net/ethernet/intel/ixgbevf/self-emulation.c    | 107 +++++++++++++++++++
 4 files changed, 232 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 770e21a..113efd2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
 	__le32 mss_l4len_idx;
 };
 
+union ixgbevf_desc {
+	union ixgbe_adv_tx_desc rx_desc;
+	union ixgbe_adv_rx_desc tx_desc;
+	struct ixgbe_adv_tx_context_desc tx_context_desc;
+};
+
 /* Adv Transmit Descriptor Config Masks */
 #define IXGBE_ADVTXD_DTYP_MASK	0x00F00000 /* DTYP mask */
 #define IXGBE_ADVTXD_DTYP_CTXT	0x00200000 /* Advanced Context Desc */
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index c823616..6eab402e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -109,7 +109,7 @@ struct ixgbevf_ring {
 	struct ixgbevf_ring *next;
 	struct net_device *netdev;
 	struct device *dev;
-	void *desc;			/* descriptor ring memory */
+	union ixgbevf_desc *desc;	/* descriptor ring memory */
 	dma_addr_t dma;			/* phys. address of descriptor ring */
 	unsigned int size;		/* length in bytes */
 	u16 count;			/* amount of descriptors */
@@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector);
 
 void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
 void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
+int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
+void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
+inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
+
 
 #ifdef DEBUG
 char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 056841c..15ec361 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_VERSION);
 
+
+#define MIGRATION_COMPLETED   0x00
+#define MIGRATION_IN_PROGRESS 0x01
+
 #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
 static int debug = -1;
 module_param(debug, int, 0);
@@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)
 	return ring->stats.packets;
 }
 
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
+{
+	struct ixgbevf_tx_buffer *tx_buffer = NULL;
+	static union ixgbevf_desc *tx_desc = NULL;
+
+	tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
+	if (!tx_buffer)
+		return -ENOMEM;
+
+	tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count);
+	if (!tx_desc)
+		return -ENOMEM;
+
+	memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
+	memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
+	memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
+
+	memcpy(tx_buffer, r->tx_buffer_info, sizeof(struct ixgbevf_tx_buffer) * r->count);
+	memcpy(r->tx_buffer_info, &tx_buffer[head], sizeof(struct ixgbevf_tx_buffer) * (r->count - head));
+	memcpy(&r->tx_buffer_info[r->count - head], tx_buffer, sizeof(struct ixgbevf_tx_buffer) * head);
+
+	if (r->next_to_clean >= head)
+		r->next_to_clean -= head;
+	else
+		r->next_to_clean += (r->count - head);
+
+	if (r->next_to_use >= head)
+		r->next_to_use -= head;
+	else
+		r->next_to_use += (r->count - head);
+
+	vfree(tx_buffer);
+	vfree(tx_desc);
+	return 0;
+}
+
+int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
+{
+	struct ixgbevf_rx_buffer *rx_buffer = NULL;
+	static union ixgbevf_desc *rx_desc = NULL;
+
+	rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
+	if (!rx_buffer)
+		return -ENOMEM;
+
+	rx_desc = vmalloc(sizeof(union ixgbevf_desc) * (r->count));
+	if (!rx_desc)
+		return -ENOMEM;
+
+	memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
+	memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
+	memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head);
+
+	memcpy(rx_buffer, r->rx_buffer_info, sizeof(struct ixgbevf_rx_buffer) * (r->count));
+	memcpy(r->rx_buffer_info, &rx_buffer[head], sizeof(struct ixgbevf_rx_buffer) * (r->count - head));
+	memcpy(&r->rx_buffer_info[r->count - head], rx_buffer, sizeof(struct ixgbevf_rx_buffer) * head);
+
+	if (r->next_to_clean >= head)
+		r->next_to_clean -= head;
+	else
+		r->next_to_clean += (r->count - head);
+
+	if (r->next_to_use >= head)
+		r->next_to_use -= head;
+	else
+		r->next_to_use += (r->count - head);
+
+	vfree(rx_buffer);
+	vfree(rx_desc);
+	return 0;
+}
+
 static u32 ixgbevf_get_tx_pending(struct ixgbevf_ring *ring)
 {
 	struct ixgbevf_adapter *adapter = netdev_priv(ring->netdev);
@@ -1122,7 +1198,7 @@ static int ixgbevf_busy_poll_recv(struct napi_struct *napi)
  * ixgbevf_configure_msix sets up the hardware to properly generate MSI-X
  * interrupts.
  **/
-static void ixgbevf_configure_msix(struct ixgbevf_adapter *adapter)
+static  void ixgbevf_configure_msix(struct ixgbevf_adapter *adapter)
 {
 	struct ixgbevf_q_vector *q_vector;
 	int q_vectors, v_idx;
@@ -1534,7 +1610,7 @@ static inline void ixgbevf_irq_disable(struct ixgbevf_adapter *adapter)
  * ixgbevf_irq_enable - Enable default interrupt generation settings
  * @adapter: board private structure
  **/
-static inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter)
+inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
 
@@ -2901,6 +2977,36 @@ static void ixgbevf_watchdog_subtask(struct ixgbevf_adapter *adapter)
 	ixgbevf_update_stats(adapter);
 }
 
+int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
+{
+	struct pci_dev *pdev = adapter->pdev;
+ 	static int migration_status = MIGRATION_COMPLETED;
+	u8 val;
+
+	if (migration_status == MIGRATION_COMPLETED) {
+		pci_read_config_byte(pdev, 0xf0, &val);
+		if (!val)
+			return 0;
+
+		del_timer_sync(&adapter->service_timer);
+		pr_info("migration start\n");
+		migration_status = MIGRATION_IN_PROGRESS; 
+
+		/* Tell Qemu VF is ready for migration. */
+		pci_write_config_byte(pdev, 0xf1, 0x1);
+		return 1;
+	} else {
+		pci_read_config_byte(pdev, 0xf0, &val);
+		if (val)
+			return 1;
+
+		ixgbevf_restore_state(adapter);
+		migration_status = MIGRATION_COMPLETED;
+		pr_info("migration end\n");
+		return 0;
+	}
+}
+
 /**
  * ixgbevf_service_task - manages and runs subtasks
  * @work: pointer to work_struct containing our data
@@ -2912,6 +3018,11 @@ static void ixgbevf_service_task(struct work_struct *work)
 						       service_task);
 	struct ixgbe_hw *hw = &adapter->hw;
 
+	if (ixgbevf_live_mg(adapter)) {
+		ixgbevf_service_event_complete(adapter);
+		return;
+	}
+
 	if (IXGBE_REMOVED(hw->hw_addr)) {
 		if (!test_bit(__IXGBEVF_DOWN, &adapter->state)) {
 			rtnl_lock();
diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
index d74b2da..4476428 100644
--- a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
@@ -9,6 +9,8 @@
 
 static u32 hw_regs[0x4000];
 
+#define RESTORE_REG(hw, reg) IXGBE_WRITE_REG(hw, reg, hw_regs[reg])
+
 u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
 {
 	u32 tmp;
@@ -24,3 +26,108 @@ void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
 	hw_regs[(unsigned long)addr] = val;
 	writel(val, (volatile void __iomem *)(base + addr));
 }
+
+static u32 restore_regs[] = {
+	IXGBE_VTIVAR(0),
+	IXGBE_VTIVAR(1),
+	IXGBE_VTIVAR(2),
+	IXGBE_VTIVAR(3),
+	IXGBE_VTIVAR_MISC,
+	IXGBE_VTEITR(0),
+	IXGBE_VTEITR(1),
+	IXGBE_VFPSRTYPE,
+};
+
+void ixgbevf_restore_state(struct ixgbevf_adapter *adapter)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	struct ixgbe_mbx_info *mbx = &hw->mbx;
+	int i;
+	u32 timeout = IXGBE_VF_INIT_TIMEOUT, rdh, tdh, 	rxdctl, txdctl;
+	u32 wait_loop = 10;
+
+	/* VF resetting */
+	IXGBE_WRITE_REG(hw, IXGBE_VFCTRL, IXGBE_CTRL_RST);
+	IXGBE_WRITE_FLUSH(hw);
+
+	while (!mbx->ops.check_for_rst(hw) && timeout) {
+		timeout--;
+		udelay(5);
+	}
+	if (!timeout)
+		printk(KERN_ERR "[IXGBEVF] Unable to reset VF.\n");
+
+	/* Restoring VF status in the status */
+	hw->mac.ops.notify_resume(hw);
+
+	/* Restoring regs value */
+	for (i = 0; i < sizeof(restore_regs)/sizeof(u32); i++)
+		writel(hw_regs[restore_regs[i]], (volatile void *)(restore_regs[i] + hw->hw_addr));
+
+	/* Restoring rx ring */
+	for (i = 0; i < adapter->num_rx_queues; i++) {
+		if (hw_regs[IXGBE_VFRXDCTL(i)] & IXGBE_RXDCTL_ENABLE) {
+			RESTORE_REG(hw, IXGBE_VFRDBAL(i));
+			RESTORE_REG(hw, IXGBE_VFRDBAH(i));
+			RESTORE_REG(hw, IXGBE_VFRDLEN(i));
+			RESTORE_REG(hw, IXGBE_VFDCA_RXCTRL(i));
+			RESTORE_REG(hw, IXGBE_VFSRRCTL(i));
+
+			rdh = adapter->rx_ring[i]->next_to_clean;
+			while (IXGBEVF_RX_DESC(adapter->rx_ring[i], rdh)->wb.upper.status_error
+			       & cpu_to_le32(IXGBE_RXD_STAT_DD))
+				rdh = (rdh + 1) % adapter->rx_ring[i]->count;
+
+			ixgbevf_rx_ring_shift(adapter->rx_ring[i], rdh);
+
+			wait_loop = 10;
+			RESTORE_REG(hw, IXGBE_VFRXDCTL(i));
+			do {
+				udelay(10);
+				rxdctl = IXGBE_READ_REG(hw, IXGBE_VFRXDCTL(i));
+			} while (--wait_loop && !(rxdctl & IXGBE_RXDCTL_ENABLE));
+
+			if (!wait_loop)
+				pr_err("RXDCTL.ENABLE queue %d not cleared while polling\n",
+				       i);
+
+			IXGBE_WRITE_REG(hw, IXGBE_VFRDT(i), adapter->rx_ring[i]->next_to_use);
+		}
+	}
+
+	/* Restoring tx ring */
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		if (hw_regs[IXGBE_VFTXDCTL(i)] & IXGBE_TXDCTL_ENABLE) {
+			RESTORE_REG(hw, IXGBE_VFTDBAL(i));
+			RESTORE_REG(hw, IXGBE_VFTDBAH(i));
+			RESTORE_REG(hw, IXGBE_VFTDLEN(i));
+			RESTORE_REG(hw, IXGBE_VFDCA_TXCTRL(i));
+
+			tdh = adapter->tx_ring[i]->next_to_clean;
+			while (IXGBEVF_TX_DESC(adapter->tx_ring[i], tdh)->wb.status
+			       & cpu_to_le32(IXGBE_TXD_STAT_DD))
+				tdh = (tdh + 1) % adapter->rx_ring[i]->count;
+			ixgbevf_tx_ring_shift(adapter->tx_ring[i], tdh);
+
+			wait_loop = 10;
+			RESTORE_REG(hw, IXGBE_VFTXDCTL(i));
+			do {
+				udelay(2000);
+				txdctl = IXGBE_READ_REG(hw, IXGBE_VFTXDCTL(i));
+			} while (--wait_loop && !(txdctl & IXGBE_TXDCTL_ENABLE));
+
+			if (!wait_loop)
+				pr_err("Could not enable Tx Queue %d\n", i);
+	
+			IXGBE_WRITE_REG(hw, IXGBE_VFTDT(i), adapter->tx_ring[i]->next_to_use);
+		}
+	}
+
+	/* Restore irq */
+	IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, hw_regs[IXGBE_VTEIMS] & 0x7);
+	IXGBE_WRITE_REG(hw, IXGBE_VTEIMC, (~hw_regs[IXGBE_VTEIMS]) & 0x7);
+	IXGBE_WRITE_REG(hw, IXGBE_VTEICS, hw_regs[IXGBE_VTEICS]);
+
+	ixgbevf_irq_enable(adapter);
+}
+
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


* [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (8 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-21 21:55   ` Alexander Duyck
  2015-10-22 12:40   ` Michael S. Tsirkin
  2015-10-21 16:37 ` [RFC Patch 11/12] IXGBEVF: Migrate VF statistic data Lan Tianyu
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

Ring shifting during VF function restoration may race with normal
ring operations (transmitting/receiving packets). This patch adds
tx/rx locks to protect ring-related data.
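The race being guarded against can be modeled in userspace: the migration path rotates the ring while the datapath advances its indexes, and both must take the same per-direction lock so the index updates and the memcpy-based rotation cannot interleave. The names below are illustrative, not the driver's.

```c
#include <pthread.h>

struct mig_ring {
	pthread_mutex_t lock;     /* the mg_tx_lock/mg_rx_lock analogue */
	unsigned int next_to_use;
	unsigned int count;
};

/* datapath side: advance the producer index under the lock, as
 * ixgbevf_tx_map does in the patch */
static void *tx_path(void *arg)
{
	struct mig_ring *r = arg;
	int i;

	for (i = 0; i < 100000; i++) {
		pthread_mutex_lock(&r->lock);
		r->next_to_use = (r->next_to_use + 1) % r->count;
		pthread_mutex_unlock(&r->lock);
	}
	return NULL;
}
```

The migration-time ring shift would take the same mutex around its memcpy sequence; without it, a rotation landing between the datapath's read and write of next_to_use would corrupt the ring state.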

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |  2 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 ++++++++++++++++++++---
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 6eab402e..3a748c8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -448,6 +448,8 @@ struct ixgbevf_adapter {
 
 	spinlock_t mbx_lock;
 	unsigned long last_reset;
+	spinlock_t mg_rx_lock;
+	spinlock_t mg_tx_lock;
 };
 
 enum ixbgevf_state_t {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 15ec361..04b6ce7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)
 
 int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 {
+	struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
 	struct ixgbevf_tx_buffer *tx_buffer = NULL;
 	static union ixgbevf_desc *tx_desc = NULL;
+	unsigned long flags;
 
 	tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
 	if (!tx_buffer)
@@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 	if (!tx_desc)
 		return -ENOMEM;
 
+	spin_lock_irqsave(&adapter->mg_tx_lock, flags);
 	memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
 	memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
 	memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
@@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 	else
 		r->next_to_use += (r->count - head);
 
+	spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
+
 	vfree(tx_buffer);
 	vfree(tx_desc);
 	return 0;
@@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 
 int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
 {
+	struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
 	struct ixgbevf_rx_buffer *rx_buffer = NULL;
 	static union ixgbevf_desc *rx_desc = NULL;
+	unsigned long flags;	
 
 	rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
 	if (!rx_buffer)
@@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
 	if (!rx_desc)
 		return -ENOMEM;
 
+	spin_lock_irqsave(&adapter->mg_rx_lock, flags);
 	memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
 	memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
 	memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head);
@@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
 		r->next_to_use -= head;
 	else
 		r->next_to_use += (r->count - head);
+	spin_unlock_irqrestore(&adapter->mg_rx_lock, flags);
 
 	vfree(rx_buffer);
 	vfree(rx_desc);
@@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
 		return true;
 
+	spin_lock(&adapter->mg_tx_lock);
+	i = tx_ring->next_to_clean;
 	tx_buffer = &tx_ring->tx_buffer_info[i];
 	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
 	i -= tx_ring->count;
@@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 	q_vector->tx.total_bytes += total_bytes;
 	q_vector->tx.total_packets += total_packets;
 
+	spin_unlock(&adapter->mg_tx_lock);
+
 	if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) {
 		struct ixgbe_hw *hw = &adapter->hw;
 		union ixgbe_adv_tx_desc *eop_desc;
@@ -999,10 +1012,12 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
 				struct ixgbevf_ring *rx_ring,
 				int budget)
 {
+	struct ixgbevf_adapter *adapter = netdev_priv(rx_ring->netdev);
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	u16 cleaned_count = ixgbevf_desc_unused(rx_ring);
 	struct sk_buff *skb = rx_ring->skb;
 
+	spin_lock(&adapter->mg_rx_lock);
 	while (likely(total_rx_packets < budget)) {
 		union ixgbe_adv_rx_desc *rx_desc;
 
@@ -1078,6 +1093,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
 	q_vector->rx.total_packets += total_rx_packets;
 	q_vector->rx.total_bytes += total_rx_bytes;
 
+	spin_unlock(&adapter->mg_rx_lock);
 	return total_rx_packets;
 }
 
@@ -3572,14 +3588,17 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 	struct ixgbevf_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
 	struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
+	struct ixgbevf_adapter *adapter = netdev_priv(tx_ring->netdev);
 	unsigned int data_len = skb->data_len;
 	unsigned int size = skb_headlen(skb);
 	unsigned int paylen = skb->len - hdr_len;
+	unsigned long flags;
 	u32 tx_flags = first->tx_flags;
 	__le32 cmd_type;
-	u16 i = tx_ring->next_to_use;
-	u16 start;
+	u16 i, start;     
 
+	spin_lock_irqsave(&adapter->mg_tx_lock, flags);
+	i = tx_ring->next_to_use;
 	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
 
 	ixgbevf_tx_olinfo_status(tx_desc, tx_flags, paylen);
@@ -3673,7 +3692,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 
 	/* notify HW of packet */
 	ixgbevf_write_tail(tx_ring, i);
-
+	spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
 	return;
 dma_error:
 	dev_err(tx_ring->dev, "TX DMA map failed\n");
@@ -3690,6 +3709,7 @@ dma_error:
 	}
 
 	tx_ring->next_to_use = i;
+	spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
 }
 
 static int __ixgbevf_maybe_stop_tx(struct ixgbevf_ring *tx_ring, int size)
@@ -4188,6 +4208,8 @@ static int ixgbevf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		break;
 	}
 
+	spin_lock_init(&adapter->mg_tx_lock);
+	spin_lock_init(&adapter->mg_rx_lock);
 	return 0;
 
 err_register:
-- 
1.8.4.rc0.1.g8f6a3e5.dirty


* [RFC Patch 11/12] IXGBEVF: Migrate VF statistic data
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (9 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-22 12:36   ` Michael S. Tsirkin
  2015-10-21 16:37 ` [RFC Patch 12/12] IXGBEVF: Track dma dirty pages Lan Tianyu
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

VF statistic regs are read-only and can't be migrated by writing back
their values directly.

Currently, the statistic data returned to user space by the driver is
not simply the value of the statistic regs. The VF driver records the
reg values as base data when the net interface is brought up, computes
the increase in the regs over the last period of online service and
adds it to the saved_reset data. When user space collects statistic
data, the VF driver returns "current - base + saved_reset", where
"current" is the reg value at that point.

Restoring net function after migration is just like bringing the net
interface up. Call the existing functions to update the base and
saved_reset data so that the statistic data stays continuous across
migration.
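The "current - base + saved_reset" bookkeeping can be written out as a one-liner, sketched here with illustrative names. Doing the subtraction in 32 bits means a hardware counter that wrapped since `base` was sampled still yields the right delta, which is the same trick the driver's counter-update macros rely on.

```c
#include <stdint.h>

/* value reported to user space for one statistic */
static uint64_t vf_stat(uint32_t current, uint32_t base, uint64_t saved_reset)
{
	/* 32-bit subtraction handles counter wraparound */
	return saved_reset + (uint32_t)(current - base);
}
```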

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 04b6ce7..d22160f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3005,6 +3005,7 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
 			return 0;
 
 		del_timer_sync(&adapter->service_timer);
+		ixgbevf_update_stats(adapter);
 		pr_info("migration start\n");
 		migration_status = MIGRATION_IN_PROGRESS; 
 
@@ -3017,6 +3018,8 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
 			return 1;
 
 		ixgbevf_restore_state(adapter);
+		ixgbevf_save_reset_stats(adapter);
+		ixgbevf_init_last_counter_stats(adapter);
 		migration_status = MIGRATION_COMPLETED;
 		pr_info("migration end\n");
 		return 0;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
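[Editorial note: the base/saved_reset bookkeeping described in the commit
message can be sketched in plain C. This is an illustration of the scheme
only, not the driver's actual data structures; the struct and function
names here are hypothetical.]

```c
#include <stdint.h>

/* Hypothetical per-counter bookkeeping mirroring the scheme the commit
 * message describes (not the real ixgbevf structures). */
struct stat_counter {
	uint64_t base;        /* register value when the interface came up */
	uint64_t saved_reset; /* total accumulated before the last rebase */
};

/* What user space sees: "current - base + saved_reset". */
static uint64_t stat_read(const struct stat_counter *c, uint32_t current_reg)
{
	return (uint64_t)current_reg - c->base + c->saved_reset;
}

/* On interface up/open -- and, per this patch, on resume after migration:
 * fold the last period into saved_reset and rebase on the new (possibly
 * freshly reset) register value, so the reported total never jumps. */
static void stat_rebase(struct stat_counter *c, uint32_t last_reg,
			uint32_t new_reg)
{
	c->saved_reset += last_reg - c->base;
	c->base = new_reg;
}
```

For example, counting 100 packets, migrating to a host where the VF's
registers start at zero, then counting 50 more still reads 150 in total.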


* [RFC Patch 12/12] IXGBEVF: Track dma dirty pages
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (10 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 11/12] IXGBEVF: Migrate VF statistic data Lan Tianyu
@ 2015-10-21 16:37 ` Lan Tianyu
  2015-10-22 12:30   ` Michael S. Tsirkin
  2015-10-21 18:45 ` [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Or Gerlitz
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-21 16:37 UTC (permalink / raw)
  To: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson
  Cc: Lan Tianyu

Migration relies on dirty-page tracking to migrate memory, but hardware
cannot automatically mark a page as dirty after a DMA access. VF
descriptor rings and data buffers are modified by hardware while
receiving and transmitting data. To track such dirty memory manually,
perform dummy writes (read a byte and write it back) during receive and
transmit processing.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index d22160f..ce7bd7a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -414,6 +414,9 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 		if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
 			break;
 
+		/* write back status to mark page dirty */
+		eop_desc->wb.status = eop_desc->wb.status;
+
 		/* clear next_to_watch to prevent false hangs */
 		tx_buffer->next_to_watch = NULL;
 		tx_buffer->desc_num = 0;
@@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct ixgbevf_ring *rx_ring,
 {
 	struct ixgbevf_rx_buffer *rx_buffer;
 	struct page *page;
+	u8 *page_addr;
 
 	rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
 	page = rx_buffer->page;
 	prefetchw(page);
 
-	if (likely(!skb)) {
-		void *page_addr = page_address(page) +
-				  rx_buffer->page_offset;
+	/* Mark page dirty */
+	page_addr = page_address(page) + rx_buffer->page_offset;
+	*page_addr = *page_addr;
 
+	if (likely(!skb)) {
 		/* prefetch first cache line of first page */
 		prefetch(page_addr);
 #if L1_CACHE_BYTES < 128
@@ -1032,6 +1037,9 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
 		if (!ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
 			break;
 
+		/* Write back status to mark page dirty */
+		rx_desc->wb.upper.status_error = rx_desc->wb.upper.status_error;
+
 		/* This memory barrier is needed to keep us from reading
 		 * any other fields out of the rx_desc until we know the
 		 * RXD_STAT_DD bit is set
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
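[Editorial note: the dummy-write trick used throughout this patch can be
shown in isolation. This is a userspace C sketch, not kernel code; the
helper name is hypothetical. Note the volatile qualifier: a compiler is
otherwise free to elide an apparently useless self-assignment, which is
one hazard of writing the trick as a plain `x = x` statement as the patch
does.]

```c
/* Editorial sketch of the "dummy write" above: touch a byte of the page
 * from the CPU (read it and write the same value back) so that the
 * hypervisor's dirty-page logging marks the page dirty, even though the
 * payload was actually written by the device via DMA. */
static void mark_page_dirty_cpu(void *buf)
{
	volatile unsigned char *p = buf;

	*p = *p;	/* CPU store; data is unchanged */
}
```

Here `mark_page_dirty_cpu(page_address(page) + offset)` would correspond
to the `*page_addr = *page_addr;` lines added in the patch.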



* Re: [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device
  2015-10-21 16:37 ` [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device Lan Tianyu
@ 2015-10-21 18:07   ` Alexander Duyck
  2015-10-24 14:46     ` Lan, Tianyu
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Duyck @ 2015-10-21 18:07 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> Add "virtfn_index" member in the struct pci_device to record VF sequence
> of PF. This will be used in the VF sysfs node handle.
>
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>   drivers/pci/iov.c   | 1 +
>   include/linux/pci.h | 1 +
>   2 files changed, 2 insertions(+)
>
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index ee0ebff..065b6bb 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>   	virtfn->physfn = pci_dev_get(dev);
>   	virtfn->is_virtfn = 1;
>   	virtfn->multifunction = 0;
> +	virtfn->virtfn_index = id;
>   
>   	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>   		res = &dev->resource[i + PCI_IOV_RESOURCES];
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 353db8d..85c5531 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -356,6 +356,7 @@ struct pci_dev {
>   	unsigned int	io_window_1k:1;	/* Intel P2P bridge 1K I/O windows */
>   	unsigned int	irq_managed:1;
>   	pci_dev_flags_t dev_flags;
> +	unsigned int	virtfn_index;
>   	atomic_t	enable_cnt;	/* pci_enable_device has been called */
>   
>   	u32		saved_config_space[16]; /* config space saved at suspend time */
>

Can't you just calculate the VF index based on the VF BDF number 
combined with the information in the PF BDF number and VF 
offset/stride?  Seems kind of pointless to add a variable that is only 
used by one driver and is in a slowpath when you can just calculate it 
pretty quickly.

- Alex
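[Editorial note: the calculation Alex describes follows from the SR-IOV
capability's First VF Offset and VF Stride fields: VF n's routing ID is
the PF's routing ID plus offset plus n times stride, so the index can be
recovered arithmetically. A minimal sketch, where the helper name is
hypothetical and the 82599 values (offset 0x180, stride 0x2, as
hard-coded elsewhere in this series) are used for illustration:]

```c
#include <stdint.h>

/* Hypothetical helper: recover a VF's index from routing IDs alone.
 * Per the SR-IOV spec, VF n's routing ID (bus << 8 | devfn) is:
 *     vf_rid = pf_rid + first_vf_offset + n * vf_stride
 * so no extra per-device field is needed to find n. */
static int vf_index_from_rid(uint16_t pf_rid, uint16_t vf_rid,
			     uint16_t first_vf_offset, uint16_t vf_stride)
{
	uint16_t delta = vf_rid - pf_rid - first_vf_offset;

	if (vf_stride == 0 || delta % vf_stride != 0)
		return -1;	/* not a VF of this PF */
	return delta / vf_stride;
}
```

With the 82599 values used in patch 03 (offset 0x180, stride 0x2), a PF
at routing ID 0x0300 (03:00.0) would have VF 0 at 0x0480 and VF 5 at
0x048a.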


* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (11 preceding siblings ...)
  2015-10-21 16:37 ` [RFC Patch 12/12] IXGBEVF: Track dma dirty pages Lan Tianyu
@ 2015-10-21 18:45 ` Or Gerlitz
  2015-10-21 19:20   ` Alex Williamson
  2015-10-22 12:55 ` [Qemu-devel] " Michael S. Tsirkin
  2015-10-23 18:36 ` Alexander Duyck
  14 siblings, 1 reply; 56+ messages in thread
From: Or Gerlitz @ 2015-10-21 18:45 UTC (permalink / raw)
  To: Lan Tianyu, Michael S. Tsirkin <mst@redhat.com> (mst@redhat.com)
  Cc: bhelgaas, carolyn.wyborny, Skidmore, Donald C, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, Paolo Bonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, Jeff Kirsher, Jesse Brandeburg,
	john.ronciak, Linux Kernel, linux-pci, matthew.vick,
	Mitch Williams, Linux Netdev List, Shannon Nelson

On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu <tianyu.lan@intel.com> wrote:
> This patchset is to propose a new solution to add live migration support
> for 82599 SRIOV network card.

> In our solution, we prefer to put all device specific operation into VF and
> PF driver and make code in the Qemu more general.

[...]

> Service down time test
> So far, we tested migration between two laptops with 82599 NICs
> connected to a gigabit switch, pinging the VF at a 0.001 s interval
> from the source host during migration. The service downtime is
> about 180 ms.

So... what would you expect service down wise for the following
solution which is zero touch and I think should work for any VF
driver:

on host A: unplug the VM and conduct live migration to host B ala the
no-SRIOV case.

on host B:

when the VM "gets back to live", probe a VF there with the same assigned mac

next, udev on the VM will call the VF driver to create netdev instance

DHCP client would run to get the same IP address

+ under config directive (or from Qemu) send Gratuitous ARP to notify
the switch/es on the new location for that mac.

Or.


* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-21 18:45 ` [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Or Gerlitz
@ 2015-10-21 19:20   ` Alex Williamson
  2015-10-21 23:26     ` Alexander Duyck
                       ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Alex Williamson @ 2015-10-21 19:20 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Lan Tianyu,
	Michael S. Tsirkin <mst@redhat.com> (mst@redhat.com),
	bhelgaas, carolyn.wyborny, Skidmore, Donald C, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, Paolo Bonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, Jeff Kirsher, Jesse Brandeburg,
	john.ronciak, Linux Kernel, linux-pci, matthew.vick,
	Mitch Williams, Linux Netdev List, Shannon Nelson

On Wed, 2015-10-21 at 21:45 +0300, Or Gerlitz wrote:
> On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu <tianyu.lan@intel.com> wrote:
> > This patchset is to propose a new solution to add live migration support
> > for 82599 SRIOV network card.
> 
> > In our solution, we prefer to put all device specific operation into VF and
> > PF driver and make code in the Qemu more general.
> 
> [...]
> 
> > Service down time test
> > So far, we tested migration between two laptops with 82599 NICs
> > connected to a gigabit switch, pinging the VF at a 0.001 s interval
> > from the source host during migration. The service downtime is
> > about 180 ms.
> 
> So... what would you expect service down wise for the following
> solution which is zero touch and I think should work for any VF
> driver:
> 
> on host A: unplug the VM and conduct live migration to host B ala the
> no-SRIOV case.

The trouble here is that the VF needs to be unplugged prior to the start
of migration because we can't do effective dirty page tracking while the
device is connected and doing DMA.  So the downtime, assuming we're
counting only VF connectivity, is dependent on memory size, rate of
dirtying, and network bandwidth; seconds for small guests, minutes or
more (maybe much, much more) for large guests.

This is why the typical VF-agnostic approach here is to use bonding
and fail over to an emulated device during migration, so performance
suffers, but downtime stays acceptable.

If we want the ability to defer the VF unplug until just before the
final stages of the migration, we need the VF to participate in dirty
page tracking.  Here it's done via an enlightened guest driver.  Alex
Graf presented a solution using a device specific enlightenment in QEMU.
Otherwise we'd need hardware support from the IOMMU.  Thanks,

Alex

> on host B:
> 
> when the VM "gets back to live", probe a VF there with the same assigned mac
> 
> next, udev on the VM will call the VF driver to create netdev instance
> 
> DHCP client would run to get the same IP address
> 
> + under config directive (or from Qemu) send Gratuitous ARP to notify
> the switch/es on the new location for that mac.
> 
> Or.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver
  2015-10-21 16:37 ` [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver Lan Tianyu
@ 2015-10-21 20:34   ` Alexander Duyck
  0 siblings, 0 replies; 56+ messages in thread
From: Alexander Duyck @ 2015-10-21 20:34 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> This patch is to restore VF status in the PF driver when get event
> from VF.
>
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>   drivers/net/ethernet/intel/ixgbe/ixgbe.h       |  1 +
>   drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |  1 +
>   drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++++++++++++++++++++++++++
>   3 files changed, 42 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index 636f9e3..9d5669a 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -148,6 +148,7 @@ struct vf_data_storage {
>   	bool pf_set_mac;
>   	u16 pf_vlan; /* When set, guest VLAN config not allowed. */
>   	u16 pf_qos;
> +	u32 vf_lpe;
>   	u16 tx_rate;
>   	u16 vlan_count;
>   	u8 spoofchk_enabled;
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
> index b1e4703..8fdb38d 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
> @@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev {
>
>   /* mailbox API, version 1.1 VF requests */
>   #define IXGBE_VF_GET_QUEUES	0x09 /* get queue configuration */
> +#define IXGBE_VF_NOTIFY_RESUME    0x0c /* VF notify PF migration finishing */
>
>   /* GET_QUEUES return data indices within the mailbox */
>   #define IXGBE_VF_TX_QUEUES	1	/* number of Tx queues supported */
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> index 1d17b58..ab2a2e2 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> @@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter *adapter, u32 vf,
>   	}
>   }
>
> +/**
> + *  Restore the settings by mailbox, after migration
> + **/
> +void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf)
> +{
> +	struct ixgbe_hw *hw = &adapter->hw;
> +	u32 reg, reg_offset, vf_shift;
> +	int rar_entry = hw->mac.num_rar_entries - (vf + 1);
> +
> +	vf_shift = vf % 32;
> +	reg_offset = vf / 32;
> +
> +	/* enable transmit and receive for vf */
> +	reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
> +	reg |= (1 << vf_shift);
> +	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
> +
> +	reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
> +	reg |= (1 << vf_shift);
> +	IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
> +

This is just blanket enabling Rx and Tx.  I don't see how this can be 
valid.  It seems like it would result in memory corruption for the guest 
if you are enabling Rx on a device that is not ready.  A perfect example 
is if the guest is not configured to handle jumbo frames and the PF has 
jumbo frames enabled.

> +	reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
> +	reg |= (1 << vf_shift);
> +	IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);

This assumes that the anti-spoof is enabled.  That may not be the case.

> +	ixgbe_vf_reset_event(adapter, vf);
> +
> +	hw->mac.ops.set_rar(hw, rar_entry,
> +			    adapter->vfinfo[vf].vf_mac_addresses,
> +			    vf, IXGBE_RAH_AV);
> +
> +
> +	if (adapter->vfinfo[vf].vf_lpe)
> +		ixgbe_set_vf_lpe(adapter, &adapter->vfinfo[vf].vf_lpe, vf);
> +}
> +

The function ixgbe_set_vf_lpe also enables the receive; you should take 
a look at it.  On the 82598 you cannot just arbitrarily enable Rx, as 
there is a risk of corrupting guest memory or causing a kernel panic.

>   static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf)
>   {
>   	struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
> @@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
>   		break;
>   	case IXGBE_VF_SET_LPE:
>   		retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf);
> +		adapter->vfinfo[vf].vf_lpe = *msgbuf;
>   		break;

Why not just leave this for the VF to notify us of via a reset?  It 
seems like if the VF is migrated it should start with the CTS bits of 
the mailbox cleared, as though the PF driver has been reloaded.

>   	case IXGBE_VF_SET_MACVLAN:
>   		retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf);
> @@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
>   	case IXGBE_VF_GET_RSS_KEY:
>   		retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf);
>   		break;
> +	case IXGBE_VF_NOTIFY_RESUME:
> +		ixgbe_restore_setting(adapter, vf);
> +		break;
>   	default:
>   		e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]);
>   		retval = IXGBE_ERR_MBX;
>

I really don't think the VF should be sending us a message telling us to 
restore settings.  Why not just use the existing messages?

The VF as it is now can survive a suspend/resume cycle for the entire 
system.  That means the VF is reset via a power cycle of the PF.  If we 
can resume our previous state after that we should be able to do so 
without needing to add extra code to the mailbox API.


* Re: [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver
  2015-10-21 16:37 ` [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate " Lan Tianyu
@ 2015-10-21 20:45   ` Alexander Duyck
  2015-10-25  7:21     ` Lan, Tianyu
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Duyck @ 2015-10-21 20:45 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> This patch is to add sysfs interface state_in_pf under sysfs directory
> of VF PCI device for Qemu to get and put VF status in the PF driver during
> migration.
>
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>   drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 ++++++++++++++++++++++++-
>   1 file changed, 155 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> index ab2a2e2..89671eb 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> @@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
>   	return -ENOMEM;
>   }
>   
> +#define IXGBE_PCI_VFCOMMAND   0x4
> +#define IXGBE_PCI_VFMSIXMC    0x72
> +#define IXGBE_SRIOV_VF_OFFSET 0x180
> +#define IXGBE_SRIOV_VF_STRIDE 0x2
> +
> +#define to_adapter(dev) ((struct ixgbe_adapter *)(pci_get_drvdata(to_pci_dev(dev)->physfn)))
> +
> +struct state_in_pf {
> +	u16 command;
> +	u16 msix_message_control;
> +	struct vf_data_storage vf_data;
> +};
> +
> +static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn)
> +{
> +	u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * vfn;
> +	return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff);
> +}
> +
> +static ssize_t ixgbe_show_state_in_pf(struct device *dev,
> +				      struct device_attribute *attr, char *buf)
> +{
> +	struct ixgbe_adapter *adapter = to_adapter(dev);
> +	struct pci_dev *pdev = adapter->pdev, *vdev;
> +	struct pci_dev *vf_pdev = to_pci_dev(dev);
> +	struct ixgbe_hw *hw = &adapter->hw;
> +	struct state_in_pf *state = (struct state_in_pf *)buf;
> +	int vfn = vf_pdev->virtfn_index;
> +	u32 reg, reg_offset, vf_shift;
> +
> +	/* Clear VF mac and disable VF */
> +	ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, vfn);
> +
> +	/* Record PCI configurations */
> +	vdev = ixgbe_get_virtfn_dev(pdev, vfn);
> +	if (vdev) {
> +		pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, &state->command);
> +		pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, &state->msix_message_control);
> +	}
> +	else
> +		printk(KERN_WARNING "Unable to find VF device.\n");
> +

Formatting for the if/else is incorrect.  The else condition should be 
in brackets as well.

> +	/* Record states hold by PF */
> +	memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct vf_data_storage));
> +
> +	vf_shift = vfn % 32;
> +	reg_offset = vfn / 32;
> +
> +	reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
> +	reg &= ~(1 << vf_shift);
> +	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
> +
> +	reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
> +	reg &= ~(1 << vf_shift);
> +	IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
> +
> +	reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
> +	reg &= ~(1 << vf_shift);
> +	IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
> +
> +	return sizeof(struct state_in_pf);
> +}
> +

This is a read.  Why does it need to switch off the VF?  Also, why turn 
off the anti-spoof?  It doesn't make much sense.

> +static ssize_t ixgbe_store_state_in_pf(struct device *dev,
> +				       struct device_attribute *attr,
> +				       const char *buf, size_t count)
> +{
> +	struct ixgbe_adapter *adapter = to_adapter(dev);
> +	struct pci_dev *pdev = adapter->pdev, *vdev;
> +	struct pci_dev *vf_pdev = to_pci_dev(dev);
> +	struct state_in_pf *state = (struct state_in_pf *)buf;
> +	int vfn = vf_pdev->virtfn_index;
> +
> +	/* Check struct size */
> +	if (count != sizeof(struct state_in_pf)) {
> +		printk(KERN_ERR "State in PF size does not fit.\n");
> +		goto out;
> +	}
> +
> +	/* Restore PCI configurations */
> +	vdev = ixgbe_get_virtfn_dev(pdev, vfn);
> +	if (vdev) {
> +		pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, state->command);
> +		pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, state->msix_message_control);
> +	}
> +
> +	/* Restore states hold by PF */
> +	memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct vf_data_storage));
> +
> +  out:
> +	return count;
> +}

Just doing a memcpy to move the vfinfo over adds no value.  The fact is 
there are a number of filters that have to be configured in hardware 
afterward, and it isn't as simple as just migrating the stored values.  
As I mentioned in the case of the 82598, there are also jumbo frames to 
take into account.  If the first PF didn't have them enabled but the 
second one does, the state of the VF needs to change to account for 
that.

I really think you would be better off only migrating the data related 
to what can be configured using the ip link command and leaving other 
values such as clear_to_send at the reset value of 0. Then you can at 
least restore state from the VF after just a couple of quick messages.

> +static struct device_attribute ixgbe_per_state_in_pf_attribute =
> +	__ATTR(state_in_pf, S_IRUGO | S_IWUSR,
> +		ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
> +
> +void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
> +{
> +	struct pci_dev *pdev = adapter->pdev;
> +	struct pci_dev *vfdev;
> +	unsigned short vf_id;
> +	int pos, ret;
> +
> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
> +	if (!pos)
> +		return;
> +
> +	/* get the device ID for the VF */
> +	pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
> +
> +	vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
> +
> +	while (vfdev) {
> +		if (vfdev->is_virtfn) {
> +			ret = device_create_file(&vfdev->dev,
> +					&ixgbe_per_state_in_pf_attribute);
> +			if (ret)
> +				pr_warn("Unable to add VF attribute for dev %s,\n",
> +					dev_name(&vfdev->dev));
> +		}
> +
> +		vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);
> +	}
> +}

Driver specific sysfs is a no-go.  Otherwise we will end up with a 
different implementation of this for every driver.  You will need to 
find a way to make this generic in order to have a hope of getting this 
to be acceptable.

> +void ixgbe_remove_vf_attrib(struct ixgbe_adapter *adapter)
> +{
> +	struct pci_dev *pdev = adapter->pdev;
> +	struct pci_dev *vfdev;
> +	unsigned short vf_id;
> +	int pos;
> +
> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
> +	if (!pos)
> +		return;
> +
> +	/* get the device ID for the VF */
> +	pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
> +
> +	vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
> +
> +	while (vfdev) {
> +		if (vfdev->is_virtfn) {
> +			device_remove_file(&vfdev->dev, &ixgbe_per_state_in_pf_attribute);
> +		}
> +
> +		vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);
> +	}
> +}
> +
>   /* Note this function is called when the user wants to enable SR-IOV
>    * VFs using the now deprecated module parameter
>    */
> @@ -198,6 +349,9 @@ int ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
>   	if (!(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED))
>   		return 0;
>   
> +
> +	ixgbe_remove_vf_attrib(adapter);
> +
>   #ifdef CONFIG_PCI_IOV
>   	/*
>   	 * If our VFs are assigned we cannot shut down SR-IOV

You can probably drop the extra space you added before the function.

> @@ -284,7 +438,7 @@ static int ixgbe_pci_sriov_enable(struct pci_dev *dev, int num_vfs)
>   		return err;
>   	}
>   	ixgbe_sriov_reinit(adapter);
> -
> +	ixgbe_add_vf_attrib(adapter);
>   	return num_vfs;
>   #else
>   	return 0;

You should probably add a space here before the return.


* Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
  2015-10-21 16:37 ` [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf" Lan Tianyu
@ 2015-10-21 20:52   ` Alexander Duyck
  2015-10-22 12:51     ` Michael S. Tsirkin
  2015-10-24 15:43     ` Lan, Tianyu
  0 siblings, 2 replies; 56+ messages in thread
From: Alexander Duyck @ 2015-10-21 20:52 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> This patch is to add new sysfs interface of "notify_vf" under sysfs
> directory of VF PCI device for Qemu to notify VF when migration status
> is changed.
>
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>   drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 30 ++++++++++++++++++++++++++
>   drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  4 ++++
>   2 files changed, 34 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> index e247d67..5cc7817 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> @@ -217,10 +217,37 @@ static ssize_t ixgbe_store_state_in_pf(struct device *dev,
>   	return count;
>   }
>   
> +static ssize_t ixgbe_store_notify_vf(struct device *dev,
> +				       struct device_attribute *attr,
> +				       const char *buf, size_t count)
> +{
> +	struct ixgbe_adapter *adapter = to_adapter(dev);
> +	struct ixgbe_hw *hw = &adapter->hw;
> +	struct pci_dev *vf_pdev = to_pci_dev(dev);
> +	int vfn = vf_pdev->virtfn_index;
> +	u32 ivar;
> +
> +	/* Enable VF mailbox irq first */
> +	IXGBE_WRITE_REG(hw, IXGBE_PVTEIMS(vfn), 0x4);
> +	IXGBE_WRITE_REG(hw, IXGBE_PVTEIAM(vfn), 0x4);
> +	IXGBE_WRITE_REG(hw, IXGBE_PVTEIAC(vfn), 0x4);
> +
> +	ivar = IXGBE_READ_REG(hw, IXGBE_PVTIVAR_MISC(vfn));
> +	ivar &= ~0xFF;
> +	ivar |= 0x2 | IXGBE_IVAR_ALLOC_VAL;
> +	IXGBE_WRITE_REG(hw, IXGBE_PVTIVAR_MISC(vfn), ivar);
> +
> +	ixgbe_ping_vf(adapter, vfn);
> +	return count;
> +}
> +

NAK, this won't fly.  You can't just go in from the PF and enable 
interrupts on the VF, hoping it is configured well enough to handle an 
interrupt you decide to trigger on its behalf.

Also have you even considered the MSI-X configuration on the VF?  I 
haven't seen anything anywhere that would have migrated the VF's MSI-X 
configuration from BAR 3 on one system to the new system.

>   static struct device_attribute ixgbe_per_state_in_pf_attribute =
>   	__ATTR(state_in_pf, S_IRUGO | S_IWUSR,
>   		ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
>   
> +static struct device_attribute ixgbe_per_notify_vf_attribute =
> +	__ATTR(notify_vf, S_IWUSR, NULL, ixgbe_store_notify_vf);
> +
>   void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
>   {
>   	struct pci_dev *pdev = adapter->pdev;
> @@ -241,6 +268,8 @@ void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
>   		if (vfdev->is_virtfn) {
>   			ret = device_create_file(&vfdev->dev,
>   					&ixgbe_per_state_in_pf_attribute);
> +			ret |= device_create_file(&vfdev->dev,
> +					&ixgbe_per_notify_vf_attribute);
>   			if (ret)
>   				pr_warn("Unable to add VF attribute for dev %s,\n",
>   					dev_name(&vfdev->dev));
> @@ -269,6 +298,7 @@ void ixgbe_remove_vf_attrib(struct ixgbe_adapter *adapter)
>   	while (vfdev) {
>   		if (vfdev->is_virtfn) {
>   			device_remove_file(&vfdev->dev, &ixgbe_per_state_in_pf_attribute);
> +			device_remove_file(&vfdev->dev, &ixgbe_per_notify_vf_attribute);
>   		}
>   
>   		vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);

More driver specific sysfs.  This needs to be moved out of the driver if 
this is to be considered anything more than a proof of concept.

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
> index dd6ba59..c6ddb66 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
> @@ -2302,6 +2302,10 @@ enum {
>   #define IXGBE_PVFTDT(P)		(0x06018 + (0x40 * (P)))
>   #define IXGBE_PVFTDWBAL(P)	(0x06038 + (0x40 * (P)))
>   #define IXGBE_PVFTDWBAH(P)	(0x0603C + (0x40 * (P)))
> +#define IXGBE_PVTEIMS(P)	(0x00D00 + (4 * (P)))
> +#define IXGBE_PVTIVAR_MISC(P)	(0x04E00 + (4 * (P)))
> +#define IXGBE_PVTEIAC(P)       (0x00F00 + (4 * P))
> +#define IXGBE_PVTEIAM(P)       (0x04D00 + (4 * P))
>   
>   #define IXGBE_PVFTDWBALn(q_per_pool, vf_number, vf_q_index) \
>   		(IXGBE_PVFTDWBAL((q_per_pool)*(vf_number) + (vf_q_index)))


* Re: [RFC Patch 06/12] IXGBEVF: Add self emulation layer
  2015-10-21 16:37 ` [RFC Patch 06/12] IXGBEVF: Add self emulation layer Lan Tianyu
@ 2015-10-21 20:58   ` Alexander Duyck
  2015-10-22 12:50     ` [Qemu-devel] " Michael S. Tsirkin
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Duyck @ 2015-10-21 20:58 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> In order to restore VF function after migration, add self emulation layer
> to record regs' values during accessing regs.
>
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>   drivers/net/ethernet/intel/ixgbevf/Makefile        |  3 ++-
>   drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
>   .../net/ethernet/intel/ixgbevf/self-emulation.c    | 26 ++++++++++++++++++++++
>   drivers/net/ethernet/intel/ixgbevf/vf.h            |  5 ++++-
>   4 files changed, 33 insertions(+), 3 deletions(-)
>   create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c
>
> diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile
> index 4ce4c97..841c884 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/Makefile
> +++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
> @@ -31,7 +31,8 @@
>   
>   obj-$(CONFIG_IXGBEVF) += ixgbevf.o
>   
> -ixgbevf-objs := vf.o \
> +ixgbevf-objs := self-emulation.o \
> +		vf.o \
>                   mbx.o \
>                   ethtool.o \
>                   ixgbevf_main.o
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index a16d267..4446916 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
>   
>   	if (IXGBE_REMOVED(reg_addr))
>   		return IXGBE_FAILED_READ_REG;
> -	value = readl(reg_addr + reg);
> +	value = ixgbe_self_emul_readl(reg_addr, reg);
>   	if (unlikely(value == IXGBE_FAILED_READ_REG))
>   		ixgbevf_check_remove(hw, reg);
>   	return value;
> diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> new file mode 100644
> index 0000000..d74b2da
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> @@ -0,0 +1,26 @@
> +#include <linux/netdevice.h>
> +#include <linux/pci.h>
> +#include <linux/delay.h>
> +#include <linux/interrupt.h>
> +#include <net/arp.h>
> +
> +#include "vf.h"
> +#include "ixgbevf.h"
> +
> +static u32 hw_regs[0x4000];
> +
> +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
> +{
> +	u32 tmp;
> +
> +	tmp = readl(base + addr);
> +	hw_regs[(unsigned long)addr] = tmp;
> +
> +	return tmp;
> +}
> +
> +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
> +{
> +	hw_regs[(unsigned long)addr] = val;
> +	writel(val, (volatile void __iomem *)(base + addr));
> +}

So I see what you are doing, however I don't think this adds much 
value.  Many of the key registers for the device are not simple 
Read/Write registers.  Most of them are things like write-1-to-clear or 
some other sort of value where writing doesn't set the bit but has some 
other side effect.  Just take a look through the Datasheet at registers 
such as the VFCTRL, VFMAILBOX, or most of the interrupt registers.  The 
fact is simply storing the values off doesn't give you any real idea of 
what the state of things is.

> diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h
> index d40f036..6a3f4eb 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/vf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
> @@ -39,6 +39,9 @@
>   
>   struct ixgbe_hw;
>   
> +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr);
> +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr);
> +
>   /* iterator type for walking multicast address lists */
>   typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr,
>   				  u32 *vmdq);
> @@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 reg, u32 value)
>   
>   	if (IXGBE_REMOVED(reg_addr))
>   		return;
> -	writel(value, reg_addr + reg);
> +	ixgbe_self_emul_writel(value, reg_addr, reg);
>   }
>   
>   #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
  2015-10-21 16:37 ` [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package Lan Tianyu
@ 2015-10-21 21:14   ` Alexander Duyck
  2015-10-24 16:12     ` Lan, Tianyu
  2015-10-22 12:58   ` Michael S. Tsirkin
  1 sibling, 1 reply; 56+ messages in thread
From: Alexander Duyck @ 2015-10-21 21:14 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> When transmitting a packet, the last transmit desc of the packet
> indicates whether the packet has been sent. Current code records
> the end desc's pointer in the next_to_watch field of struct tx buffer.
> This breaks if the desc ring is shifted after migration, because
> the pointer becomes invalid. This patch replaces the recorded
> pointer with the desc count of the packet and finds the end desc
> from the first desc plus that count.
>
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>   drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |  1 +
>   drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ++++++++++++++++---
>   2 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index 775d089..c823616 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -54,6 +54,7 @@
>    */
>   struct ixgbevf_tx_buffer {
>   	union ixgbe_adv_tx_desc *next_to_watch;
> +	u16 desc_num;
>   	unsigned long time_stamp;
>   	struct sk_buff *skb;
>   	unsigned int bytecount;

So if you can't use next_to_watch why is it left in here?  Also you 
might want to take a look at moving desc_num to a different spot in the 
buffer as you are leaving a 6 byte hole in the descriptor.

> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 4446916..056841c 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct ixgbevf_ring *tx_ring,
>   			       DMA_TO_DEVICE);
>   	}
>   	tx_buffer->next_to_watch = NULL;
> +	tx_buffer->desc_num = 0;
>   	tx_buffer->skb = NULL;
>   	dma_unmap_len_set(tx_buffer, len, 0);

This opens up a race condition.  If you have a descriptor ready to be 
cleaned at offset 0 what is to prevent you from just running through the 
ring?  You likely need to find a descriptor number that cannot be valid 
to use here.

>   	/* tx_buffer must be completely set up in the transmit path */
> @@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>   	union ixgbe_adv_tx_desc *tx_desc;
>   	unsigned int total_bytes = 0, total_packets = 0;
>   	unsigned int budget = tx_ring->count / 2;
> -	unsigned int i = tx_ring->next_to_clean;
> +	int i, watch_index;
>   

Where is i being initialized?  It was here but you removed it.  Are you 
using i without initializing it?

>   	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
>   		return true;
> @@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>   	i -= tx_ring->count;
>   
>   	do {
> -		union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
> +		union ixgbe_adv_tx_desc *eop_desc;
> +
> +		if (!tx_buffer->desc_num)
> +			break;
> +
> +		if (i + tx_buffer->desc_num >= 0)
> +			watch_index = i + tx_buffer->desc_num;
> +		else
> +			watch_index = i + tx_ring->count + tx_buffer->desc_num;
>   
> -		/* if next_to_watch is not set then there is no work pending */
> +		eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
>   		if (!eop_desc)
>   			break;
>   

So I don't see how this isn't triggering Tx hangs.  I suspect for the 
simple ping case desc_num will often be 0.  The fact is there are many 
cases where first and tx_buffer_info are the same descriptor.

> @@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>   
>   		/* clear next_to_watch to prevent false hangs */
>   		tx_buffer->next_to_watch = NULL;
> +		tx_buffer->desc_num = 0;
>   
>   		/* update the statistics for this packet */
>   		total_bytes += tx_buffer->bytecount;

You cannot use 0 because 0 is a valid number.  You are using it as a 
look-ahead currently and there are cases where i is the eop_desc index.

> @@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>   	u32 tx_flags = first->tx_flags;
>   	__le32 cmd_type;
>   	u16 i = tx_ring->next_to_use;
> +	u16 start;
>   
>   	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
>   
> @@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>   
>   	/* set next_to_watch value indicating a packet is present */
>   	first->next_to_watch = tx_desc;
> +	start = first - tx_ring->tx_buffer_info;
> +	first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - start;
>   
>   	i++;
>   	if (i == tx_ring->count)

start and i could be the same value.  If you look at ixgbevf_tx_map you 
should find that if the packet is contained in a single buffer then the 
first and last descriptor in your send will be the same one.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver
  2015-10-21 16:37 ` [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver Lan Tianyu
@ 2015-10-21 21:48   ` Alexander Duyck
  2015-10-22 12:46   ` Michael S. Tsirkin
  1 sibling, 0 replies; 56+ messages in thread
From: Alexander Duyck @ 2015-10-21 21:48 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> To let the VF driver in the guest know the migration status, Qemu
> fakes PCI config regs 0xF0 and 0xF1 to expose the migration status
> and to get an ack from the VF driver.
>
> When migration starts, Qemu sets reg 0xF0 to 1, notifies the VF
> driver by triggering a mailbox msg, and waits for the VF driver to
> report that it is ready for migration (by setting reg 0xF1 to 1).
> After migration, Qemu sets reg 0xF0 to 0 and notifies the VF driver
> by a mailbox irq. The VF driver begins to restore tx/rx function
> after detecting the status change.
>
> When the VF receives the mailbox irq, it checks reg 0xF0 in the
> service task function to get the migration status and performs the
> related operations according to its value.
>
> Steps of restarting receive and transmit function
> 1) Restore VF status in the PF driver via sending mail event to PF driver
> 2) Write back reg values recorded by self emulation layer
> 3) Restart rx/tx ring
> 4) Recovery interrupt
>
> Transmit/Receive descriptor head regs are read-only and can't
> be restored by writing back the recorded reg values directly; they
> are set to 0 during VF reset. To reuse the original tx/rx rings, shift
> the desc ring so that the desc pointed to by the original head reg
> moves to the first entry of the ring, then enable the tx/rx rings.
> The VF resumes receiving and transmitting from the original head desc.
>
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>   drivers/net/ethernet/intel/ixgbevf/defines.h       |   6 ++
>   drivers/net/ethernet/intel/ixgbevf/ixgbevf.h       |   7 +-
>   drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 ++++++++++++++++++++-
>   .../net/ethernet/intel/ixgbevf/self-emulation.c    | 107 +++++++++++++++++++
>   4 files changed, 232 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h
> index 770e21a..113efd2 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/defines.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
> @@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
>   	__le32 mss_l4len_idx;
>   };
>
> +union ixgbevf_desc {
> +	union ixgbe_adv_tx_desc tx_desc;
> +	union ixgbe_adv_rx_desc rx_desc;
> +	struct ixgbe_adv_tx_context_desc tx_context_desc;
> +};
> +
>   /* Adv Transmit Descriptor Config Masks */
>   #define IXGBE_ADVTXD_DTYP_MASK	0x00F00000 /* DTYP mask */
>   #define IXGBE_ADVTXD_DTYP_CTXT	0x00200000 /* Advanced Context Desc */
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index c823616..6eab402e 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -109,7 +109,7 @@ struct ixgbevf_ring {
>   	struct ixgbevf_ring *next;
>   	struct net_device *netdev;
>   	struct device *dev;
> -	void *desc;			/* descriptor ring memory */
> +	union ixgbevf_desc *desc;	/* descriptor ring memory */
>   	dma_addr_t dma;			/* phys. address of descriptor ring */
>   	unsigned int size;		/* length in bytes */
>   	u16 count;			/* amount of descriptors */
> @@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector);
>
>   void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
>   void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
> +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
> +int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
> +void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
> +inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
> +
>
>   #ifdef DEBUG
>   char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 056841c..15ec361 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network Driver");
>   MODULE_LICENSE("GPL");
>   MODULE_VERSION(DRV_VERSION);
>
> +
> +#define MIGRATION_COMPLETED   0x00
> +#define MIGRATION_IN_PROGRESS 0x01
> +
>   #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
>   static int debug = -1;
>   module_param(debug, int, 0);
> @@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)
>   	return ring->stats.packets;
>   }
>
> +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
> +{
> +	struct ixgbevf_tx_buffer *tx_buffer = NULL;
> +	static union ixgbevf_desc *tx_desc = NULL;
> +
> +	tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
> +	if (!tx_buffer)
> +		return -ENOMEM;
> +
> +	tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count);
> +	if (!tx_desc)
> +		return -ENOMEM;
> +
> +	memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
> +	memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
> +	memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
> +
> +	memcpy(tx_buffer, r->tx_buffer_info, sizeof(struct ixgbevf_tx_buffer) * r->count);
> +	memcpy(r->tx_buffer_info, &tx_buffer[head], sizeof(struct ixgbevf_tx_buffer) * (r->count - head));
> +	memcpy(&r->tx_buffer_info[r->count - head], tx_buffer, sizeof(struct ixgbevf_tx_buffer) * head);
> +
> +	if (r->next_to_clean >= head)
> +		r->next_to_clean -= head;
> +	else
> +		r->next_to_clean += (r->count - head);
> +
> +	if (r->next_to_use >= head)
> +		r->next_to_use -= head;
> +	else
> +		r->next_to_use += (r->count - head);
> +
> +	vfree(tx_buffer);
> +	vfree(tx_desc);
> +	return 0;
> +}
> +
> +int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
> +{
> +	struct ixgbevf_rx_buffer *rx_buffer = NULL;
> +	static union ixgbevf_desc *rx_desc = NULL;
> +
> +	rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
> +	if (!rx_buffer)
> +		return -ENOMEM;
> +
> +	rx_desc = vmalloc(sizeof(union ixgbevf_desc) * (r->count));
> +	if (!rx_desc)
> +		return -ENOMEM;
> +
> +	memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
> +	memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
> +	memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head);
> +
> +	memcpy(rx_buffer, r->rx_buffer_info, sizeof(struct ixgbevf_rx_buffer) * (r->count));
> +	memcpy(r->rx_buffer_info, &rx_buffer[head], sizeof(struct ixgbevf_rx_buffer) * (r->count - head));
> +	memcpy(&r->rx_buffer_info[r->count - head], rx_buffer, sizeof(struct ixgbevf_rx_buffer) * head);
> +
> +	if (r->next_to_clean >= head)
> +		r->next_to_clean -= head;
> +	else
> +		r->next_to_clean += (r->count - head);
> +
> +	if (r->next_to_use >= head)
> +		r->next_to_use -= head;
> +	else
> +		r->next_to_use += (r->count - head);
> +
> +	vfree(rx_buffer);
> +	vfree(rx_desc);
> +	return 0;
> +}
> +
>   static u32 ixgbevf_get_tx_pending(struct ixgbevf_ring *ring)
>   {
>   	struct ixgbevf_adapter *adapter = netdev_priv(ring->netdev);
> @@ -1122,7 +1198,7 @@ static int ixgbevf_busy_poll_recv(struct napi_struct *napi)
>    * ixgbevf_configure_msix sets up the hardware to properly generate MSI-X
>    * interrupts.
>    **/
> -static void ixgbevf_configure_msix(struct ixgbevf_adapter *adapter)
> +static  void ixgbevf_configure_msix(struct ixgbevf_adapter *adapter)
>   {
>   	struct ixgbevf_q_vector *q_vector;
>   	int q_vectors, v_idx;
> @@ -1534,7 +1610,7 @@ static inline void ixgbevf_irq_disable(struct ixgbevf_adapter *adapter)
>    * ixgbevf_irq_enable - Enable default interrupt generation settings
>    * @adapter: board private structure
>    **/
> -static inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter)
> +inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter)
>   {
>   	struct ixgbe_hw *hw = &adapter->hw;
>
> @@ -2901,6 +2977,36 @@ static void ixgbevf_watchdog_subtask(struct ixgbevf_adapter *adapter)
>   	ixgbevf_update_stats(adapter);
>   }
>
> +int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
> +{
> +	struct pci_dev *pdev = adapter->pdev;
> + 	static int migration_status = MIGRATION_COMPLETED;
> +	u8 val;
> +
> +	if (migration_status == MIGRATION_COMPLETED) {
> +		pci_read_config_byte(pdev, 0xf0, &val);
> +		if (!val)
> +			return 0;
> +
> +		del_timer_sync(&adapter->service_timer);
> +		pr_info("migration start\n");
> +		migration_status = MIGRATION_IN_PROGRESS;
> +
> +		/* Tell Qemu VF is ready for migration. */
> +		pci_write_config_byte(pdev, 0xf1, 0x1);
> +		return 1;
> +	} else {
> +		pci_read_config_byte(pdev, 0xf0, &val);
> +		if (val)
> +			return 1;
> +
> +		ixgbevf_restore_state(adapter);
> +		migration_status = MIGRATION_COMPLETED;
> +		pr_info("migration end\n");
> +		return 0;
> +	}
> +}
> +

Correct me if I'm wrong but isn't migration_status going to affect all 
VFs on a given system?  Seems like that might be a bit racy if you were 
to have a VM with more than one VF present.  It seems to me 
migration_status should probably be a part of the adapter or hw structs.

>   /**
>    * ixgbevf_service_task - manages and runs subtasks
>    * @work: pointer to work_struct containing our data
> @@ -2912,6 +3018,11 @@ static void ixgbevf_service_task(struct work_struct *work)
>   						       service_task);
>   	struct ixgbe_hw *hw = &adapter->hw;
>
> +	if (ixgbevf_live_mg(adapter)) {
> +		ixgbevf_service_event_complete(adapter);
> +		return;
> +	}
> +
>   	if (IXGBE_REMOVED(hw->hw_addr)) {
>   		if (!test_bit(__IXGBEVF_DOWN, &adapter->state)) {
>   			rtnl_lock();
> diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> index d74b2da..4476428 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> @@ -9,6 +9,8 @@
>
>   static u32 hw_regs[0x4000];
>
> +#define RESTORE_REG(hw, reg) IXGBE_WRITE_REG(hw, reg, hw_regs[reg])
> +
>   u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
>   {
>   	u32 tmp;
> @@ -24,3 +26,108 @@ void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
>   	hw_regs[(unsigned long)addr] = val;
>   	writel(val, (volatile void __iomem *)(base + addr));
>   }
> +
> +static u32 restore_regs[] = {
> +	IXGBE_VTIVAR(0),
> +	IXGBE_VTIVAR(1),
> +	IXGBE_VTIVAR(2),
> +	IXGBE_VTIVAR(3),
> +	IXGBE_VTIVAR_MISC,
> +	IXGBE_VTEITR(0),
> +	IXGBE_VTEITR(1),
> +	IXGBE_VFPSRTYPE,
> +};
> +

Most of these registers don't need to be copied over.  They can just be 
configured from their existing values.  For example the IVARs have a 
function that already exist to configure them.  You could probably get 
away with just calling ixgbevf_configure_msix to restore most of that 
information.  Same thing for EITR and PSRTYPE.  The fact is most of this 
doesn't need to be saved and could just be reconfigured based on 
power-on values.

> +void ixgbevf_restore_state(struct ixgbevf_adapter *adapter)
> +{
> +	struct ixgbe_hw *hw = &adapter->hw;
> +	struct ixgbe_mbx_info *mbx = &hw->mbx;
> +	int i;
> +	u32 timeout = IXGBE_VF_INIT_TIMEOUT, rdh, tdh, rxdctl, txdctl;
> +	u32 wait_loop = 10;
> +
> +	/* VF resetting */
> +	IXGBE_WRITE_REG(hw, IXGBE_VFCTRL, IXGBE_CTRL_RST);
> +	IXGBE_WRITE_FLUSH(hw);
> +
> +	while (!mbx->ops.check_for_rst(hw) && timeout) {
> +		timeout--;
> +		udelay(5);
> +	}
> +	if (!timeout)
> +		printk(KERN_ERR "[IXGBEVF] Unable to reset VF.\n");
> +
> +	/* Restoring VF status in the status */
> +	hw->mac.ops.notify_resume(hw);
> +

This seems like a recipe for putting the VF in a bad state.  It seems 
like if you are going to go though the process here you might was well 
just call the ixgbevf_reset function and wait.  Doing your own hand 
coded version of the reset_hw function seems like a recipe for disaster.

> +	/* Restoring regs value */
> +	for (i = 0; i < sizeof(restore_regs)/sizeof(u32); i++)
> +		writel(hw_regs[restore_regs[i]], (volatile void *)(restore_regs[i] + hw->hw_addr));
> +
> +	/* Restoring rx ring */
> +	for (i = 0; i < adapter->num_rx_queues; i++) {
> +		if (hw_regs[IXGBE_VFRXDCTL(i)] & IXGBE_RXDCTL_ENABLE) {
> +			RESTORE_REG(hw, IXGBE_VFRDBAL(i));
> +			RESTORE_REG(hw, IXGBE_VFRDBAH(i));
> +			RESTORE_REG(hw, IXGBE_VFRDLEN(i));
> +			RESTORE_REG(hw, IXGBE_VFDCA_RXCTRL(i));
> +			RESTORE_REG(hw, IXGBE_VFSRRCTL(i));
> +
> +			rdh = adapter->rx_ring[i]->next_to_clean;
> +			while (IXGBEVF_RX_DESC(adapter->rx_ring[i], rdh)->wb.upper.status_error
> +			       & cpu_to_le32(IXGBE_RXD_STAT_DD))
> +				rdh = (rdh + 1) % adapter->rx_ring[i]->count;
> +
> +			ixgbevf_rx_ring_shift(adapter->rx_ring[i], rdh);
> +
> +			wait_loop = 10;
> +			RESTORE_REG(hw, IXGBE_VFRXDCTL(i));
> +			do {
> +				udelay(10);
> +				rxdctl = IXGBE_READ_REG(hw, IXGBE_VFRXDCTL(i));
> +			} while (--wait_loop && !(rxdctl & IXGBE_RXDCTL_ENABLE));
> +
> +			if (!wait_loop)
> +				pr_err("RXDCTL.ENABLE queue %d not cleared while polling\n",
> +				       i);
> +
> +			IXGBE_WRITE_REG(hw, IXGBE_VFRDT(i), adapter->rx_ring[i]->next_to_use);
> +		}
> +	}

This could probably be replaced with ixgbevf_configure_rx_ring().  You 
would just need to pull out ixgbevf_alloc_rx_buffers from the call and 
handle that in ixgbevf_configure_rx instead.

Also you probably don't need to check the RXDCTL_ENABLE flag, instead 
you could just check for netif_running().


> +	/* Restoring tx ring */
> +	for (i = 0; i < adapter->num_tx_queues; i++) {
> +		if (hw_regs[IXGBE_VFTXDCTL(i)] & IXGBE_TXDCTL_ENABLE) {
> +			RESTORE_REG(hw, IXGBE_VFTDBAL(i));
> +			RESTORE_REG(hw, IXGBE_VFTDBAH(i));
> +			RESTORE_REG(hw, IXGBE_VFTDLEN(i));
> +			RESTORE_REG(hw, IXGBE_VFDCA_TXCTRL(i));
> +
> +			tdh = adapter->tx_ring[i]->next_to_clean;
> +			while (IXGBEVF_TX_DESC(adapter->tx_ring[i], tdh)->wb.status
> +			       & cpu_to_le32(IXGBE_TXD_STAT_DD))
> +				tdh = (tdh + 1) % adapter->tx_ring[i]->count;
> +			ixgbevf_tx_ring_shift(adapter->tx_ring[i], tdh);
> +
> +			wait_loop = 10;
> +			RESTORE_REG(hw, IXGBE_VFTXDCTL(i));
> +			do {
> +				udelay(2000);
> +				txdctl = IXGBE_READ_REG(hw, IXGBE_VFTXDCTL(i));
> +			} while (--wait_loop && !(txdctl & IXGBE_TXDCTL_ENABLE));
> +
> +			if (!wait_loop)
> +				pr_err("Could not enable Tx Queue %d\n", i);
> +	
> +			IXGBE_WRITE_REG(hw, IXGBE_VFTDT(i), adapter->tx_ring[i]->next_to_use);
> +		}
> +	}
> +

Same here.  You are adding a bunch of code that already exists at other 
places in the driver.

> +	/* Restore irq */
> +	IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, hw_regs[IXGBE_VTEIMS] & 0x7);
> +	IXGBE_WRITE_REG(hw, IXGBE_VTEIMC, (~hw_regs[IXGBE_VTEIMS]) & 0x7);


You might just want to clear all possible interrupts first, and then 
just enable the ones already set in EIMS.

> +	IXGBE_WRITE_REG(hw, IXGBE_VTEICS, hw_regs[IXGBE_VTEICS]);


As far as EICS you would probably want to just fire every interrupt once 
to flush them out.  No point in only enabling what is in EICS.

> +	ixgbevf_irq_enable(adapter);
> +}
> +
>

Why bother with all that when you could have just called 
ixgbevf_irq_enable in the first place.  You end up writing the same 
registers twice when you could have saved yourself the trouble and 
probably just called ixgbevf_irq_enable which will already take care of 
enabling everything and it will do it correctly.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation
  2015-10-21 16:37 ` [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation Lan Tianyu
@ 2015-10-21 21:55   ` Alexander Duyck
  2015-10-22 12:40   ` Michael S. Tsirkin
  1 sibling, 0 replies; 56+ messages in thread
From: Alexander Duyck @ 2015-10-21 21:55 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> Ring shifting during VF function restore may race with normal
> ring operation (transmitting/receiving packets). This patch adds
> tx/rx locks to protect ring-related data.
>
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>   drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |  2 ++
>   drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 ++++++++++++++++++++---
>   2 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index 6eab402e..3a748c8 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -448,6 +448,8 @@ struct ixgbevf_adapter {
>
>   	spinlock_t mbx_lock;
>   	unsigned long last_reset;
> +	spinlock_t mg_rx_lock;
> +	spinlock_t mg_tx_lock;
>   };
>

Really, a shared lock for all of the Rx or Tx rings?  This is going to 
kill any chance at performance.  Especially since just recently the VFs 
got support for RSS.

To top it off it also means we cannot clean Tx while adding new buffers 
which will kill Tx performance.

The other concern I have is what is supposed to prevent the hardware 
from accessing the rings while you are reading?  I suspect nothing so I 
don't see how this helps anything.

I would honestly say you are better off just giving up on all of the 
data stored in the descriptor rings rather than trying to restore them. 
Yes you are going to lose a few packets but you don't have the risk 
for races that this code introduces.

>   enum ixbgevf_state_t {
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 15ec361..04b6ce7 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)
>
>   int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
>   {
> +	struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
>   	struct ixgbevf_tx_buffer *tx_buffer = NULL;
>   	static union ixgbevf_desc *tx_desc = NULL;
> +	unsigned long flags;
>
>   	tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
>   	if (!tx_buffer)
> @@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
>   	if (!tx_desc)
>   		return -ENOMEM;
>
> +	spin_lock_irqsave(&adapter->mg_tx_lock, flags);
>   	memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
>   	memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
>   	memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
> @@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
>   	else
>   		r->next_to_use += (r->count - head);
>
> +	spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
> +
>   	vfree(tx_buffer);
>   	vfree(tx_desc);
>   	return 0;
> @@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
>
>   int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
>   {
> +	struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
>   	struct ixgbevf_rx_buffer *rx_buffer = NULL;
>   	static union ixgbevf_desc *rx_desc = NULL;
> +	unsigned long flags;	
>
>   	rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
>   	if (!rx_buffer)
> @@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
>   	if (!rx_desc)
>   		return -ENOMEM;
>
> +	spin_lock_irqsave(&adapter->mg_rx_lock, flags);
>   	memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
>   	memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
>   	memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head);
> @@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
>   		r->next_to_use -= head;
>   	else
>   		r->next_to_use += (r->count - head);
> +	spin_unlock_irqrestore(&adapter->mg_rx_lock, flags);
>
>   	vfree(rx_buffer);
>   	vfree(rx_desc);
> @@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>   	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
>   		return true;
>
> +	spin_lock(&adapter->mg_tx_lock);
> +	i = tx_ring->next_to_clean;
>   	tx_buffer = &tx_ring->tx_buffer_info[i];
>   	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
>   	i -= tx_ring->count;
> @@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>   	q_vector->tx.total_bytes += total_bytes;
>   	q_vector->tx.total_packets += total_packets;
>
> +	spin_unlock(&adapter->mg_tx_lock);
> +
>   	if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) {
>   		struct ixgbe_hw *hw = &adapter->hw;
>   		union ixgbe_adv_tx_desc *eop_desc;
> @@ -999,10 +1012,12 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
>   				struct ixgbevf_ring *rx_ring,
>   				int budget)
>   {
> +	struct ixgbevf_adapter *adapter = netdev_priv(rx_ring->netdev);
>   	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>   	u16 cleaned_count = ixgbevf_desc_unused(rx_ring);
>   	struct sk_buff *skb = rx_ring->skb;
>
> +	spin_lock(&adapter->mg_rx_lock);
>   	while (likely(total_rx_packets < budget)) {
>   		union ixgbe_adv_rx_desc *rx_desc;
>
> @@ -1078,6 +1093,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
>   	q_vector->rx.total_packets += total_rx_packets;
>   	q_vector->rx.total_bytes += total_rx_bytes;
>
> +	spin_unlock(&adapter->mg_rx_lock);
>   	return total_rx_packets;
>   }
>
> @@ -3572,14 +3588,17 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>   	struct ixgbevf_tx_buffer *tx_buffer;
>   	union ixgbe_adv_tx_desc *tx_desc;
>   	struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
> +	struct ixgbevf_adapter *adapter = netdev_priv(tx_ring->netdev);
>   	unsigned int data_len = skb->data_len;
>   	unsigned int size = skb_headlen(skb);
>   	unsigned int paylen = skb->len - hdr_len;
> +	unsigned long flags;
>   	u32 tx_flags = first->tx_flags;
>   	__le32 cmd_type;
> -	u16 i = tx_ring->next_to_use;
> -	u16 start;
> +	u16 i, start;
>
> +	spin_lock_irqsave(&adapter->mg_tx_lock, flags);
> +	i = tx_ring->next_to_use;
>   	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
>
>   	ixgbevf_tx_olinfo_status(tx_desc, tx_flags, paylen);
> @@ -3673,7 +3692,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>
>   	/* notify HW of packet */
>   	ixgbevf_write_tail(tx_ring, i);
> -
> +	spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
>   	return;
>   dma_error:
>   	dev_err(tx_ring->dev, "TX DMA map failed\n");
> @@ -3690,6 +3709,7 @@ dma_error:
>   	}
>
>   	tx_ring->next_to_use = i;
> +	spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
>   }
>
>   static int __ixgbevf_maybe_stop_tx(struct ixgbevf_ring *tx_ring, int size)
> @@ -4188,6 +4208,8 @@ static int ixgbevf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>   		break;
>   	}
>
> +	spin_lock_init(&adapter->mg_tx_lock);
> +	spin_lock_init(&adapter->mg_rx_lock);
>   	return 0;
>
>   err_register:
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-21 19:20   ` Alex Williamson
@ 2015-10-21 23:26     ` Alexander Duyck
  2015-10-22 12:32     ` [Qemu-devel] " Michael S. Tsirkin
  2015-10-22 15:58     ` Or Gerlitz
  2 siblings, 0 replies; 56+ messages in thread
From: Alexander Duyck @ 2015-10-21 23:26 UTC (permalink / raw)
  To: Alex Williamson, Or Gerlitz
  Cc: Lan Tianyu,
	Michael S. Tsirkin <mst@redhat.com> (mst@redhat.com),
	bhelgaas, carolyn.wyborny, Skidmore, Donald C, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, Paolo Bonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, Jeff Kirsher, Jesse Brandeburg,
	john.ronciak, Linux Kernel, linux-pci, matthew.vick,
	Mitch Williams, Linux Netdev List, Shannon Nelson

On 10/21/2015 12:20 PM, Alex Williamson wrote:
> On Wed, 2015-10-21 at 21:45 +0300, Or Gerlitz wrote:
>> On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu <tianyu.lan@intel.com> wrote:
>>> This patchset is to propose a new solution to add live migration support
>>> for 82599 SRIOV network card.
>>
>>> In our solution, we prefer to put all device specific operation into VF and
>>> PF driver and make code in the Qemu more general.
>>
>> [...]
>>
>>> Service down time test
>>> So far, we have tested migration between two laptops with 82599 NICs
>>> connected to a gigabit switch, pinging the VF at a 0.001s interval
>>> during migration from the source-side host. Its service down
>>> time is about 180ms.
>>
>> So... what would you expect service down wise for the following
>> solution which is zero touch and I think should work for any VF
>> driver:
>>
>> on host A: unplug the VM and conduct live migration to host B ala the
>> no-SRIOV case.
>
> The trouble here is that the VF needs to be unplugged prior to the start
> of migration because we can't do effective dirty page tracking while the
> device is connected and doing DMA.  So the downtime, assuming we're
> counting only VF connectivity, is dependent on memory size, rate of
> dirtying, and network bandwidth; seconds for small guests, minutes or
> more (maybe much, much more) for large guests.

The question of dirty page tracking though should be pretty simple.  We 
start the Tx packets out as dirty so we don't need to add anything 
there.  It seems like the Rx data and Tx/Rx descriptor rings are the issue.

> This is why the typical VF-agnostic approach here is to use bonding
> and fail over to an emulated device during migration, so performance
> suffers, but downtime is something acceptable.
>
> If we want the ability to defer the VF unplug until just before the
> final stages of the migration, we need the VF to participate in dirty
> page tracking.  Here it's done via an enlightened guest driver.  Alex
> Graf presented a solution using a device specific enlightenment in QEMU.
> Otherwise we'd need hardware support from the IOMMU.

My only real complaint with this patch series is that it seems like
there was too much focus on instrumenting the driver instead of providing
the code necessary to enable a driver ecosystem that supports migration.

I don't know if what we need is a full hardware IOMMU.  It seems like a 
good way to take care of the need to flag dirty pages for DMA capable 
devices would be to add functionality to the dma_map_ops calls 
sync_{sg|single}for_cpu and unmap_{page|sg} so that they would take care 
of mapping the pages as dirty for us when needed.  We could probably 
make do with just a few tweaks to existing API in order to make this work.
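
The dma_map_ops idea above could be sketched as a thin hook that marks every page in a synced DMA range dirty in a bitmap; a minimal userspace model, with all names (the bitmap, the hook) hypothetical rather than actual kernel API:

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define NPAGES     64

/* Model of a per-VM dirty log the hypervisor would consume. */
static unsigned long dirty_bitmap[NPAGES / (8 * sizeof(unsigned long))];

static void set_page_dirty(size_t pfn)
{
	dirty_bitmap[pfn / (8 * sizeof(unsigned long))] |=
		1UL << (pfn % (8 * sizeof(unsigned long)));
}

static bool page_is_dirty(size_t pfn)
{
	return dirty_bitmap[pfn / (8 * sizeof(unsigned long))] &
	       (1UL << (pfn % (8 * sizeof(unsigned long))));
}

/* Hypothetical hook: a dma-ops sync_single_for_cpu wrapper would mark
 * every page the device may have written as dirty, so the dirty log
 * stays correct without per-driver instrumentation. */
static void sync_single_for_cpu_hook(unsigned long dma_addr, size_t size)
{
	size_t first = dma_addr >> PAGE_SHIFT;
	size_t last  = (dma_addr + size - 1) >> PAGE_SHIFT;
	size_t pfn;

	for (pfn = first; pfn <= last; pfn++)
		set_page_dirty(pfn);
}
```

A single sync call covering a buffer that straddles a page boundary would dirty both pages, which is exactly what a generic dma-ops-level approach buys over per-driver dummy writes.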

As far as the descriptor rings go, I would argue they are invalid as soon as
we migrate.  The problem is there is no way to guarantee ordering, as we
cannot pre-emptively mark an Rx data buffer as being a dirty page when 
we haven't even looked at the Rx descriptor for the given buffer yet. 
Tx has similar issues as we cannot guarantee the Tx will disable itself 
after a complete frame.  As such I would say the moment we migrate we 
should just give up on the frames that are still in the descriptor 
rings, drop them, and then start over with fresh rings.

- Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 12/12] IXGBEVF: Track dma dirty pages
  2015-10-21 16:37 ` [RFC Patch 12/12] IXGBEVF: Track dma dirty pages Lan Tianyu
@ 2015-10-22 12:30   ` Michael S. Tsirkin
  0 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 12:30 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On Thu, Oct 22, 2015 at 12:37:44AM +0800, Lan Tianyu wrote:
> Migration relies on tracking dirty pages to migrate memory.
> Hardware can't automatically mark a page as dirty after DMA
> memory access. VF descriptor rings and data buffers are modified
> by hardware when receiving and transmitting data. To track such dirty
> memory manually, do dummy writes (read a byte and write it back)
> during receive and transmit.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index d22160f..ce7bd7a 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -414,6 +414,9 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>  		if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
>  			break;
>  
> +		/* write back status to mark page dirty */

Which page? the descriptor ring?  What does marking it dirty accomplish
though, given that we might migrate right before this happens?

It might be a good idea to just specify addresses of rings
to hypervisor, and have it send the ring pages after VM
and the VF are stopped.


> +		eop_desc->wb.status = eop_desc->wb.status;
> +
Compiler is likely to optimize this out.
You also probably need a wmb here ...

>  		/* clear next_to_watch to prevent false hangs */
>  		tx_buffer->next_to_watch = NULL;
>  		tx_buffer->desc_num = 0;
> @@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct ixgbevf_ring *rx_ring,
>  {
>  	struct ixgbevf_rx_buffer *rx_buffer;
>  	struct page *page;
> +	u8 *page_addr;
>  
>  	rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
>  	page = rx_buffer->page;
>  	prefetchw(page);
>  
> -	if (likely(!skb)) {
> -		void *page_addr = page_address(page) +
> -				  rx_buffer->page_offset;
> +	/* Mark page dirty */

Looks like there's a race condition here: VM could
migrate at this point. RX ring will indicate
packet has been received, but page data would be stale.


One solution I see is explicitly testing for this
condition and discarding the packet.
For example, hypervisor could increment some counter
in RAM during migration.

Then:

	x = read counter

	get packet from rx ring
	mark page dirty

	y = read counter

	if (x != y)
		discard packet
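
The counter scheme sketched above can be modeled as a seqlock-style generation check; a minimal userspace sketch, where the counter name and the callback are hypothetical stand-ins for "hypervisor bumps a word in guest RAM on migration":

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter the hypervisor would increment in guest RAM
 * each time a migration round occurs. */
static atomic_uint migration_gen;

/* Returns true if the packet can be trusted: no migration happened
 * between reading the descriptor and dirtying the data page. */
static bool rx_packet_valid(void (*consume_packet)(void))
{
	unsigned int x = atomic_load(&migration_gen);

	consume_packet();	/* get packet from rx ring, mark page dirty */

	unsigned int y = atomic_load(&migration_gen);
	return x == y;		/* discard the packet if they differ */
}
```

If the counter changed mid-sequence, the page contents may be stale on the destination, so the driver drops the packet rather than delivering possibly corrupt data.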


> +	page_addr = page_address(page) + rx_buffer->page_offset;
> +	*page_addr = *page_addr;

Compiler is likely to optimize this out.
You also probably need a wmb here ...


>  
> +	if (likely(!skb)) {
>  		/* prefetch first cache line of first page */
>  		prefetch(page_addr);

prefetch makes no sense if you read it right here.

>  #if L1_CACHE_BYTES < 128
> @@ -1032,6 +1037,9 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
>  		if (!ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
>  			break;
>  
> +		/* Write back status to mark page dirty */
> +		rx_desc->wb.upper.status_error = rx_desc->wb.upper.status_error;
> +

same question as for tx.

>  		/* This memory barrier is needed to keep us from reading
>  		 * any other fields out of the rx_desc until we know the
>  		 * RXD_STAT_DD bit is set
> -- 
> 1.8.4.rc0.1.g8f6a3e5.dirty
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-21 19:20   ` Alex Williamson
  2015-10-21 23:26     ` Alexander Duyck
@ 2015-10-22 12:32     ` Michael S. Tsirkin
  2015-10-22 13:01       ` Alex Williamson
  2015-10-22 15:58     ` Or Gerlitz
  2 siblings, 1 reply; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 12:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Or Gerlitz, emil.s.tantilov, kvm, linux-pci, qemu-devel,
	Jesse Brandeburg, carolyn.wyborny, Skidmore, Donald C, agraf,
	matthew.vick, intel-wired-lan, Jeff Kirsher, yang.z.zhang,
	Mitch Williams, nrupal.jani, bhelgaas, Lan Tianyu,
	Linux Netdev List, Shannon Nelson, eddie.dong, Linux Kernel,
	john.ronciak, Paolo Bonzini

On Wed, Oct 21, 2015 at 01:20:27PM -0600, Alex Williamson wrote:
> The trouble here is that the VF needs to be unplugged prior to the start
> of migration because we can't do effective dirty page tracking while the
> device is connected and doing DMA.

That's exactly what patch 12/12 is trying to accomplish.

I do see some problems with it, but I also suggested some solutions.

-- 
MST

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 11/12] IXGBEVF: Migrate VF statistic data
  2015-10-21 16:37 ` [RFC Patch 11/12] IXGBEVF: Migrate VF statistic data Lan Tianyu
@ 2015-10-22 12:36   ` Michael S. Tsirkin
  0 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 12:36 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: shannon.nelson, emil.s.tantilov, kvm, linux-pci,
	donald.c.skidmore, mitch.a.williams, eddie.dong, agraf,
	qemu-devel, yang.z.zhang, nrupal.jani, john.ronciak,
	intel-wired-lan, jeffrey.t.kirsher, jesse.brandeburg, bhelgaas,
	pbonzini, carolyn.wyborny, matthew.vick, netdev, linux-kernel

On Thu, Oct 22, 2015 at 12:37:43AM +0800, Lan Tianyu wrote:
> VF statistic regs are read-only and can't be migrated by writing them
> back directly.
> 
> Currently, the statistic data returned to user space by the driver is not
> equal to the value of the statistic regs. The VF driver records the reg
> values as base data when the net interface is brought up, calculates the
> count accumulated during the last period of online service, and adds it
> to the saved_reset data. When user space collects statistics, the VF
> driver returns "current - base + saved_reset", where "current" is the
> reg value at that point.
> 
> Restoring the net function after migration is just like bringing the net
> interface up. Call the existing functions to update the base and
> saved_reset data so the statistics stay continuous across migration.
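
The "current - base + saved_reset" scheme described in the commit message can be sketched in a few lines; a userspace model with hypothetical struct and function names:

```c
#include <stdint.h>

/* Hardware counters are read-only and reset to 0 on VF reset, so the
 * driver reports  current - base + saved_reset  and folds the old
 * total into saved_reset at every (re)open or migration restore. */
struct vf_stat {
	uint64_t base;		/* hw counter value at last open/restore */
	uint64_t saved_reset;	/* total accumulated before last reset */
};

static uint64_t stat_read(const struct vf_stat *s, uint64_t hw_counter)
{
	return hw_counter - s->base + s->saved_reset;
}

/* Called when the interface is opened or restored after migration:
 * preserve the running total, then rebase on the (reset) counter. */
static void stat_rebase(struct vf_stat *s, uint64_t last_total,
			uint64_t hw_counter)
{
	s->saved_reset = last_total;
	s->base = hw_counter;
}
```

After a VF reset zeroes the hardware counter, rebasing keeps the user-visible statistics monotonically increasing across the migration.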
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 04b6ce7..d22160f 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -3005,6 +3005,7 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
>  			return 0;
>  
>  		del_timer_sync(&adapter->service_timer);
> +		ixgbevf_update_stats(adapter);
>  		pr_info("migration start\n");
>  		migration_status = MIGRATION_IN_PROGRESS; 
>  

So far, it seems that the only two things done when
starting migration are very small.

It doesn't seem worth it to let guests defer migration for things like
this.  Surely cancelling a timer can be done later, after the VM is
migrated?



> @@ -3017,6 +3018,8 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
>  			return 1;
>  
>  		ixgbevf_restore_state(adapter);
> +		ixgbevf_save_reset_stats(adapter);
> +		ixgbevf_init_last_counter_stats(adapter);
>  		migration_status = MIGRATION_COMPLETED;
>  		pr_info("migration end\n");
>  		return 0;
> -- 
> 1.8.4.rc0.1.g8f6a3e5.dirty
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation
  2015-10-21 16:37 ` [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation Lan Tianyu
  2015-10-21 21:55   ` Alexander Duyck
@ 2015-10-22 12:40   ` Michael S. Tsirkin
  1 sibling, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 12:40 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On Thu, Oct 22, 2015 at 12:37:42AM +0800, Lan Tianyu wrote:
> Ring shifting while restoring the VF function may race with normal
> ring operation (transmitting/receiving packets). This patch adds tx/rx
> locks to protect ring-related data.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>

That's adding a bunch of locking on the data path - what's the
performance impact?

Can't you do something faster? E.g. migration things
are slow path - can't you use something like RCU
to flush outstanding work?


> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |  2 ++
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 ++++++++++++++++++++---
>  2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index 6eab402e..3a748c8 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -448,6 +448,8 @@ struct ixgbevf_adapter {
>  
>  	spinlock_t mbx_lock;
>  	unsigned long last_reset;
> +	spinlock_t mg_rx_lock;
> +	spinlock_t mg_tx_lock;
>  };
>  
>  enum ixbgevf_state_t {
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 15ec361..04b6ce7 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)
>  
>  int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
>  {
> +	struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
>  	struct ixgbevf_tx_buffer *tx_buffer = NULL;
>  	static union ixgbevf_desc *tx_desc = NULL;
> +	unsigned long flags;
>  
>  	tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
>  	if (!tx_buffer)
> @@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
>  	if (!tx_desc)
>  		return -ENOMEM;
>  
> +	spin_lock_irqsave(&adapter->mg_tx_lock, flags);
>  	memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
>  	memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
>  	memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
> @@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
>  	else
>  		r->next_to_use += (r->count - head);
>  
> +	spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
> +
>  	vfree(tx_buffer);
>  	vfree(tx_desc);
>  	return 0;
> @@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
>  
>  int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
>  {
> +	struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
>  	struct ixgbevf_rx_buffer *rx_buffer = NULL;
>  	static union ixgbevf_desc *rx_desc = NULL;
> +	unsigned long flags;	
>  
>  	rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
>  	if (!rx_buffer)
> @@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
>  	if (!rx_desc)
>  		return -ENOMEM;
>  
> +	spin_lock_irqsave(&adapter->mg_rx_lock, flags);
>  	memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
>  	memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
>  	memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head);
> @@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
>  		r->next_to_use -= head;
>  	else
>  		r->next_to_use += (r->count - head);
> +	spin_unlock_irqrestore(&adapter->mg_rx_lock, flags);
>  
>  	vfree(rx_buffer);
>  	vfree(rx_desc);
> @@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>  	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
>  		return true;
>  
> +	spin_lock(&adapter->mg_tx_lock);
> +	i = tx_ring->next_to_clean;
>  	tx_buffer = &tx_ring->tx_buffer_info[i];
>  	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
>  	i -= tx_ring->count;
> @@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>  	q_vector->tx.total_bytes += total_bytes;
>  	q_vector->tx.total_packets += total_packets;
>  
> +	spin_unlock(&adapter->mg_tx_lock);
> +
>  	if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) {
>  		struct ixgbe_hw *hw = &adapter->hw;
>  		union ixgbe_adv_tx_desc *eop_desc;
> @@ -999,10 +1012,12 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
>  				struct ixgbevf_ring *rx_ring,
>  				int budget)
>  {
> +	struct ixgbevf_adapter *adapter = netdev_priv(rx_ring->netdev);
>  	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>  	u16 cleaned_count = ixgbevf_desc_unused(rx_ring);
>  	struct sk_buff *skb = rx_ring->skb;
>  
> +	spin_lock(&adapter->mg_rx_lock);
>  	while (likely(total_rx_packets < budget)) {
>  		union ixgbe_adv_rx_desc *rx_desc;
>  
> @@ -1078,6 +1093,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
>  	q_vector->rx.total_packets += total_rx_packets;
>  	q_vector->rx.total_bytes += total_rx_bytes;
>  
> +	spin_unlock(&adapter->mg_rx_lock);
>  	return total_rx_packets;
>  }
>  
> @@ -3572,14 +3588,17 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>  	struct ixgbevf_tx_buffer *tx_buffer;
>  	union ixgbe_adv_tx_desc *tx_desc;
>  	struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
> +	struct ixgbevf_adapter *adapter = netdev_priv(tx_ring->netdev);
>  	unsigned int data_len = skb->data_len;
>  	unsigned int size = skb_headlen(skb);
>  	unsigned int paylen = skb->len - hdr_len;
> +	unsigned long flags;
>  	u32 tx_flags = first->tx_flags;
>  	__le32 cmd_type;
> -	u16 i = tx_ring->next_to_use;
> -	u16 start;
> +	u16 i, start;     
>  
> +	spin_lock_irqsave(&adapter->mg_tx_lock, flags);
> +	i = tx_ring->next_to_use;
>  	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
>  
>  	ixgbevf_tx_olinfo_status(tx_desc, tx_flags, paylen);
> @@ -3673,7 +3692,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>  
>  	/* notify HW of packet */
>  	ixgbevf_write_tail(tx_ring, i);
> -
> +	spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
>  	return;
>  dma_error:
>  	dev_err(tx_ring->dev, "TX DMA map failed\n");
> @@ -3690,6 +3709,7 @@ dma_error:
>  	}
>  
>  	tx_ring->next_to_use = i;
> +	spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
>  }
>  
>  static int __ixgbevf_maybe_stop_tx(struct ixgbevf_ring *tx_ring, int size)
> @@ -4188,6 +4208,8 @@ static int ixgbevf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		break;
>  	}
>  
> +	spin_lock_init(&adapter->mg_tx_lock);
> +	spin_lock_init(&adapter->mg_rx_lock);
>  	return 0;
>  
>  err_register:
> -- 
> 1.8.4.rc0.1.g8f6a3e5.dirty
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver
  2015-10-21 16:37 ` [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver Lan Tianyu
  2015-10-21 21:48   ` Alexander Duyck
@ 2015-10-22 12:46   ` Michael S. Tsirkin
  1 sibling, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 12:46 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On Thu, Oct 22, 2015 at 12:37:41AM +0800, Lan Tianyu wrote:
> To let the VF driver in the guest know the migration status, Qemu will
> fake PCI config regs 0xF0 and 0xF1 to expose the migration status and
> get an ack from the VF driver.

I guess this works for current devices, but using the
0xF0/0xF1 registers is not architectural, is it?

So it could conflict with future devices.

Maybe it's better to just have a dedicated para-virtualized
device (PCI, ACPI, etc.) for this migration-related activity.
This driver would then register with it.


> When migration starts, Qemu will set reg 0xF0 to 1, notify the VF
> driver by triggering a mailbox msg, and wait for the VF driver to signal
> it's ready for migration (by setting reg 0xF1 to 1).

This waiting for driver is problematic: high load is one of the reasons
people migrate VMs out.  It would be much better if we could support
migration while VM is completely stopped.


> After migration, Qemu
> will set reg 0xF0 to 0 and notify the VF driver by a mailbox irq. The VF
> driver begins to restore tx/rx function after detecting the status change.
> 
> When the VF receives the mailbox irq, it checks reg 0xF0 in the service
> task function to get the migration status and performs the related
> operations according to its value.
> 
> Steps of restarting receive and transmit function
> 1) Restore VF status in the PF driver via sending mail event to PF driver
> 2) Write back reg values recorded by self emulation layer
> 3) Restart rx/tx ring
> 4) Recover interrupts
> 
> Transmit/receive descriptor head regs are read-only, can't be restored
> by writing the recorded reg values back directly, and are set to 0
> during VF reset. To reuse the original tx/rx rings, shift the desc ring
> so that the desc pointed to by the original head reg moves to the first
> entry of the ring, then enable the tx/rx rings. The VF restarts
> receiving and transmitting from the original head desc.
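
The ring shift described in the commit message amounts to a left rotation by `head` plus a rebase of the software indices; a minimal userspace model (plain integers standing in for descriptors, names hypothetical):

```c
#include <stdlib.h>
#include <string.h>

/* Rotate the descriptor array left by `head` so the entry the hardware
 * head register pointed at becomes entry 0, then rebase next_to_use /
 * next_to_clean by the same amount, mirroring the ring-shift logic. */
struct ring {
	unsigned int *desc;
	unsigned int count, next_to_use, next_to_clean;
};

static int ring_shift(struct ring *r, unsigned int head)
{
	unsigned int *tmp = malloc(r->count * sizeof(*tmp));

	if (!tmp)
		return -1;
	memcpy(tmp, r->desc, r->count * sizeof(*tmp));
	memcpy(r->desc, &tmp[head], (r->count - head) * sizeof(*tmp));
	memcpy(&r->desc[r->count - head], tmp, head * sizeof(*tmp));

	r->next_to_clean = (r->next_to_clean + r->count - head) % r->count;
	r->next_to_use   = (r->next_to_use   + r->count - head) % r->count;
	free(tmp);
	return 0;
}
```

After the rotation, writing the ring base address and re-enabling the queue lets the hardware (whose head register restarts at 0) resume exactly where it left off.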
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbevf/defines.h       |   6 ++
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h       |   7 +-
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 ++++++++++++++++++++-
>  .../net/ethernet/intel/ixgbevf/self-emulation.c    | 107 +++++++++++++++++++
>  4 files changed, 232 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h
> index 770e21a..113efd2 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/defines.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
> @@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
>  	__le32 mss_l4len_idx;
>  };
>  
> +union ixgbevf_desc {
> +	union ixgbe_adv_tx_desc rx_desc;
> +	union ixgbe_adv_rx_desc tx_desc;
> +	struct ixgbe_adv_tx_context_desc tx_context_desc;
> +};
> +
>  /* Adv Transmit Descriptor Config Masks */
>  #define IXGBE_ADVTXD_DTYP_MASK	0x00F00000 /* DTYP mask */
>  #define IXGBE_ADVTXD_DTYP_CTXT	0x00200000 /* Advanced Context Desc */
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index c823616..6eab402e 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -109,7 +109,7 @@ struct ixgbevf_ring {
>  	struct ixgbevf_ring *next;
>  	struct net_device *netdev;
>  	struct device *dev;
> -	void *desc;			/* descriptor ring memory */
> +	union ixgbevf_desc *desc;	/* descriptor ring memory */
>  	dma_addr_t dma;			/* phys. address of descriptor ring */
>  	unsigned int size;		/* length in bytes */
>  	u16 count;			/* amount of descriptors */
> @@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector);
>  
>  void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
>  void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
> +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
> +int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
> +void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
> +inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
> +
>  
>  #ifdef DEBUG
>  char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 056841c..15ec361 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network Driver");
>  MODULE_LICENSE("GPL");
>  MODULE_VERSION(DRV_VERSION);
>  
> +
> +#define MIGRATION_COMPLETED   0x00
> +#define MIGRATION_IN_PROGRESS 0x01
> +
>  #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
>  static int debug = -1;
>  module_param(debug, int, 0);
> @@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)
>  	return ring->stats.packets;
>  }
>  
> +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
> +{
> +	struct ixgbevf_tx_buffer *tx_buffer = NULL;
> +	static union ixgbevf_desc *tx_desc = NULL;
> +
> +	tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
> +	if (!tx_buffer)
> +		return -ENOMEM;
> +
> +	tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count);
> +	if (!tx_desc)
> +		return -ENOMEM;
> +
> +	memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
> +	memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
> +	memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
> +
> +	memcpy(tx_buffer, r->tx_buffer_info, sizeof(struct ixgbevf_tx_buffer) * r->count);
> +	memcpy(r->tx_buffer_info, &tx_buffer[head], sizeof(struct ixgbevf_tx_buffer) * (r->count - head));
> +	memcpy(&r->tx_buffer_info[r->count - head], tx_buffer, sizeof(struct ixgbevf_tx_buffer) * head);
> +
> +	if (r->next_to_clean >= head)
> +		r->next_to_clean -= head;
> +	else
> +		r->next_to_clean += (r->count - head);
> +
> +	if (r->next_to_use >= head)
> +		r->next_to_use -= head;
> +	else
> +		r->next_to_use += (r->count - head);
> +
> +	vfree(tx_buffer);
> +	vfree(tx_desc);
> +	return 0;
> +}
> +
> +int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
> +{
> +	struct ixgbevf_rx_buffer *rx_buffer = NULL;
> +	static union ixgbevf_desc *rx_desc = NULL;
> +
> +	rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
> +	if (!rx_buffer)
> +		return -ENOMEM;
> +
> +	rx_desc = vmalloc(sizeof(union ixgbevf_desc) * (r->count));
> +	if (!rx_desc)
> +		return -ENOMEM;
> +
> +	memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
> +	memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
> +	memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head);
> +
> +	memcpy(rx_buffer, r->rx_buffer_info, sizeof(struct ixgbevf_rx_buffer) * (r->count));
> +	memcpy(r->rx_buffer_info, &rx_buffer[head], sizeof(struct ixgbevf_rx_buffer) * (r->count - head));
> +	memcpy(&r->rx_buffer_info[r->count - head], rx_buffer, sizeof(struct ixgbevf_rx_buffer) * head);
> +
> +	if (r->next_to_clean >= head)
> +		r->next_to_clean -= head;
> +	else
> +		r->next_to_clean += (r->count - head);
> +
> +	if (r->next_to_use >= head)
> +		r->next_to_use -= head;
> +	else
> +		r->next_to_use += (r->count - head);
> +
> +	vfree(rx_buffer);
> +	vfree(rx_desc);
> +	return 0;
> +}
> +
>  static u32 ixgbevf_get_tx_pending(struct ixgbevf_ring *ring)
>  {
>  	struct ixgbevf_adapter *adapter = netdev_priv(ring->netdev);
> @@ -1122,7 +1198,7 @@ static int ixgbevf_busy_poll_recv(struct napi_struct *napi)
>   * ixgbevf_configure_msix sets up the hardware to properly generate MSI-X
>   * interrupts.
>   **/
> -static void ixgbevf_configure_msix(struct ixgbevf_adapter *adapter)
> +static  void ixgbevf_configure_msix(struct ixgbevf_adapter *adapter)
>  {
>  	struct ixgbevf_q_vector *q_vector;
>  	int q_vectors, v_idx;
> @@ -1534,7 +1610,7 @@ static inline void ixgbevf_irq_disable(struct ixgbevf_adapter *adapter)
>   * ixgbevf_irq_enable - Enable default interrupt generation settings
>   * @adapter: board private structure
>   **/
> -static inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter)
> +inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter)
>  {
>  	struct ixgbe_hw *hw = &adapter->hw;
>  
> @@ -2901,6 +2977,36 @@ static void ixgbevf_watchdog_subtask(struct ixgbevf_adapter *adapter)
>  	ixgbevf_update_stats(adapter);
>  }
>  
> +int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
> +{
> +	struct pci_dev *pdev = adapter->pdev;
> + 	static int migration_status = MIGRATION_COMPLETED;
> +	u8 val;
> +
> +	if (migration_status == MIGRATION_COMPLETED) {
> +		pci_read_config_byte(pdev, 0xf0, &val);
> +		if (!val)
> +			return 0;
> +
> +		del_timer_sync(&adapter->service_timer);
> +		pr_info("migration start\n");
> +		migration_status = MIGRATION_IN_PROGRESS; 
> +
> +		/* Tell Qemu VF is ready for migration. */
> +		pci_write_config_byte(pdev, 0xf1, 0x1);
> +		return 1;
> +	} else {
> +		pci_read_config_byte(pdev, 0xf0, &val);
> +		if (val)
> +			return 1;
> +
> +		ixgbevf_restore_state(adapter);
> +		migration_status = MIGRATION_COMPLETED;
> +		pr_info("migration end\n");
> +		return 0;
> +	}
> +}
> +
>  /**
>   * ixgbevf_service_task - manages and runs subtasks
>   * @work: pointer to work_struct containing our data
> @@ -2912,6 +3018,11 @@ static void ixgbevf_service_task(struct work_struct *work)
>  						       service_task);
>  	struct ixgbe_hw *hw = &adapter->hw;
>  
> +	if (ixgbevf_live_mg(adapter)) {
> +		ixgbevf_service_event_complete(adapter);
> +		return;
> +	}
> +
>  	if (IXGBE_REMOVED(hw->hw_addr)) {
>  		if (!test_bit(__IXGBEVF_DOWN, &adapter->state)) {
>  			rtnl_lock();
> diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> index d74b2da..4476428 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> @@ -9,6 +9,8 @@
>  
>  static u32 hw_regs[0x4000];
>  
> +#define RESTORE_REG(hw, reg) IXGBE_WRITE_REG(hw, reg, hw_regs[reg])
> +
>  u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
>  {
>  	u32 tmp;
> @@ -24,3 +26,108 @@ void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
>  	hw_regs[(unsigned long)addr] = val;
>  	writel(val, (volatile void __iomem *)(base + addr));
>  }
> +
> +static u32 restore_regs[] = {
> +	IXGBE_VTIVAR(0),
> +	IXGBE_VTIVAR(1),
> +	IXGBE_VTIVAR(2),
> +	IXGBE_VTIVAR(3),
> +	IXGBE_VTIVAR_MISC,
> +	IXGBE_VTEITR(0),
> +	IXGBE_VTEITR(1),
> +	IXGBE_VFPSRTYPE,
> +};
> +
> +void ixgbevf_restore_state(struct ixgbevf_adapter *adapter)
> +{
> +	struct ixgbe_hw *hw = &adapter->hw;
> +	struct ixgbe_mbx_info *mbx = &hw->mbx;
> +	int i;
> +	u32 timeout = IXGBE_VF_INIT_TIMEOUT, rdh, tdh, rxdctl, txdctl;
> +	u32 wait_loop = 10;
> +
> +	/* VF resetting */
> +	IXGBE_WRITE_REG(hw, IXGBE_VFCTRL, IXGBE_CTRL_RST);
> +	IXGBE_WRITE_FLUSH(hw);
> +
> +	while (!mbx->ops.check_for_rst(hw) && timeout) {
> +		timeout--;
> +		udelay(5);
> +	}
> +	if (!timeout)
> +		printk(KERN_ERR "[IXGBEVF] Unable to reset VF.\n");
> +
> +	/* Restore VF status in the PF driver */
> +	hw->mac.ops.notify_resume(hw);
> +
> +	/* Restore reg values */
> +	for (i = 0; i < ARRAY_SIZE(restore_regs); i++)
> +		writel(hw_regs[restore_regs[i]],
> +		       (volatile void __iomem *)(hw->hw_addr + restore_regs[i]));
> +
> +	/* Restoring rx ring */
> +	for (i = 0; i < adapter->num_rx_queues; i++) {
> +		if (hw_regs[IXGBE_VFRXDCTL(i)] & IXGBE_RXDCTL_ENABLE) {
> +			RESTORE_REG(hw, IXGBE_VFRDBAL(i));
> +			RESTORE_REG(hw, IXGBE_VFRDBAH(i));
> +			RESTORE_REG(hw, IXGBE_VFRDLEN(i));
> +			RESTORE_REG(hw, IXGBE_VFDCA_RXCTRL(i));
> +			RESTORE_REG(hw, IXGBE_VFSRRCTL(i));
> +
> +			rdh = adapter->rx_ring[i]->next_to_clean;
> +			while (IXGBEVF_RX_DESC(adapter->rx_ring[i], rdh)->wb.upper.status_error
> +			       & cpu_to_le32(IXGBE_RXD_STAT_DD))
> +				rdh = (rdh + 1) % adapter->rx_ring[i]->count;
> +
> +			ixgbevf_rx_ring_shift(adapter->rx_ring[i], rdh);
> +
> +			wait_loop = 10;
> +			RESTORE_REG(hw, IXGBE_VFRXDCTL(i));
> +			do {
> +				udelay(10);
> +				rxdctl = IXGBE_READ_REG(hw, IXGBE_VFRXDCTL(i));
> +			} while (--wait_loop && !(rxdctl & IXGBE_RXDCTL_ENABLE));
> +
> +			if (!wait_loop)
> +				pr_err("RXDCTL.ENABLE queue %d not cleared while polling\n",
> +				       i);
> +
> +			IXGBE_WRITE_REG(hw, IXGBE_VFRDT(i), adapter->rx_ring[i]->next_to_use);
> +		}
> +	}
> +
> +	/* Restoring tx ring */
> +	for (i = 0; i < adapter->num_tx_queues; i++) {
> +		if (hw_regs[IXGBE_VFTXDCTL(i)] & IXGBE_TXDCTL_ENABLE) {
> +			RESTORE_REG(hw, IXGBE_VFTDBAL(i));
> +			RESTORE_REG(hw, IXGBE_VFTDBAH(i));
> +			RESTORE_REG(hw, IXGBE_VFTDLEN(i));
> +			RESTORE_REG(hw, IXGBE_VFDCA_TXCTRL(i));
> +
> +			tdh = adapter->tx_ring[i]->next_to_clean;
> +			while (IXGBEVF_TX_DESC(adapter->tx_ring[i], tdh)->wb.status
> +			       & cpu_to_le32(IXGBE_TXD_STAT_DD))
> +				tdh = (tdh + 1) % adapter->tx_ring[i]->count;
> +			ixgbevf_tx_ring_shift(adapter->tx_ring[i], tdh);
> +
> +			wait_loop = 10;
> +			RESTORE_REG(hw, IXGBE_VFTXDCTL(i));
> +			do {
> +				udelay(2000);
> +				txdctl = IXGBE_READ_REG(hw, IXGBE_VFTXDCTL(i));
> +			} while (--wait_loop && !(txdctl & IXGBE_TXDCTL_ENABLE));
> +
> +			if (!wait_loop)
> +				pr_err("Could not enable Tx Queue %d\n", i);
> +
> +			IXGBE_WRITE_REG(hw, IXGBE_VFTDT(i), adapter->tx_ring[i]->next_to_use);
> +		}
> +	}
> +
> +	/* Restore irq */
> +	IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, hw_regs[IXGBE_VTEIMS] & 0x7);
> +	IXGBE_WRITE_REG(hw, IXGBE_VTEIMC, (~hw_regs[IXGBE_VTEIMS]) & 0x7);
> +	IXGBE_WRITE_REG(hw, IXGBE_VTEICS, hw_regs[IXGBE_VTEICS]);
> +
> +	ixgbevf_irq_enable(adapter);
> +}
> +
> -- 
> 1.8.4.rc0.1.g8f6a3e5.dirty
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [RFC Patch 06/12] IXGBEVF: Add self emulation layer
  2015-10-21 20:58   ` Alexander Duyck
@ 2015-10-22 12:50     ` Michael S. Tsirkin
  2015-10-22 15:50       ` Alexander Duyck
  0 siblings, 1 reply; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 12:50 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On Wed, Oct 21, 2015 at 01:58:19PM -0700, Alexander Duyck wrote:
> On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> >In order to restore VF function after migration, add self emulation layer
> >to record regs' values during accessing regs.
> >
> >Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> >---
> >  drivers/net/ethernet/intel/ixgbevf/Makefile        |  3 ++-
> >  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
> >  .../net/ethernet/intel/ixgbevf/self-emulation.c    | 26 ++++++++++++++++++++++
> >  drivers/net/ethernet/intel/ixgbevf/vf.h            |  5 ++++-
> >  4 files changed, 33 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> >
> >diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile
> >index 4ce4c97..841c884 100644
> >--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
> >+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
> >@@ -31,7 +31,8 @@
> >  obj-$(CONFIG_IXGBEVF) += ixgbevf.o
> >-ixgbevf-objs := vf.o \
> >+ixgbevf-objs := self-emulation.o \
> >+		vf.o \
> >                  mbx.o \
> >                  ethtool.o \
> >                  ixgbevf_main.o
> >diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> >index a16d267..4446916 100644
> >--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> >+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> >@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
> >  	if (IXGBE_REMOVED(reg_addr))
> >  		return IXGBE_FAILED_READ_REG;
> >-	value = readl(reg_addr + reg);
> >+	value = ixgbe_self_emul_readl(reg_addr, reg);
> >  	if (unlikely(value == IXGBE_FAILED_READ_REG))
> >  		ixgbevf_check_remove(hw, reg);
> >  	return value;
> >diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> >new file mode 100644
> >index 0000000..d74b2da
> >--- /dev/null
> >+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> >@@ -0,0 +1,26 @@
> >+#include <linux/netdevice.h>
> >+#include <linux/pci.h>
> >+#include <linux/delay.h>
> >+#include <linux/interrupt.h>
> >+#include <net/arp.h>
> >+
> >+#include "vf.h"
> >+#include "ixgbevf.h"
> >+
> >+static u32 hw_regs[0x4000];
> >+
> >+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
> >+{
> >+	u32 tmp;
> >+
> >+	tmp = readl(base + addr);
> >+	hw_regs[(unsigned long)addr] = tmp;
> >+
> >+	return tmp;
> >+}
> >+
> >+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
> >+{
> >+	hw_regs[(unsigned long)addr] = val;
> >+	writel(val, (volatile void __iomem *)(base + addr));
> >+}
> 
> So I see what you are doing, however I don't think this adds much value.
> Many of the key registers for the device are not simple Read/Write
> registers.  Most of them are things like write 1 to clear or some other sort
> of value where writing doesn't set the bit but has some other side effect.
> Just take a look through the Datasheet at registers such as the VFCTRL,
> VFMAILBOX, or most of the interrupt registers.  The fact is simply storing
> the values off doesn't give you any real idea of what the state of things
> are.

It doesn't, but I guess the point is to isolate the migration-related logic
in the recovery code.

An alternative would be to have some smart logic all over the place to
only store what's required - that would be much more intrusive.


> >diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h
> >index d40f036..6a3f4eb 100644
> >--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
> >+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
> >@@ -39,6 +39,9 @@
> >  struct ixgbe_hw;
> >+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr);
> >+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr);
> >+
> >  /* iterator type for walking multicast address lists */
> >  typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr,
> >  				  u32 *vmdq);
> >@@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 reg, u32 value)
> >  	if (IXGBE_REMOVED(reg_addr))
> >  		return;
> >-	writel(value, reg_addr + reg);
> >+	ixgbe_self_emul_writel(value, reg_addr, reg);
> >  }
> >  #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
  2015-10-21 20:52   ` Alexander Duyck
@ 2015-10-22 12:51     ` Michael S. Tsirkin
  2015-10-24 15:43     ` Lan, Tianyu
  1 sibling, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 12:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: emil.s.tantilov, kvm, linux-pci, qemu-devel, jesse.brandeburg,
	carolyn.wyborny, donald.c.skidmore, agraf, matthew.vick,
	intel-wired-lan, jeffrey.t.kirsher, yang.z.zhang,
	mitch.a.williams, nrupal.jani, bhelgaas, Lan Tianyu, netdev,
	shannon.nelson, eddie.dong, linux-kernel, john.ronciak, pbonzini

On Wed, Oct 21, 2015 at 01:52:48PM -0700, Alexander Duyck wrote:
> Also have you even considered the MSI-X configuration on the VF?  I haven't
> seen anything anywhere that would have migrated the VF's MSI-X configuration
> from BAR 3 on one system to the new system.

Hypervisors do this for virtual devices so they can do this
for physical devices too.

-- 
MST

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (12 preceding siblings ...)
  2015-10-21 18:45 ` [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Or Gerlitz
@ 2015-10-22 12:55 ` Michael S. Tsirkin
  2015-10-23 18:36 ` Alexander Duyck
  14 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 12:55 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On Thu, Oct 22, 2015 at 12:37:32AM +0800, Lan Tianyu wrote:
> This patchset is to propose a new solution to add live migration support for 82599
> SRIOV network card.
> 
> In our solution, we prefer to put all device-specific operations into the VF and
> PF drivers and make the code in Qemu more general.

Adding code to VF driver makes sense.  However, adding code to PF driver
is problematic: PF and VF run within different environments, you can't
assume PF and VF drivers are the same version.

I guess that would be acceptable if these messages make
it into the official intel spec, along with
hardware registers.

-- 
MST

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
  2015-10-21 16:37 ` [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package Lan Tianyu
  2015-10-21 21:14   ` Alexander Duyck
@ 2015-10-22 12:58   ` Michael S. Tsirkin
  2015-10-24 16:08     ` Lan, Tianyu
  1 sibling, 1 reply; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 12:58 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On Thu, Oct 22, 2015 at 12:37:40AM +0800, Lan Tianyu wrote:
> When transmitting a packet, the end transmit desc of the packet
> indicates whether the packet has been sent. Current code records
> the end desc's pointer in the next_to_watch of the tx buffer struct.
> This code will break if the desc ring is shifted after migration:
> the pointer will be invalid. This patch replaces recording the
> pointer with recording the desc number of the packet, and finds
> the end desc via the first desc and the desc number.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>

Do you really need to play the shifting games?
Can't you just reset everything and re-initialize the rings?
It's slower but way less intrusive.
Also removes the need to track writes into rings.

> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |  1 +
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ++++++++++++++++---
>  2 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index 775d089..c823616 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -54,6 +54,7 @@
>   */
>  struct ixgbevf_tx_buffer {
>  	union ixgbe_adv_tx_desc *next_to_watch;
> +	u16 desc_num;
>  	unsigned long time_stamp;
>  	struct sk_buff *skb;
>  	unsigned int bytecount;
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 4446916..056841c 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct ixgbevf_ring *tx_ring,
>  			       DMA_TO_DEVICE);
>  	}
>  	tx_buffer->next_to_watch = NULL;
> +	tx_buffer->desc_num = 0;
>  	tx_buffer->skb = NULL;
>  	dma_unmap_len_set(tx_buffer, len, 0);
>  	/* tx_buffer must be completely set up in the transmit path */
> @@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>  	union ixgbe_adv_tx_desc *tx_desc;
>  	unsigned int total_bytes = 0, total_packets = 0;
>  	unsigned int budget = tx_ring->count / 2;
> -	unsigned int i = tx_ring->next_to_clean;
> +	int i, watch_index;
>  
>  	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
>  		return true;
> @@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>  	i -= tx_ring->count;
>  
>  	do {
> -		union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
> +		union ixgbe_adv_tx_desc *eop_desc;
> +
> +		if (!tx_buffer->desc_num)
> +			break;
> +
> +		if (i + tx_buffer->desc_num >= 0)
> +			watch_index = i + tx_buffer->desc_num;
> +		else
> +			watch_index = i + tx_ring->count + tx_buffer->desc_num;
>  
> -		/* if next_to_watch is not set then there is no work pending */
> +		eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
>  		if (!eop_desc)
>  			break;
>  
> @@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
>  
>  		/* clear next_to_watch to prevent false hangs */
>  		tx_buffer->next_to_watch = NULL;
> +		tx_buffer->desc_num = 0;
>  
>  		/* update the statistics for this packet */
>  		total_bytes += tx_buffer->bytecount;
> @@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>  	u32 tx_flags = first->tx_flags;
>  	__le32 cmd_type;
>  	u16 i = tx_ring->next_to_use;
> +	u16 start;
>  
>  	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
>  
> @@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>  
>  	/* set next_to_watch value indicating a packet is present */
>  	first->next_to_watch = tx_desc;
> +	start = first - tx_ring->tx_buffer_info;
> +	first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - start;
>  
>  	i++;
>  	if (i == tx_ring->count)
> -- 
> 1.8.4.rc0.1.g8f6a3e5.dirty
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-22 12:32     ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-10-22 13:01       ` Alex Williamson
  2015-10-22 13:06         ` Michael S. Tsirkin
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Williamson @ 2015-10-22 13:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Or Gerlitz, emil.s.tantilov, kvm, linux-pci, qemu-devel,
	Jesse Brandeburg, carolyn.wyborny, Skidmore, Donald C, agraf,
	matthew.vick, intel-wired-lan, Jeff Kirsher, yang.z.zhang,
	Mitch Williams, nrupal.jani, bhelgaas, Lan Tianyu,
	Linux Netdev List, Shannon Nelson, eddie.dong, Linux Kernel,
	john.ronciak, Paolo Bonzini

On Thu, 2015-10-22 at 15:32 +0300, Michael S. Tsirkin wrote:
> On Wed, Oct 21, 2015 at 01:20:27PM -0600, Alex Williamson wrote:
> > The trouble here is that the VF needs to be unplugged prior to the start
> > of migration because we can't do effective dirty page tracking while the
> > device is connected and doing DMA.
> 
> That's exactly what patch 12/12 is trying to accomplish.
> 
> I do see some problems with it, but I also suggested some solutions.

I was replying to:

> So... what would you expect service down wise for the following
> solution which is zero touch and I think should work for any VF
> driver:

And then later note:

"Here it's done via an enlightened guest driver."

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-22 13:01       ` Alex Williamson
@ 2015-10-22 13:06         ` Michael S. Tsirkin
  0 siblings, 0 replies; 56+ messages in thread
From: Michael S. Tsirkin @ 2015-10-22 13:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Or Gerlitz, emil.s.tantilov, kvm, linux-pci, qemu-devel,
	Jesse Brandeburg, carolyn.wyborny, Skidmore, Donald C, agraf,
	matthew.vick, intel-wired-lan, Jeff Kirsher, yang.z.zhang,
	Mitch Williams, nrupal.jani, bhelgaas, Lan Tianyu,
	Linux Netdev List, Shannon Nelson, eddie.dong, Linux Kernel,
	john.ronciak, Paolo Bonzini

On Thu, Oct 22, 2015 at 07:01:01AM -0600, Alex Williamson wrote:
> On Thu, 2015-10-22 at 15:32 +0300, Michael S. Tsirkin wrote:
> > On Wed, Oct 21, 2015 at 01:20:27PM -0600, Alex Williamson wrote:
> > > The trouble here is that the VF needs to be unplugged prior to the start
> > > of migration because we can't do effective dirty page tracking while the
> > > device is connected and doing DMA.
> > 
> > That's exactly what patch 12/12 is trying to accomplish.
> > 
> > I do see some problems with it, but I also suggested some solutions.
> 
> I was replying to:
> 
> > So... what would you expect service down wise for the following
> > solution which is zero touch and I think should work for any VF
> > driver:
> 
> And then later note:
> 
> "Here it's done via an enlightened guest driver."

Oh, I misunderstood your intent. Sorry about that.

So we are actually in agreement between us then. That's nice.

-- 
MST

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Qemu-devel] [RFC Patch 06/12] IXGBEVF: Add self emulation layer
  2015-10-22 12:50     ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-10-22 15:50       ` Alexander Duyck
  0 siblings, 0 replies; 56+ messages in thread
From: Alexander Duyck @ 2015-10-22 15:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/22/2015 05:50 AM, Michael S. Tsirkin wrote:
> On Wed, Oct 21, 2015 at 01:58:19PM -0700, Alexander Duyck wrote:
>> On 10/21/2015 09:37 AM, Lan Tianyu wrote:
>>> In order to restore VF function after migration, add self emulation layer
>>> to record regs' values during accessing regs.
>>>
>>> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
>>> ---
>>>   drivers/net/ethernet/intel/ixgbevf/Makefile        |  3 ++-
>>>   drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
>>>   .../net/ethernet/intel/ixgbevf/self-emulation.c    | 26 ++++++++++++++++++++++
>>>   drivers/net/ethernet/intel/ixgbevf/vf.h            |  5 ++++-
>>>   4 files changed, 33 insertions(+), 3 deletions(-)
>>>   create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c
>>>
>>> diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile
>>> index 4ce4c97..841c884 100644
>>> --- a/drivers/net/ethernet/intel/ixgbevf/Makefile
>>> +++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
>>> @@ -31,7 +31,8 @@
>>>   obj-$(CONFIG_IXGBEVF) += ixgbevf.o
>>> -ixgbevf-objs := vf.o \
>>> +ixgbevf-objs := self-emulation.o \
>>> +		vf.o \
>>>                   mbx.o \
>>>                   ethtool.o \
>>>                   ixgbevf_main.o
>>> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
>>> index a16d267..4446916 100644
>>> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
>>> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
>>> @@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
>>>   	if (IXGBE_REMOVED(reg_addr))
>>>   		return IXGBE_FAILED_READ_REG;
>>> -	value = readl(reg_addr + reg);
>>> +	value = ixgbe_self_emul_readl(reg_addr, reg);
>>>   	if (unlikely(value == IXGBE_FAILED_READ_REG))
>>>   		ixgbevf_check_remove(hw, reg);
>>>   	return value;
>>> diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
>>> new file mode 100644
>>> index 0000000..d74b2da
>>> --- /dev/null
>>> +++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
>>> @@ -0,0 +1,26 @@
>>> +#include <linux/netdevice.h>
>>> +#include <linux/pci.h>
>>> +#include <linux/delay.h>
>>> +#include <linux/interrupt.h>
>>> +#include <net/arp.h>
>>> +
>>> +#include "vf.h"
>>> +#include "ixgbevf.h"
>>> +
>>> +static u32 hw_regs[0x4000];
>>> +
>>> +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
>>> +{
>>> +	u32 tmp;
>>> +
>>> +	tmp = readl(base + addr);
>>> +	hw_regs[(unsigned long)addr] = tmp;
>>> +
>>> +	return tmp;
>>> +}
>>> +
>>> +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
>>> +{
>>> +	hw_regs[(unsigned long)addr] = val;
>>> +	writel(val, (volatile void __iomem *)(base + addr));
>>> +}
>> So I see what you are doing, however I don't think this adds much value.
>> Many of the key registers for the device are not simple Read/Write
>> registers.  Most of them are things like write 1 to clear or some other sort
>> of value where writing doesn't set the bit but has some other side effect.
>> Just take a look through the Datasheet at registers such as the VFCTRL,
>> VFMAILBOX, or most of the interrupt registers.  The fact is simply storing
>> the values off doesn't give you any real idea of what the state of things
>> are.
> It doesn't, but I guess the point is to isolate the migration-related logic
> in the recovery code.
>
> An alternative would be to have some smart logic all over the place to
> only store what's required - that would be much more intrusive.

After reviewing all of the patches yesterday I would say that almost all 
the values being stored aren't needed.  They can be restored from the 
settings of the driver itself anyway.  Copying the values out doesn't make 
much sense here since there are already enough caches for almost all of 
this data.

- Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-21 19:20   ` Alex Williamson
  2015-10-21 23:26     ` Alexander Duyck
  2015-10-22 12:32     ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-10-22 15:58     ` Or Gerlitz
  2015-10-22 16:17       ` Alex Williamson
  2 siblings, 1 reply; 56+ messages in thread
From: Or Gerlitz @ 2015-10-22 15:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Lan Tianyu,
	Michael S. Tsirkin <mst@redhat.com> (mst@redhat.com),
	Bjorn Helgaas, carolyn.wyborny, Skidmore, Donald C, eddie.dong,
	nrupal.jani, yang.z.zhang, Alexander Graf, kvm, Paolo Bonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, Jeff Kirsher,
	Jesse Brandeburg, john.ronciak, Linux Kernel, linux-pci,
	matthew.vick, Mitch Williams, Linux Netdev List, Shannon Nelson

On Wed, Oct 21, 2015 at 10:20 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:

> This is why the typical VF-agnostic approach here is to use bonding
> and fail over to an emulated device during migration, so performance
> suffers, but downtime is something acceptable.

Bonding in the VM isn't a zero-touch solution, right? Is it really acceptable?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-22 15:58     ` Or Gerlitz
@ 2015-10-22 16:17       ` Alex Williamson
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Williamson @ 2015-10-22 16:17 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Lan Tianyu,
	Michael S. Tsirkin <mst@redhat.com> (mst@redhat.com),
	Bjorn Helgaas, carolyn.wyborny, Skidmore, Donald C, eddie.dong,
	nrupal.jani, yang.z.zhang, Alexander Graf, kvm, Paolo Bonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, Jeff Kirsher,
	Jesse Brandeburg, john.ronciak, Linux Kernel, linux-pci,
	matthew.vick, Mitch Williams, Linux Netdev List, Shannon Nelson

On Thu, 2015-10-22 at 18:58 +0300, Or Gerlitz wrote:
> On Wed, Oct 21, 2015 at 10:20 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> 
> > This is why the typical VF-agnostic approach here is to use bonding
> > and fail over to an emulated device during migration, so performance
> > suffers, but downtime is something acceptable.
> 
> bonding in the VM isn't a zero touch solution, right? is it really acceptable?

The bonding solution requires configuring the bond in the guest and
doing the hot unplug/re-plug around migration.  It's zero touch in that
it works on current code with any PF/VF, but it's certainly not zero
configuration in the guest.  Is what acceptable?  The configuration?
The performance?  The downtime?  I don't think we can hope to improve on
the downtime of an emulated device, but obviously the configuration and
performance are not always acceptable or we wouldn't be seeing so many
people working on migration of assigned devices.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
                   ` (13 preceding siblings ...)
  2015-10-22 12:55 ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-10-23 18:36 ` Alexander Duyck
  2015-10-23 19:05   ` Alex Williamson
  2015-10-26  5:36   ` Lan Tianyu
  14 siblings, 2 replies; 56+ messages in thread
From: Alexander Duyck @ 2015-10-23 18:36 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> This patchset is to propose a new solution to add live migration support for 82599
> SRIOV network card.
>
> In our solution, we prefer to put all device-specific operations into the VF and
> PF drivers and make the code in Qemu more general.
>
>
> VF status migration
> =================================================================
> VF status can be divided into 4 parts
> 1) PCI configure regs
> 2) MSIX configure
> 3) VF status in the PF driver
> 4) VF MMIO regs
>
> > The first three parts are all handled by Qemu.
> The PCI configure space regs and MSIX configure are originally
> stored in Qemu. To save and restore "VF status in the PF driver"
> by Qemu during migration, adds new sysfs node "state_in_pf" under
> VF sysfs directory.
>
> For VF MMIO regs, we introduce self emulation layer in the VF
> driver to record MMIO reg values during reading or writing MMIO
> and put these data in the guest memory. It will be migrated with
> guest memory to new machine.
>
>
> VF function restoration
> ================================================================
> Restoring VF function operation are done in the VF and PF driver.
>
> > In order to let the VF driver know the migration status, Qemu fakes VF
> PCI configure regs to indicate migration status and add new sysfs
> node "notify_vf" to trigger VF mailbox irq in order to notify VF
> about migration status change.
>
> Transmit/Receive descriptor head regs are read-only and can't
> > be restored by writing back the recorded reg value directly, and they
> are set to 0 during VF reset. To reuse original tx/rx rings, shift
> desc ring in order to move the desc pointed by original head reg to
> first entry of the ring and then enable tx/rx rings. VF restarts to
> receive and transmit from original head desc.
>
>
> Tracking DMA accessed memory
> =================================================================
> Migration relies on tracking dirty page to migrate memory.
> Hardware can't automatically mark a page as dirty after DMA
> memory access. VF descriptor rings and data buffers are modified
> > by hardware when receiving and transmitting data. To track such dirty memory
> > manually, do dummy writes (read a byte and write it back) when receiving
> > and transmitting data.

I was thinking about it and I am pretty sure the dummy write approach is 
problematic at best.  Specifically the issue is that while you are 
performing a dummy write you risk pulling in descriptors for data that 
hasn't been dummy written to yet.  So when you resume and restore your 
descriptors you will have Rx descriptors that may indicate they 
contain data when, after the migration, they don't.

I really think the best approach to take would be to look at 
implementing an emulated IOMMU so that you could track DMA mapped pages 
and avoid migrating the ones marked as DMA_FROM_DEVICE until they are 
unmapped.  The advantage to this is that in the case of the ixgbevf 
driver it now reuses the same pages for Rx DMA.  As a result it will be 
rewriting the same pages often and if you are marking those pages as 
dirty and transitioning them it is possible for a flow of small packets 
to really make a mess of things since you would be rewriting the same 
pages in a loop while the device is processing packets.

Beyond that I would say you could suspend/resume the device in order to 
get it to stop and flush the descriptor rings and any outstanding 
packets.  The code for suspend would unmap the DMA memory which would 
then be the trigger to flush it across in the migration, and the resume 
code would take care of any state restoration needed beyond any values 
that can be configured with the ip link command.

If you wanted to do a proof of concept of this you could probably do so 
with very little overhead.  Basically you would need the "page_addr" 
portion of patch 12 to emulate a slightly migration aware DMA API, and 
then beyond that you would need something like patch 9 but instead of 
adding new functions and API you would be switching things on and off 
via the ixgbevf_suspend/resume calls.

- Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-23 18:36 ` Alexander Duyck
@ 2015-10-23 19:05   ` Alex Williamson
  2015-10-23 20:01     ` Alexander Duyck
  2015-10-26  5:36   ` Lan Tianyu
  1 sibling, 1 reply; 56+ messages in thread
From: Alex Williamson @ 2015-10-23 19:05 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: emil.s.tantilov, kvm, linux-pci, qemu-devel, jesse.brandeburg,
	carolyn.wyborny, donald.c.skidmore, agraf, matthew.vick,
	intel-wired-lan, jeffrey.t.kirsher, yang.z.zhang,
	mitch.a.williams, nrupal.jani, bhelgaas, Lan Tianyu, netdev,
	shannon.nelson, eddie.dong, linux-kernel, john.ronciak, pbonzini

On Fri, 2015-10-23 at 11:36 -0700, Alexander Duyck wrote:
> On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> > This patchset is to propose a new solution to add live migration support for 82599
> > SRIOV network card.
> >
> > In our solution, we prefer to put all device specific operation into VF and
> > PF driver and make code in the Qemu more general.
> >
> >
> > VF status migration
> > =================================================================
> > VF status can be divided into 4 parts
> > 1) PCI configure regs
> > 2) MSIX configure
> > 3) VF status in the PF driver
> > 4) VF MMIO regs
> >
> > The first three status are all handled by Qemu.
> > The PCI configure space regs and MSIX configure are originally
> > stored in Qemu. To save and restore "VF status in the PF driver"
> > by Qemu during migration, adds new sysfs node "state_in_pf" under
> > VF sysfs directory.
> >
> > For VF MMIO regs, we introduce self emulation layer in the VF
> > driver to record MMIO reg values during reading or writing MMIO
> > and put these data in the guest memory. It will be migrated with
> > guest memory to new machine.
> >
> >
> > VF function restoration
> > ================================================================
> > Restoring VF function operation are done in the VF and PF driver.
> >
> > In order to let VF driver to know migration status, Qemu fakes VF
> > PCI configure regs to indicate migration status and add new sysfs
> > node "notify_vf" to trigger VF mailbox irq in order to notify VF
> > about migration status change.
> >
> > Transmit/Receive descriptor head regs are read-only and can't
> > be restored via writing back recording reg value directly and they
> > are set to 0 during VF reset. To reuse original tx/rx rings, shift
> > desc ring in order to move the desc pointed by original head reg to
> > first entry of the ring and then enable tx/rx rings. VF restarts to
> > receive and transmit from original head desc.
> >
> >
> > Tracking DMA accessed memory
> > =================================================================
> > Migration relies on tracking dirty page to migrate memory.
> > Hardware can't automatically mark a page as dirty after DMA
> > memory access. VF descriptor rings and data buffers are modified
> > by hardware when receive and transmit data. To track such dirty memory
> > manually, do dummy writes(read a byte and write it back) when receive
> > and transmit data.
> 
> I was thinking about it and I am pretty sure the dummy write approach is 
> problematic at best.  Specifically the issue is that while you are 
> performing a dummy write you risk pulling in descriptors for data that 
> hasn't been dummy written to yet.  So when you resume and restore your 
> descriptors you will have ones that may contain Rx descriptors 
> indicating they contain data when after the migration they don't.
> 
> I really think the best approach to take would be to look at 
> implementing an emulated IOMMU so that you could track DMA mapped pages 
> and avoid migrating the ones marked as DMA_FROM_DEVICE until they are 
> unmapped.  The advantage to this is that in the case of the ixgbevf 
> driver it now reuses the same pages for Rx DMA.  As a result it will be 
> rewriting the same pages often and if you are marking those pages as 
> dirty and transitioning them it is possible for a flow of small packets 
> to really make a mess of things since you would be rewriting the same 
> pages in a loop while the device is processing packets.

I'd be concerned that an emulated IOMMU on the DMA path would reduce
throughput to the point where we shouldn't even bother with assigning
the device in the first place and should be using virtio-net instead.
POWER systems have a guest visible IOMMU and it's been challenging for
them to get to 10Gbps, requiring real-mode tricks.  virtio-net may add
some latency, but it's not that hard to get it to 10Gbps and it already
supports migration.  An emulated IOMMU in the guest is really only good
for relatively static mappings, the latency for anything else is likely
too high.  Maybe there are shadow page table tricks that could help, but
it's imposing overhead the whole time the guest is running, not only on
migration.  Thanks,

Alex

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-23 19:05   ` Alex Williamson
@ 2015-10-23 20:01     ` Alexander Duyck
  0 siblings, 0 replies; 56+ messages in thread
From: Alexander Duyck @ 2015-10-23 20:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/23/2015 12:05 PM, Alex Williamson wrote:
> On Fri, 2015-10-23 at 11:36 -0700, Alexander Duyck wrote:
>> On 10/21/2015 09:37 AM, Lan Tianyu wrote:
>>> This patchset is to propose a new solution to add live migration support for 82599
>>> SRIOV network card.
>>>
>>> In our solution, we prefer to put all device specific operation into VF and
>>> PF driver and make code in the Qemu more general.
>>>
>>>
>>> VF status migration
>>> =================================================================
>>> VF status can be divided into 4 parts
>>> 1) PCI configure regs
>>> 2) MSIX configure
>>> 3) VF status in the PF driver
>>> 4) VF MMIO regs
>>>
>>> The first three status are all handled by Qemu.
>>> The PCI configure space regs and MSIX configure are originally
>>> stored in Qemu. To save and restore "VF status in the PF driver"
>>> by Qemu during migration, adds new sysfs node "state_in_pf" under
>>> VF sysfs directory.
>>>
>>> For VF MMIO regs, we introduce self emulation layer in the VF
>>> driver to record MMIO reg values during reading or writing MMIO
>>> and put these data in the guest memory. It will be migrated with
>>> guest memory to new machine.
>>>
>>>
>>> VF function restoration
>>> ================================================================
>>> Restoring VF function operation are done in the VF and PF driver.
>>>
>>> In order to let VF driver to know migration status, Qemu fakes VF
>>> PCI configure regs to indicate migration status and add new sysfs
>>> node "notify_vf" to trigger VF mailbox irq in order to notify VF
>>> about migration status change.
>>>
>>> Transmit/Receive descriptor head regs are read-only and can't
>>> be restored via writing back recording reg value directly and they
>>> are set to 0 during VF reset. To reuse original tx/rx rings, shift
>>> desc ring in order to move the desc pointed by original head reg to
>>> first entry of the ring and then enable tx/rx rings. VF restarts to
>>> receive and transmit from original head desc.
>>>
>>>
>>> Tracking DMA accessed memory
>>> =================================================================
>>> Migration relies on tracking dirty page to migrate memory.
>>> Hardware can't automatically mark a page as dirty after DMA
>>> memory access. VF descriptor rings and data buffers are modified
>>> by hardware when receive and transmit data. To track such dirty memory
>>> manually, do dummy writes(read a byte and write it back) when receive
>>> and transmit data.
>>
>> I was thinking about it and I am pretty sure the dummy write approach is
>> problematic at best.  Specifically the issue is that while you are
>> performing a dummy write you risk pulling in descriptors for data that
>> hasn't been dummy written to yet.  So when you resume and restore your
>> descriptors you will have ones that may contain Rx descriptors
>> indicating they contain data when after the migration they don't.
>>
>> I really think the best approach to take would be to look at
>> implementing an emulated IOMMU so that you could track DMA mapped pages
>> and avoid migrating the ones marked as DMA_FROM_DEVICE until they are
>> unmapped.  The advantage to this is that in the case of the ixgbevf
>> driver it now reuses the same pages for Rx DMA.  As a result it will be
>> rewriting the same pages often and if you are marking those pages as
>> dirty and transitioning them it is possible for a flow of small packets
>> to really make a mess of things since you would be rewriting the same
>> pages in a loop while the device is processing packets.
>
> I'd be concerned that an emulated IOMMU on the DMA path would reduce
> throughput to the point where we shouldn't even bother with assigning
> the device in the first place and should be using virtio-net instead.
> POWER systems have a guest visible IOMMU and it's been challenging for
> them to get to 10Gbps, requiring real-mode tricks.  virtio-net may add
> some latency, but it's not that hard to get it to 10Gbps and it already
> supports migration.  An emulated IOMMU in the guest is really only good
> for relatively static mappings, the latency for anything else is likely
> too high.  Maybe there are shadow page table tricks that could help, but
> it's imposing overhead the whole time the guest is running, not only on
> migration.  Thanks,
>

The big overhead I have seen with IOMMU implementations is the fact that 
they almost always have some sort of locked table or tree that prevents 
multiple CPUs from accessing resources in any kind of timely fashion. 
As a result, things like Tx are usually slowed down for network workloads 
when multiple CPUs are enabled.

I admit doing a guest visible IOMMU would probably add some overhead, 
but this current patch set as implemented already has some of the hints 
of that, as the descriptor rings are locked, which means we cannot unmap 
in the Tx clean-up while we are mapping on another Tx queue, for instance.

One approach for this would be to implement or extend a lightweight DMA 
API such as swiotlb or nommu.  The code would need to have a bit in 
there so it can take care of marking the pages as dirty on sync_for_cpu 
and unmap calls when set for BIDIRECTIONAL or FROM_DEVICE.  Then if we 
could somehow have some mechanism for the hypervisor to tell us when the 
feature is needed or not we could probably drop the overhead for page 
dirtying as well.  That was why I even mentioned IOMMU, but the fact is 
all we really need is some means of tracking if we should be marking the 
pages as dirty or not.
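
That on/off idea could look roughly like the following toy model 
(illustrative names only, not the real DMA API): sync_for_cpu and unmap 
dirty the page only for FROM_DEVICE/BIDIRECTIONAL mappings, and only 
while a hypervisor-provided flag says migration tracking is active.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a swiotlb/nommu-style DMA layer that dirties pages on
 * sync_for_cpu/unmap, gated by a flag the hypervisor would flip when
 * migration begins.  Names are illustrative, not a real kernel API. */
enum dma_dir { DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_BIDIRECTIONAL };

static bool migration_tracking;	/* flipped by a hypervisor signal */
static int  pages_dirtied;	/* stand-in for a real dirty log */

static void mark_page_dirty(unsigned long pfn)
{
	(void)pfn;
	pages_dirtied++;
}

static void dma_sync_for_cpu(unsigned long pfn, enum dma_dir dir)
{
	/* Only device-writable mappings can have dirtied the page. */
	if (migration_tracking &&
	    (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
		mark_page_dirty(pfn);
}

static void dma_unmap(unsigned long pfn, enum dma_dir dir)
{
	dma_sync_for_cpu(pfn, dir);	/* unmap implies a final sync */
}
```

Outside of a migration window the flag stays clear, so the fast path 
pays almost nothing.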

- Alex

* Re: [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device
  2015-10-21 18:07   ` Alexander Duyck
@ 2015-10-24 14:46     ` Lan, Tianyu
  0 siblings, 0 replies; 56+ messages in thread
From: Lan, Tianyu @ 2015-10-24 14:46 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson



On 10/22/2015 2:07 AM, Alexander Duyck wrote:
> On 10/21/2015 09:37 AM, Lan Tianyu wrote:
>> Add "virtfn_index" member in the struct pci_device to record VF sequence
>> of PF. This will be used in the VF sysfs node handle.
>>
>> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
>> ---
>>   drivers/pci/iov.c   | 1 +
>>   include/linux/pci.h | 1 +
>>   2 files changed, 2 insertions(+)
>>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index ee0ebff..065b6bb 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id,
>> int reset)
>>       virtfn->physfn = pci_dev_get(dev);
>>       virtfn->is_virtfn = 1;
>>       virtfn->multifunction = 0;
>> +    virtfn->virtfn_index = id;
>>       for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>>           res = &dev->resource[i + PCI_IOV_RESOURCES];
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 353db8d..85c5531 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -356,6 +356,7 @@ struct pci_dev {
>>       unsigned int    io_window_1k:1;    /* Intel P2P bridge 1K I/O
>> windows */
>>       unsigned int    irq_managed:1;
>>       pci_dev_flags_t dev_flags;
>> +    unsigned int    virtfn_index;
>>       atomic_t    enable_cnt;    /* pci_enable_device has been called */
>>       u32        saved_config_space[16]; /* config space saved at
>> suspend time */
>>
>
> Can't you just calculate the VF index based on the VF BDF number
> combined with the information in the PF BDF number and VF
> offset/stride?  Seems kind of pointless to add a variable that is only
> used by one driver and is in a slowpath when you can just calculate it
> pretty quickly.

Good suggestion. Will try it.
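
For reference, the calculation being suggested follows from the SR-IOV 
spec: VF i has routing ID rid(PF) + First VF Offset + i * VF Stride 
(16-bit arithmetic, bus number in the high byte, devfn in the low byte), 
so the index can be recovered rather than stored. A standalone sketch 
(the numbers below are examples only):

```c
#include <assert.h>
#include <stdint.h>

/* Build a PCI routing ID from bus number and devfn. */
static uint16_t rid(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)((uint16_t)bus << 8 | devfn);
}

/* Invert the SR-IOV routing-ID formula:
 *   rid(VF i) = rid(PF) + First VF Offset + i * VF Stride
 * The uint16_t cast keeps the subtraction modulo 2^16, matching the
 * spec's 16-bit arithmetic when VFs land on higher bus numbers. */
static int vf_index(uint16_t pf_rid, uint16_t vf_rid,
		    uint16_t offset, uint16_t stride)
{
	return (uint16_t)(vf_rid - pf_rid - offset) / stride;
}
```

In a driver, offset and stride would come from the PF's SR-IOV 
capability (PCI_SRIOV_VF_OFFSET / PCI_SRIOV_VF_STRIDE).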

>
> - Alex

* Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
  2015-10-21 20:52   ` Alexander Duyck
  2015-10-22 12:51     ` Michael S. Tsirkin
@ 2015-10-24 15:43     ` Lan, Tianyu
  2015-10-25  6:03       ` Alexander Duyck
  1 sibling, 1 reply; 56+ messages in thread
From: Lan, Tianyu @ 2015-10-24 15:43 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson


On 10/22/2015 4:52 AM, Alexander Duyck wrote:
> Also have you even considered the MSI-X configuration on the VF?  I
> haven't seen anything anywhere that would have migrated the VF's MSI-X
> configuration from BAR 3 on one system to the new system.

MSI-X migration is done by the hypervisor (Qemu).
The following link is my Qemu patch to do that.
http://marc.info/?l=kvm&m=144544706530484&w=2

* Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
  2015-10-22 12:58   ` Michael S. Tsirkin
@ 2015-10-24 16:08     ` Lan, Tianyu
  0 siblings, 0 replies; 56+ messages in thread
From: Lan, Tianyu @ 2015-10-24 16:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: bhelgaas, carolyn.wyborny, donald.c.skidmore, eddie.dong,
	nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini, qemu-devel,
	emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson



On 10/22/2015 8:58 PM, Michael S. Tsirkin wrote:
> Do you really need to play the shifting games?
> Can't you just reset everything and re-initialize the rings?
> It's slower but way less intrusive.
> Also removes the need to track writes into rings.

Shifting the ring is to avoid losing the packets still in the ring.
This may cause some race conditions, so I introduced a
lock in a later patch to prevent such cases.
Yes, resetting everything after migration would make things easy.
But, just like you said, it would affect performance and lose
more packets. I can run a test later to get data on these
two approaches.
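
As a rough illustration of the shift being discussed (descriptors 
modeled as plain ints, not the actual ixgbevf code): rotate the ring so 
the entry the old read-only head register pointed at becomes entry 0, 
matching the head value of 0 after VF reset.

```c
#include <assert.h>
#include <string.h>

#define RING_SIZE 8

/* Rotate the descriptor ring left by old_head entries, so the
 * descriptor the pre-migration head register pointed at lands in
 * slot 0.  After VF reset the hardware head is 0, so processing
 * resumes exactly where it left off. */
static void shift_ring(int *ring, int old_head)
{
	int tmp[RING_SIZE];
	int i;

	for (i = 0; i < RING_SIZE; i++)
		tmp[i] = ring[(old_head + i) % RING_SIZE];
	memcpy(ring, tmp, sizeof(tmp));
}
```

The real code would move hardware descriptors (and the matching 
software bookkeeping arrays) rather than ints, but the rotation is the 
same.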

* Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
  2015-10-21 21:14   ` Alexander Duyck
@ 2015-10-24 16:12     ` Lan, Tianyu
  0 siblings, 0 replies; 56+ messages in thread
From: Lan, Tianyu @ 2015-10-24 16:12 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson


On 10/22/2015 5:14 AM, Alexander Duyck wrote:
> Where is i being initialized?  It was here but you removed it.  Are you
> using i without initializing it?

Sorry, the initialization was put into patch 10 by mistake. "i" is
assigned "tx_ring->next_to_clean".

* Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
  2015-10-24 15:43     ` Lan, Tianyu
@ 2015-10-25  6:03       ` Alexander Duyck
  2015-10-25  6:45         ` Lan, Tianyu
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Duyck @ 2015-10-25  6:03 UTC (permalink / raw)
  To: Lan, Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/24/2015 08:43 AM, Lan, Tianyu wrote:
>
> On 10/22/2015 4:52 AM, Alexander Duyck wrote:
>> Also have you even considered the MSI-X configuration on the VF?  I
>> haven't seen anything anywhere that would have migrated the VF's MSI-X
>> configuration from BAR 3 on one system to the new system.
>
> MSI-X migration is done by Hypervisor(Qemu).
> Following link is my Qemu patch to do that.
> http://marc.info/?l=kvm&m=144544706530484&w=2

I really don't like the idea of trying to migrate the MSI-X across from 
host to host while it is still active.  I really think Qemu shouldn't be 
moving this kind of data over in a migration.

I think that having the VF do a suspend/resume is the best way to go.  
Then it simplifies things as all you have to deal with is the dirty page 
tracking for the Rx DMA and you should be able to do this without making 
things too difficult.

- Alex

* Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
  2015-10-25  6:03       ` Alexander Duyck
@ 2015-10-25  6:45         ` Lan, Tianyu
  0 siblings, 0 replies; 56+ messages in thread
From: Lan, Tianyu @ 2015-10-25  6:45 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson



On 10/25/2015 2:03 PM, Alexander Duyck wrote:
> On 10/24/2015 08:43 AM, Lan, Tianyu wrote:
>>
>> On 10/22/2015 4:52 AM, Alexander Duyck wrote:
>>> Also have you even considered the MSI-X configuration on the VF?  I
>>> haven't seen anything anywhere that would have migrated the VF's MSI-X
>>> configuration from BAR 3 on one system to the new system.
>>
>> MSI-X migration is done by Hypervisor(Qemu).
>> Following link is my Qemu patch to do that.
>> http://marc.info/?l=kvm&m=144544706530484&w=2
>
> I really don't like the idea of trying to migrate the MSI-X across from
> host to host while it is still active.  I really think Qemu shouldn't be
> moving this kind of data over in a migration.

Hi Alex:

The VF MSI-X regs in the VM are emulated by Qemu, and Qemu maps the
VF's host vectors to the guest's vectors. The MSI-X data migrated is for
the emulated regs rather than the ones on the host. After migration,
Qemu will remap the guest vectors to host vectors on the new machine.
Moreover, the VM is stopped while the MSI-X data is migrated.


>
> I think that having the VF do a suspend/resume is the best way to go.
> Then it simplifies things as all you have to deal with is the dirty page
> tracking for the Rx DMA and you should be able to do this without making
> things too difficult.
>

Yes, that will be simple; the main concern is service downtime. I will
test it later.


> - Alex

* Re: [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver
  2015-10-21 20:45   ` Alexander Duyck
@ 2015-10-25  7:21     ` Lan, Tianyu
  0 siblings, 0 replies; 56+ messages in thread
From: Lan, Tianyu @ 2015-10-25  7:21 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson



On 10/22/2015 4:45 AM, Alexander Duyck wrote:
>> +    /* Record states hold by PF */
>> +    memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct
>> vf_data_storage));
>> +
>> +    vf_shift = vfn % 32;
>> +    reg_offset = vfn / 32;
>> +
>> +    reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
>> +    reg &= ~(1 << vf_shift);
>> +    IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
>> +
>> +    reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
>> +    reg &= ~(1 << vf_shift);
>> +    IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
>> +
>> +    reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
>> +    reg &= ~(1 << vf_shift);
>> +    IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
>> +
>> +    return sizeof(struct state_in_pf);
>> +}
>> +
>
> This is a read.  Why does it need to switch off the VF?  Also why turn
> of the anti-spoof, it doesn't make much sense.

This is to prevent packets targeted at the VM from being delivered to
the original VF after migration. E.g., after migration, the VM pings
the PF of the original machine, and the ping reply packet would be
forwarded to the original VF if it weren't disabled.

BTW, the read is done when the VM has been stopped on the source machine.


>
>> +static ssize_t ixgbe_store_state_in_pf(struct device *dev,
>> +                       struct device_attribute *attr,
>> +                       const char *buf, size_t count)
>> +{
>> +    struct ixgbe_adapter *adapter = to_adapter(dev);
>> +    struct pci_dev *pdev = adapter->pdev, *vdev;
>> +    struct pci_dev *vf_pdev = to_pci_dev(dev);
>> +    struct state_in_pf *state = (struct state_in_pf *)buf;
>> +    int vfn = vf_pdev->virtfn_index;
>> +
>> +    /* Check struct size */
>> +    if (count != sizeof(struct state_in_pf)) {
>> +        printk(KERN_ERR "State in PF size does not fit.\n");
>> +        goto out;
>> +    }
>> +
>> +    /* Restore PCI configurations */
>> +    vdev = ixgbe_get_virtfn_dev(pdev, vfn);
>> +    if (vdev) {
>> +        pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND,
>> state->command);
>> +        pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC,
>> state->msix_message_control);
>> +    }
>> +
>> +    /* Restore states hold by PF */
>> +    memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct
>> vf_data_storage));
>> +
>> +  out:
>> +    return count;
>> +}
>
> Just doing a memcpy to move the vfinfo over adds no value.  The fact is
> there are a number of filters that have to be configured in hardware
> after, and it isn't as simple as just migrating the values stored.

Restoring VF status in the PF is triggered by the VF driver via a new
mailbox msg, which calls ixgbe_restore_setting(). Here we just copy the
data into vfinfo. If the hardware were configured early, the state would
be cleared by the FLR which is triggered by the restore operation in the
VF driver.


>  As I
> mentioned in the case of the 82598 there is also jumbo frames to take
> into account.  If the first PF didn't have it enabled, but the second
> one does that implies the state of the VF needs to change to account for
> that.

Yes, that will be a problem; the VF driver also needs to know about this
change after migration and reconfigure jumbo frames.

>
> I really think you would be better off only migrating the data related
> to what can be configured using the ip link command and leaving other
> values such as clear_to_send at the reset value of 0. Then you can at
> least restore state from the VF after just a couple of quick messages.

This sounds good. I will try it later.

>
>> +static struct device_attribute ixgbe_per_state_in_pf_attribute =
>> +    __ATTR(state_in_pf, S_IRUGO | S_IWUSR,
>> +        ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
>> +
>> +void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
>> +{
>> +    struct pci_dev *pdev = adapter->pdev;
>> +    struct pci_dev *vfdev;
>> +    unsigned short vf_id;
>> +    int pos, ret;
>> +
>> +    pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
>> +    if (!pos)
>> +        return;
>> +
>> +    /* get the device ID for the VF */
>> +    pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
>> +
>> +    vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
>> +
>> +    while (vfdev) {
>> +        if (vfdev->is_virtfn) {
>> +            ret = device_create_file(&vfdev->dev,
>> +                    &ixgbe_per_state_in_pf_attribute);
>> +            if (ret)
>> +                pr_warn("Unable to add VF attribute for dev %s,\n",
>> +                    dev_name(&vfdev->dev));
>> +        }
>> +
>> +        vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);
>> +    }
>> +}
>
> Driver specific sysfs is a no-go.  Otherwise we will end up with a
> different implementation of this for every driver.  You will need to
> find a way to make this generic in order to have a hope of getting this
> to be acceptable.

Yes, Alex Williamson proposed getting/putting the data via the VFIO
interface, which would be more general. I will do more research on how
the PF driver and the VFIO subsystem can communicate.

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-23 18:36 ` Alexander Duyck
  2015-10-23 19:05   ` Alex Williamson
@ 2015-10-26  5:36   ` Lan Tianyu
  2015-10-26 15:03     ` Alexander Duyck
  1 sibling, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-26  5:36 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 2015年10月24日 02:36, Alexander Duyck wrote:
> I was thinking about it and I am pretty sure the dummy write approach is
> problematic at best.  Specifically the issue is that while you are
> performing a dummy write you risk pulling in descriptors for data that
> hasn't been dummy written to yet.  So when you resume and restore your
> descriptors you will have ones that may contain Rx descriptors
> indicating they contain data when after the migration they don't.

How about changing the sequence: dummy write the Rx packet data first
and then its desc? This can ensure that the Rx data is migrated before
its desc and prevent such a case.

-- 
Best regards
Tianyu Lan

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-26  5:36   ` Lan Tianyu
@ 2015-10-26 15:03     ` Alexander Duyck
  2015-10-29  6:12       ` Lan Tianyu
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Duyck @ 2015-10-26 15:03 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/25/2015 10:36 PM, Lan Tianyu wrote:
> On 2015年10月24日 02:36, Alexander Duyck wrote:
>> I was thinking about it and I am pretty sure the dummy write approach is
>> problematic at best.  Specifically the issue is that while you are
>> performing a dummy write you risk pulling in descriptors for data that
>> hasn't been dummy written to yet.  So when you resume and restore your
>> descriptors you will have ones that may contain Rx descriptors
>> indicating they contain data when after the migration they don't.
> How about changing sequence? dummy writing Rx packet data first and then
> its desc. This can ensure that RX data is migrated before its desc and
> prevent such case.

No.  I think you are missing the fact that there are 256 descriptors per 
page.  As such, if you dirty just 1 you will be pulling in 255 more, for 
which you may or may not have pulled in the receive buffers.

So for example if you have the descriptor ring size set to 256 then that 
means you are going to get whatever the descriptor ring has since you 
will be marking the entire ring dirty with every packet processed, 
however you cannot guarantee that you are going to get all of the 
receive buffers unless you go through and flush the entire ring prior to 
migrating.
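
The arithmetic here: ixgbe-family Rx descriptors are 16 bytes, so a 
4 KiB page holds 256 of them, and dirtying one descriptor drags in up 
to 255 neighbours. A trivial check (the helper name is illustrative):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define DESC_SIZE 16UL	/* ixgbe legacy/advanced Rx descriptor size */

/* Which page a given descriptor index lands in: all 256 descriptors of
 * a page-aligned ring segment share one page, so dirty-logging is
 * per-page, never per-descriptor. */
static unsigned long desc_page(unsigned long ring_base, int idx)
{
	return (ring_base + (unsigned long)idx * DESC_SIZE) / PAGE_SIZE;
}
```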

This is why I have said you will need to do something to force the rings 
to be flushed such as initiating a PM suspend prior to migrating.  You 
need to do something to stop the DMA and flush the remaining Rx buffers 
if you want to have any hope of being able to migrate the Rx in a 
consistent state.  Beyond that the only other thing you have to worry 
about are the Rx buffers that have already been handed off to the 
stack.  However those should be handled if you do a suspend and somehow 
flag pages as dirty when they are unmapped from the DMA.

- Alex

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-26 15:03     ` Alexander Duyck
@ 2015-10-29  6:12       ` Lan Tianyu
  2015-10-29  6:58         ` Alexander Duyck
  0 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-29  6:12 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 2015年10月26日 23:03, Alexander Duyck wrote:
> No.  I think you are missing the fact that there are 256 descriptors per
> page.  As such if you dirty just 1 you will be pulling in 255 more, of
> which you may or may not have pulled in the receive buffer for.
> 
> So for example if you have the descriptor ring size set to 256 then that
> means you are going to get whatever the descriptor ring has since you
> will be marking the entire ring dirty with every packet processed,
> however you cannot guarantee that you are going to get all of the
> receive buffers unless you go through and flush the entire ring prior to
> migrating.


Yes, that will be a problem. How about adding a tag to each Rx buffer
and checking the tag when delivering the Rx buffer to the stack? If the
tag has been overwritten, this means the packet data has been migrated.


> 
> This is why I have said you will need to do something to force the rings
> to be flushed such as initiating a PM suspend prior to migrating.  You
> need to do something to stop the DMA and flush the remaining Rx buffers
> if you want to have any hope of being able to migrate the Rx in a
> consistent state.  Beyond that the only other thing you have to worry
> about are the Rx buffers that have already been handed off to the
> stack.  However those should be handled if you do a suspend and somehow
> flag pages as dirty when they are unmapped from the DMA.
> 
> - Alex

This will be simple and may be our first version to enable migration.
But we still hope to find a way to avoid disabling DMA before stopping
the VCPU, to decrease service downtime.

-- 
Best regards
Tianyu Lan

* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-29  6:12       ` Lan Tianyu
@ 2015-10-29  6:58         ` Alexander Duyck
  2015-10-29  8:33           ` Lan Tianyu
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Duyck @ 2015-10-29  6:58 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/28/2015 11:12 PM, Lan Tianyu wrote:
> On 2015-10-26 23:03, Alexander Duyck wrote:
>> No.  I think you are missing the fact that there are 256 descriptors per
>> page.  As such if you dirty just 1 you will be pulling in 255 more, and
>> you may or may not have pulled in their receive buffers.
>>
>> So for example if you have the descriptor ring size set to 256 then that
>> means you are going to get whatever the descriptor ring has since you
>> will be marking the entire ring dirty with every packet processed,
>> however you cannot guarantee that you are going to get all of the
>> receive buffers unless you go through and flush the entire ring prior to
>> migrating.
>
> Yes, that will be a problem. How about adding a tag to each Rx buffer and
> checking the tag when delivering the Rx buffer to the stack? If the tag has
> been overwritten, it means the packet data has been migrated.

Then you have to come up with a pattern that you can guarantee is the 
tag and not part of the packet data.  That isn't going to be something 
that is easy to do.  It would also have a serious performance impact on 
the VF.

>> This is why I have said you will need to do something to force the rings
>> to be flushed such as initiating a PM suspend prior to migrating.  You
>> need to do something to stop the DMA and flush the remaining Rx buffers
>> if you want to have any hope of being able to migrate the Rx in a
>> consistent state.  Beyond that the only other thing you have to worry
>> about are the Rx buffers that have already been handed off to the
>> stack.  However those should be handled if you do a suspend and somehow
>> flag pages as dirty when they are unmapped from the DMA.
>>
>> - Alex
> This would be simple and could be our first version to enable migration. But
> we still hope to find a way to avoid disabling DMA before stopping the VCPU,
> to decrease service downtime.

You have to stop the Rx DMA at some point anyway.  It is the only means 
to guarantee that the device stops updating buffers and descriptors so 
that you will have a consistent state.

Your code was having to do a bunch of shuffling in order to get things 
set up so that you could bring the interface back up.  I would argue 
that it may actually be faster, at least on the bring-up, to just drop the 
old rings and start over since it greatly reduces the complexity and the 
amount of device related data that has to be moved.


* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-29  6:58         ` Alexander Duyck
@ 2015-10-29  8:33           ` Lan Tianyu
  2015-10-29 16:17             ` Alexander Duyck
  0 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-29  8:33 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 2015-10-29 14:58, Alexander Duyck wrote:
> 
> Your code was having to do a bunch of shuffling in order to get things
> set up so that you could bring the interface back up.  I would argue
> that it may actually be faster at least on the bring-up to just drop the
> old rings and start over since it greatly reduced the complexity and the
> amount of device related data that has to be moved.

If we give up the old ring after migration and keep DMA running until the
VCPU is stopped, it seems we don't need to track the Tx/Rx descriptor rings
and just need to make sure that all Rx buffers delivered to the stack have
been migrated.

1) Dummy-write the Rx buffer before checking the Rx descriptor to ensure
the packet is migrated first.

2) Make a copy of the Rx descriptor and then use the copied data to check
the buffer status. Don't use the original descriptor, because it won't be
migrated and migration may happen between two accesses of the Rx descriptor.

-- 
Best regards
Tianyu Lan


* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-29  8:33           ` Lan Tianyu
@ 2015-10-29 16:17             ` Alexander Duyck
  2015-10-30  2:41               ` Lan Tianyu
  0 siblings, 1 reply; 56+ messages in thread
From: Alexander Duyck @ 2015-10-29 16:17 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/29/2015 01:33 AM, Lan Tianyu wrote:
> On 2015-10-29 14:58, Alexander Duyck wrote:
>> Your code was having to do a bunch of shuffling in order to get things
>> set up so that you could bring the interface back up.  I would argue
>> that it may actually be faster at least on the bring-up to just drop the
>> old rings and start over since it greatly reduced the complexity and the
>> amount of device related data that has to be moved.
> If we give up the old ring after migration and keep DMA running until the
> VCPU is stopped, it seems we don't need to track the Tx/Rx descriptor rings
> and just need to make sure that all Rx buffers delivered to the stack have
> been migrated.
>
> 1) Dummy-write the Rx buffer before checking the Rx descriptor to ensure
> the packet is migrated first.

Don't dummy write the Rx descriptor.  You should only really need to 
dummy write the Rx buffer and you would do so after checking the 
descriptor, not before.  Otherwise you risk corrupting the Rx buffer 
because it is possible for you to read the Rx buffer, DMA occurs, and 
then you write back the Rx buffer and now you have corrupted the memory.
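
A minimal userspace model of that ordering constraint (illustrative types,
not the real ixgbevf structures): the buffer may only be dirtied after the
descriptor's DD bit shows the device is done with it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct rx_desc {
	uint32_t status;	/* bit 0 = DD (descriptor done), set by the device */
	uint32_t length;
};
#define RX_DESC_STAT_DD 0x1u

/* Dirty the Rx buffer only after the DD check; writing before the check
 * could race the device's DMA and write stale bytes over a fresh frame. */
static bool rx_check_then_dirty(const struct rx_desc *desc, uint8_t *buf)
{
	if (!(desc->status & RX_DESC_STAT_DD))
		return false;	/* device still owns the buffer: hands off */
	buf[0] = buf[0];	/* safe dummy write: marks the page dirty */
	return true;
}
```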

> 2) Make a copy of the Rx descriptor and then use the copied data to check
> the buffer status. Don't use the original descriptor, because it won't be
> migrated and migration may happen between two accesses of the Rx descriptor.

Do not just blindly copy the Rx descriptor ring.  That is a recipe for 
disaster.  The problem is DMA has to happen in a very specific order for 
things to function correctly.  The Rx buffer has to be written and then 
the Rx descriptor.  The problem is you will end up getting a read-ahead 
on the Rx descriptor ring regardless of which order you dirty things in.

The descriptor is only 16 bytes, you can fit 256 of them in a single 
page.  There is a good chance you probably wouldn't be able to migrate 
if you were under heavy network stress, however you could still have 
several buffers written in the time it takes for you to halt the VM and 
migrate the remaining pages.  Those buffers wouldn't be marked as dirty 
but odds are the page the descriptors are in would be.  As such you will 
end up with the descriptors but not the buffers.

The only way you could possibly migrate the descriptors rings cleanly 
would be to have enough knowledge about the layout of things to force 
the descriptor rings to be migrated first followed by all of the 
currently mapped Rx buffers.  In addition you would need to have some 
means of tracking all of the Rx buffers such as an emulated IOMMU as you 
would need to migrate all of them, not just part.  By doing it this way 
you would get the Rx descriptor rings in the earliest state possible and 
would be essentially emulating the Rx buffer writes occurring before the 
Rx descriptor writes.  You would likely have several Rx buffer writes 
that would be discarded in the process as there would be no descriptor 
for them but at least the state of the system would be consistent.

- Alex


* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-29 16:17             ` Alexander Duyck
@ 2015-10-30  2:41               ` Lan Tianyu
  2015-10-30 18:04                 ` Alexander Duyck
  0 siblings, 1 reply; 56+ messages in thread
From: Lan Tianyu @ 2015-10-30  2:41 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 2015-10-30 00:17, Alexander Duyck wrote:
> On 10/29/2015 01:33 AM, Lan Tianyu wrote:
>> On 2015-10-29 14:58, Alexander Duyck wrote:
>>> Your code was having to do a bunch of shuffling in order to get things
>>> set up so that you could bring the interface back up.  I would argue
>>> that it may actually be faster at least on the bring-up to just drop the
>>> old rings and start over since it greatly reduced the complexity and the
>>> amount of device related data that has to be moved.
>> If we give up the old ring after migration and keep DMA running until the
>> VCPU is stopped, it seems we don't need to track the Tx/Rx descriptor rings
>> and just need to make sure that all Rx buffers delivered to the stack have
>> been migrated.
>>
>> 1) Dummy-write the Rx buffer before checking the Rx descriptor to ensure
>> the packet is migrated first.
> 
> Don't dummy write the Rx descriptor.  You should only really need to
> dummy write the Rx buffer and you would do so after checking the
> descriptor, not before.  Otherwise you risk corrupting the Rx buffer
> because it is possible for you to read the Rx buffer, DMA occurs, and
> then you write back the Rx buffer and now you have corrupted the memory.
> 
>> 2) Make a copy of the Rx descriptor and then use the copied data to check
>> the buffer status. Don't use the original descriptor, because it won't be
>> migrated and migration may happen between two accesses of the Rx
>> descriptor.
> 
> Do not just blindly copy the Rx descriptor ring.  That is a recipe for
> disaster.  The problem is DMA has to happen in a very specific order for
> things to function correctly.  The Rx buffer has to be written and then
> the Rx descriptor.  The problem is you will end up getting a read-ahead
> on the Rx descriptor ring regardless of which order you dirty things in.


Sorry, I didn't explain it clearly.
I meant to copy one Rx descriptor at a time when receiving an Rx irq and
handling the Rx ring.

The current code in ixgbevf_clean_rx_irq() checks the status of the Rx
descriptor to see whether its Rx buffer has been populated with data, and
then reads the packet length from the Rx descriptor to handle the Rx buffer.

My idea is to do the following three steps when receiving an Rx buffer in
ixgbevf_clean_rx_irq():

(1) Dummy-write the Rx buffer first.
(2) Make a copy of its Rx descriptor.
(3) Check the buffer status and get the length from the copy.
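
As a userspace sketch (hypothetical layout and names, not real driver code),
the three steps might look like this; whether the dummy write in step (1) is
safe against in-flight DMA is contested later in the thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct rx_desc {
	uint32_t status;	/* bit 0 = DD, written by the device */
	uint32_t length;
};
#define RX_DESC_STAT_DD 0x1u

static bool rx_fetch_from_copy(uint8_t *buf, const struct rx_desc *ring_desc,
			       uint32_t *len_out)
{
	struct rx_desc copy;

	buf[0] = buf[0];			/* (1) dummy write: dirty the buffer page */
	memcpy(&copy, ring_desc, sizeof(copy));	/* (2) snapshot the live descriptor */
	if (!(copy.status & RX_DESC_STAT_DD))	/* (3) check only the private copy */
		return false;
	*len_out = copy.length;
	return true;
}
```

Checking only the private copy means a migration between (2) and (3) cannot
tear the driver's view of the descriptor mid-check.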

Migration may happen at any time.

If it happens between (1) and (2): if the Rx buffer has been populated with
data, the VF driver will not know that on the new machine because the Rx
descriptor isn't migrated. But it's still safe.

If it happens between (2) and (3): the copy will be migrated to the new
machine and the Rx buffer is migrated first. If there is data in the Rx
buffer, the VF driver can still handle the buffer without migrating the Rx
descriptor.

The next buffers will be ignored since we don't migrate the Rx descriptors
for them. Their status will not show as completed on the new machine.

-- 
Best regards
Tianyu Lan


* Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
  2015-10-30  2:41               ` Lan Tianyu
@ 2015-10-30 18:04                 ` Alexander Duyck
  0 siblings, 0 replies; 56+ messages in thread
From: Alexander Duyck @ 2015-10-30 18:04 UTC (permalink / raw)
  To: Lan Tianyu, bhelgaas, carolyn.wyborny, donald.c.skidmore,
	eddie.dong, nrupal.jani, yang.z.zhang, agraf, kvm, pbonzini,
	qemu-devel, emil.s.tantilov, intel-wired-lan, jeffrey.t.kirsher,
	jesse.brandeburg, john.ronciak, linux-kernel, linux-pci,
	matthew.vick, mitch.a.williams, netdev, shannon.nelson

On 10/29/2015 07:41 PM, Lan Tianyu wrote:
On 2015-10-30 00:17, Alexander Duyck wrote:
>> On 10/29/2015 01:33 AM, Lan Tianyu wrote:
>> On 2015-10-29 14:58, Alexander Duyck wrote:
>>>> Your code was having to do a bunch of shuffling in order to get things
>>>> set up so that you could bring the interface back up.  I would argue
>>>> that it may actually be faster at least on the bring-up to just drop the
>>>> old rings and start over since it greatly reduced the complexity and the
>>>> amount of device related data that has to be moved.
>>> If we give up the old ring after migration and keep DMA running until the
>>> VCPU is stopped, it seems we don't need to track the Tx/Rx descriptor rings
>>> and just need to make sure that all Rx buffers delivered to the stack have
>>> been migrated.
>>>
>>> 1) Dummy-write the Rx buffer before checking the Rx descriptor to ensure
>>> the packet is migrated first.
>> Don't dummy write the Rx descriptor.  You should only really need to
>> dummy write the Rx buffer and you would do so after checking the
>> descriptor, not before.  Otherwise you risk corrupting the Rx buffer
>> because it is possible for you to read the Rx buffer, DMA occurs, and
>> then you write back the Rx buffer and now you have corrupted the memory.
>>
>>> 2) Make a copy of the Rx descriptor and then use the copied data to check
>>> the buffer status. Don't use the original descriptor, because it won't be
>>> migrated and migration may happen between two accesses of the Rx
>>> descriptor.
>> Do not just blindly copy the Rx descriptor ring.  That is a recipe for
>> disaster.  The problem is DMA has to happen in a very specific order for
>> things to function correctly.  The Rx buffer has to be written and then
>> the Rx descriptor.  The problem is you will end up getting a read-ahead
>> on the Rx descriptor ring regardless of which order you dirty things in.
>
> Sorry, I didn't explain it clearly.
> I meant to copy one Rx descriptor at a time when receiving an Rx irq and
> handling the Rx ring.

No, I understood what you are saying.  My explanation was that it will 
not work.

> The current code in ixgbevf_clean_rx_irq() checks the status of the Rx
> descriptor to see whether its Rx buffer has been populated with data, and
> then reads the packet length from the Rx descriptor to handle the Rx buffer.

That part you have correct.  However there are very explicit rules about 
the ordering of the reads.

> My idea is to do the following three steps when receiving an Rx buffer in
> ixgbevf_clean_rx_irq():
>
> (1) Dummy-write the Rx buffer first.

You cannot dummy-write the Rx buffer without first being given ownership 
of it.  In the driver this is handled in two phases.  First we have to 
read the DD bit to see if it is set.  If it is, we can take ownership of 
the buffer.  Second, we have to do either a dma_sync_single_range_for_cpu 
or a dma_unmap_page call so that we can guarantee the data has been moved 
to the buffer by the DMA API and that it knows it should no longer be 
accessing it.
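
A userspace model of that two-phase hand-off (the fake_dma_sync_for_cpu call
is an illustrative stand-in for the kernel's dma_sync_single_range_for_cpu /
dma_unmap_page; none of these names are real driver code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct rx_slot {
	uint32_t status;	/* bit 0 = DD, set by the device */
	bool cpu_owns;		/* set once the (modelled) DMA API hands over */
	uint8_t buf[64];
};
#define RX_DESC_STAT_DD 0x1u

/* Stand-in for dma_sync_single_range_for_cpu()/dma_unmap_page(). */
static void fake_dma_sync_for_cpu(struct rx_slot *slot)
{
	slot->cpu_owns = true;
}

/* Phase 1: the DD bit grants ownership.  Phase 2: sync for CPU.
 * Only then may the driver read or dummy-write the buffer. */
static bool rx_claim(struct rx_slot *slot)
{
	if (!(slot->status & RX_DESC_STAT_DD))
		return false;
	fake_dma_sync_for_cpu(slot);
	assert(slot->cpu_owns);		/* buffer access is now legitimate */
	slot->buf[0] = slot->buf[0];	/* e.g. the dirty-page dummy write */
	return true;
}
```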

> (2) Make a copy of its Rx descriptor.

This is not advisable.  Unless you can guarantee you are going to only 
read the descriptor after the DD bit is set you cannot guarantee that 
you won't race with device DMA.  The problem is you could have the 
migration occur right in the middle of (2).  If that occurs then you 
will have valid status bits, but the rest of the descriptor would be 
invalid data.

> (3) Check the buffer status and get the length from the copy.

I believe this is the assumption that is leading you down the wrong 
path.  You would have to read the status before you could do the copy.  
You cannot do it after.

> Migration may happen at any time.
> If it happens between (1) and (2): if the Rx buffer has been populated with
> data, the VF driver will not know that on the new machine because the Rx
> descriptor isn't migrated. But it's still safe.

The part I think you are not getting is that DMA can occur between (1) 
and (2).  If, for example, you were doing your dummy write while DMA 
was occurring: you pull in your value, DMA occurs, you write your value 
back, and now you have corrupted an Rx frame by writing stale data into it.

> If it happens between (2) and (3): the copy will be migrated to the new
> machine and the Rx buffer is migrated first. If there is data in the Rx
> buffer, the VF driver can still handle the buffer without migrating the Rx
> descriptor.
>
> The next buffers will be ignored since we don't migrate the Rx descriptors
> for them. Their status will not show as completed on the new machine.

You have kind of lost me on this part.  Why do you believe their 
statuses will not be completed?  How are you going to prevent the Rx 
descriptor ring from being migrated, given that it will be a dirty page by 
virtue of being a bidirectional DMA mapping, where the Rx path writes in 
the addresses of new buffers while the device writes back the status bits 
and lengths?  This is kind of what I was getting at.  The Rx descriptor 
ring will show up as one of the dirtiest spots in the driver since it is 
constantly being overwritten by the CPU in ixgbevf_alloc_rx_buffers.

Anyway we are kind of getting side tracked and I really think the 
solution you have proposed is kind of a dead-end.

What we have to do is come up with a solution that can deal with the 
fact that you are racing against two different entities.  You have to 
avoid racing with the device, while at the same time you have to avoid 
racing with the dirty page migration code.  There are essentially 2 
problems you have to solve.

1.  Rx pages handed off to the stack must be marked as dirty.  For now 
your code seemed to address this via this snippet below from patch 12/12:

> @@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct ixgbevf_ring *rx_ring,
>   {
>   	struct ixgbevf_rx_buffer *rx_buffer;
>   	struct page *page;
> +	u8 *page_addr;
>   
>   	rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
>   	page = rx_buffer->page;
>   	prefetchw(page);
>   
> -	if (likely(!skb)) {
> -		void *page_addr = page_address(page) +
> -				  rx_buffer->page_offset;
> +	/* Mark page dirty */
> +	page_addr = page_address(page) + rx_buffer->page_offset;
> +	*page_addr = *page_addr;
>   
> +	if (likely(!skb)) {
>   		/* prefetch first cache line of first page */
>   		prefetch(page_addr);
>   #if L1_CACHE_BYTES < 128

It will work for now as a proof of concept, but I really would prefer to 
see a solution that is driver agnostic.  Maybe something that could take 
care of it in the DMA API.  For example if you were to use 
"swiotlb=force" in the guest this code wouldn't even be necessary since 
that forces bounce buffers which would mean your DMA mappings are dirty 
pages anyway.
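
For reference, forcing bounce buffers in the guest is only a kernel
command-line change; the GRUB snippet below is an assumed setup for
illustration (exact bootloader syntax varies):

```shell
# /etc/default/grub in the guest: append swiotlb=force to the kernel
# command line so every DMA mapping goes through bounce buffers that
# the CPU copies, which naturally dirties those pages for migration.
GRUB_CMDLINE_LINUX="... swiotlb=force"
# After regenerating the grub config and rebooting, verify with:
#   grep -o 'swiotlb=force' /proc/cmdline
```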

2.  How to deal with a device that might be in the middle of an 
interrupt routine when you decide to migrate.  This is the bit I think 
you might be focusing on a bit too much, and the current solutions you 
have proposed will result in Rx data corruption in the generic case even 
without migration.  There are essentially 2 possible solutions that you 
could explore.

2a.  Have a VF device that is aware something is taking place and have 
it yield via something like a PCI hot-plug pause request.  I don't know 
if the Linux kernel supports something like that now since pause support 
in the OS is optional in the PCI hot-plug specification, but essentially 
it would be a request to do a PM suspend.  You would issue a hot-plug 
pause and know when it is completed by the fact that the PCI Bus Master 
bit is cleared in the VF.  Then you complete the migration and in the 
new guest you could issue a hot-plug event to restart operation.

2b.  Come up with some sort of pseudo IOMMU interface the VF has to use 
to map DMA, and provide an interface to quiesce the devices attached to 
the VM so that DMA can no longer occur.  Once you have disabled bus 
mastering on the VF you could then go through and migrate all DMA mapped 
pages.  As far as resuming on the other side you would somehow need to 
poke the VF to get it to realize the rings are no longer initialized and 
the mailbox is out-of-sync.  Once that happens the VF could reset and 
resume operation.


end of thread, other threads:[~2015-10-30 18:04 UTC | newest]

Thread overview: 56+ messages
2015-10-21 16:37 [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Lan Tianyu
2015-10-21 16:37 ` [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device Lan Tianyu
2015-10-21 18:07   ` Alexander Duyck
2015-10-24 14:46     ` Lan, Tianyu
2015-10-21 16:37 ` [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver Lan Tianyu
2015-10-21 20:34   ` Alexander Duyck
2015-10-21 16:37 ` [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate " Lan Tianyu
2015-10-21 20:45   ` Alexander Duyck
2015-10-25  7:21     ` Lan, Tianyu
2015-10-21 16:37 ` [RFC Patch 04/12] IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg Lan Tianyu
2015-10-21 16:37 ` [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf" Lan Tianyu
2015-10-21 20:52   ` Alexander Duyck
2015-10-22 12:51     ` Michael S. Tsirkin
2015-10-24 15:43     ` Lan, Tianyu
2015-10-25  6:03       ` Alexander Duyck
2015-10-25  6:45         ` Lan, Tianyu
2015-10-21 16:37 ` [RFC Patch 06/12] IXGBEVF: Add self emulation layer Lan Tianyu
2015-10-21 20:58   ` Alexander Duyck
2015-10-22 12:50     ` [Qemu-devel] " Michael S. Tsirkin
2015-10-22 15:50       ` Alexander Duyck
2015-10-21 16:37 ` [RFC Patch 07/12] IXGBEVF: Add new mail box event for migration Lan Tianyu
2015-10-21 16:37 ` [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package Lan Tianyu
2015-10-21 21:14   ` Alexander Duyck
2015-10-24 16:12     ` Lan, Tianyu
2015-10-22 12:58   ` Michael S. Tsirkin
2015-10-24 16:08     ` Lan, Tianyu
2015-10-21 16:37 ` [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver Lan Tianyu
2015-10-21 21:48   ` Alexander Duyck
2015-10-22 12:46   ` Michael S. Tsirkin
2015-10-21 16:37 ` [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation Lan Tianyu
2015-10-21 21:55   ` Alexander Duyck
2015-10-22 12:40   ` Michael S. Tsirkin
2015-10-21 16:37 ` [RFC Patch 11/12] IXGBEVF: Migrate VF statistic data Lan Tianyu
2015-10-22 12:36   ` Michael S. Tsirkin
2015-10-21 16:37 ` [RFC Patch 12/12] IXGBEVF: Track dma dirty pages Lan Tianyu
2015-10-22 12:30   ` Michael S. Tsirkin
2015-10-21 18:45 ` [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC Or Gerlitz
2015-10-21 19:20   ` Alex Williamson
2015-10-21 23:26     ` Alexander Duyck
2015-10-22 12:32     ` [Qemu-devel] " Michael S. Tsirkin
2015-10-22 13:01       ` Alex Williamson
2015-10-22 13:06         ` Michael S. Tsirkin
2015-10-22 15:58     ` Or Gerlitz
2015-10-22 16:17       ` Alex Williamson
2015-10-22 12:55 ` [Qemu-devel] " Michael S. Tsirkin
2015-10-23 18:36 ` Alexander Duyck
2015-10-23 19:05   ` Alex Williamson
2015-10-23 20:01     ` Alexander Duyck
2015-10-26  5:36   ` Lan Tianyu
2015-10-26 15:03     ` Alexander Duyck
2015-10-29  6:12       ` Lan Tianyu
2015-10-29  6:58         ` Alexander Duyck
2015-10-29  8:33           ` Lan Tianyu
2015-10-29 16:17             ` Alexander Duyck
2015-10-30  2:41               ` Lan Tianyu
2015-10-30 18:04                 ` Alexander Duyck
