All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
@ 2009-02-20  6:54 Yu Zhao
  2009-02-20  6:54 ` [PATCH v10 1/7] PCI: initialize and release SR-IOV capability Yu Zhao
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Yu Zhao @ 2009-02-20  6:54 UTC (permalink / raw)
  To: jbarnes; +Cc: linux-pci, kvm, linux-kernel, Yu Zhao

Greetings,

Following patches are intended to support SR-IOV capability in the
Linux kernel. With these patches, people can turn a PCI device with
the capability into multiple ones from software perspective, which
will benefit KVM and achieve other purposes such as QoS, security,
and etc.

SR-IOV specification can be found at:
  http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf
(it requires membership.)

Devices that support SR-IOV are available from following vendors:
  http://download.intel.com/design/network/ProdBrf/320025.pdf
  http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf
  http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf

Physical Function driver patches for Intel 82576 NIC are available:
  http://patchwork.kernel.org/patch/8063/
  http://patchwork.kernel.org/patch/8064/
  http://patchwork.kernel.org/patch/8065/
  http://patchwork.kernel.org/patch/8066/

Major changes from v9 to v10:
  1, minor fix in pci_restore_iov_state().
  2, respin against the latest tree.

Yu Zhao (7):
  PCI: initialize and release SR-IOV capability
  PCI: restore saved SR-IOV state
  PCI: reserve bus range for SR-IOV device
  PCI: add SR-IOV API for Physical Function driver
  PCI: handle SR-IOV Virtual Function Migration
  PCI: document SR-IOV sysfs entries
  PCI: manual for SR-IOV user and driver developer

 Documentation/ABI/testing/sysfs-bus-pci |   27 ++
 Documentation/DocBook/kernel-api.tmpl   |    1 +
 Documentation/PCI/pci-iov-howto.txt     |   99 +++++
 drivers/pci/Kconfig                     |   13 +
 drivers/pci/Makefile                    |    3 +
 drivers/pci/iov.c                       |  711 +++++++++++++++++++++++++++++++
 drivers/pci/pci.c                       |    8 +
 drivers/pci/pci.h                       |   53 +++
 drivers/pci/probe.c                     |    7 +
 include/linux/pci.h                     |   28 ++
 include/linux/pci_regs.h                |   33 ++
 11 files changed, 983 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt
 create mode 100644 drivers/pci/iov.c


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
  2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
@ 2009-02-20  6:54 ` Yu Zhao
  2009-03-06 20:08   ` Matthew Wilcox
  2009-02-20  6:54 ` [PATCH v10 2/7] PCI: restore saved SR-IOV state Yu Zhao
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-02-20  6:54 UTC (permalink / raw)
  To: jbarnes; +Cc: linux-pci, kvm, linux-kernel, Yu Zhao

Signed-off-by: Yu Zhao <yu.zhao@intel.com>
---
 drivers/pci/Kconfig      |   13 ++++
 drivers/pci/Makefile     |    3 +
 drivers/pci/iov.c        |  181 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/pci.c        |    7 ++
 drivers/pci/pci.h        |   37 ++++++++++
 drivers/pci/probe.c      |    4 +
 include/linux/pci.h      |    8 ++
 include/linux/pci_regs.h |   33 +++++++++
 8 files changed, 286 insertions(+), 0 deletions(-)
 create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2a4501d..e8ea3e8 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -59,3 +59,16 @@ config HT_IRQ
 	   This allows native hypertransport devices to use interrupts.
 
 	   If unsure say Y.
+
+config PCI_IOV
+	bool "PCI IOV support"
+	depends on PCI
+	select PCI_MSI
+	default n
+	help
+	  PCI-SIG I/O Virtualization (IOV) Specifications support.
+	  Single Root IOV: allows the Physical Function driver to enable
+	  the hardware capability, so the Virtual Function is accessible
+	  via the PCI Configuration Space using its own Bus, Device and
+	  Function Numbers. Each Virtual Function also has the PCI Memory
+	  Space to map the device specific register set.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 3d07ce2..ba99282 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
 
 obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o
 
+# PCI IOV support
+obj-$(CONFIG_PCI_IOV) += iov.o
+
 #
 # Some architectures use the generic PCI setup functions
 #
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 0000000..e6736d4
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,181 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2009 Intel Corporation, Yu Zhao <yu.zhao@intel.com>
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ *   Single Root IOV 1.0
+ */
+
+#include <linux/pci.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include "pci.h"
+
+
+static int sriov_init(struct pci_dev *dev, int pos)
+{
+	int i;
+	int rc;
+	int nres;
+	u32 pgsz;
+	u16 ctrl, total, offset, stride;
+	struct pci_sriov *iov;
+	struct resource *res;
+	struct pci_dev *pdev;
+
+	if (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
+	    dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)
+		return -ENODEV;
+
+	pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
+	if (ctrl & PCI_SRIOV_CTRL_VFE) {
+		pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+		ssleep(1);
+	}
+
+	pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total);
+	if (!total)
+		return 0;
+
+	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+		if (pdev->sriov)
+			break;
+	if (list_empty(&dev->bus->devices) || !pdev->sriov)
+		pdev = NULL;
+
+	ctrl = 0;
+	if (!pdev && pci_ari_enabled(dev->bus))
+		ctrl |= PCI_SRIOV_CTRL_ARI;
+
+	pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+	pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
+	pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
+	if (!offset || (total > 1 && !stride))
+		return -EIO;
+
+	pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
+	i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
+	pgsz &= ~((1 << i) - 1);
+	if (!pgsz)
+		return -EIO;
+
+	pgsz &= ~(pgsz - 1);
+	pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
+
+	nres = 0;
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = dev->resource + PCI_SRIOV_RESOURCES + i;
+		i += __pci_read_base(dev, pci_bar_unknown, res,
+				     pos + PCI_SRIOV_BAR + i * 4);
+		if (!res->flags)
+			continue;
+		if (resource_size(res) & (PAGE_SIZE - 1)) {
+			rc = -EIO;
+			goto failed;
+		}
+		res->end = res->start + resource_size(res) * total - 1;
+		nres++;
+	}
+
+	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+	if (!iov) {
+		rc = -ENOMEM;
+		goto failed;
+	}
+
+	iov->pos = pos;
+	iov->nres = nres;
+	iov->ctrl = ctrl;
+	iov->total = total;
+	iov->offset = offset;
+	iov->stride = stride;
+	iov->pgsz = pgsz;
+	iov->self = dev;
+	pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap);
+	pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link);
+
+	if (pdev)
+		iov->pdev = pci_dev_get(pdev);
+	else {
+		iov->pdev = dev;
+		mutex_init(&iov->lock);
+	}
+
+	dev->sriov = iov;
+
+	return 0;
+
+failed:
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = dev->resource + PCI_SRIOV_RESOURCES + i;
+		res->flags = 0;
+	}
+
+	return rc;
+}
+
+static void sriov_release(struct pci_dev *dev)
+{
+	if (dev == dev->sriov->pdev)
+		mutex_destroy(&dev->sriov->lock);
+	else
+		pci_dev_put(dev->sriov->pdev);
+
+	kfree(dev->sriov);
+	dev->sriov = NULL;
+}
+
+/**
+ * pci_iov_init - initialize the IOV capability
+ * @dev: the PCI device
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_iov_init(struct pci_dev *dev)
+{
+	int pos;
+
+	if (!dev->is_pcie)
+		return -ENODEV;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
+	if (pos)
+		return sriov_init(dev, pos);
+
+	return -ENODEV;
+}
+
+/**
+ * pci_iov_release - release resources used by the IOV capability
+ * @dev: the PCI device
+ */
+void pci_iov_release(struct pci_dev *dev)
+{
+	if (dev->sriov)
+		sriov_release(dev);
+}
+
+/**
+ * pci_iov_resource_bar - get position of the SR-IOV BAR
+ * @dev: the PCI device
+ * @resno: the resource number
+ * @type: the BAR type to be filled in
+ *
+ * Returns position of the BAR encapsulated in the SR-IOV capability.
+ */
+int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+			 enum pci_bar_type *type)
+{
+	if (resno < PCI_SRIOV_RESOURCES || resno > PCI_SRIOV_RESOURCE_END)
+		return 0;
+
+	BUG_ON(!dev->sriov);
+
+	*type = pci_bar_unknown;
+
+	return dev->sriov->pos + PCI_SRIOV_BAR +
+		4 * (resno - PCI_SRIOV_RESOURCES);
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 6d61200..2eba2a5 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2346,12 +2346,19 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags)
  */
 int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type)
 {
+	int reg;
+
 	if (resno < PCI_ROM_RESOURCE) {
 		*type = pci_bar_unknown;
 		return PCI_BASE_ADDRESS_0 + 4 * resno;
 	} else if (resno == PCI_ROM_RESOURCE) {
 		*type = pci_bar_mem32;
 		return dev->rom_base_reg;
+	} else if (resno < PCI_BRIDGE_RESOURCES) {
+		/* device specific resource */
+		reg = pci_iov_resource_bar(dev, resno, type);
+		if (reg)
+			return reg;
 	}
 
 	dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 07c0aa5..451db74 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -195,4 +195,41 @@ static inline int pci_ari_enabled(struct pci_bus *bus)
 	return bus->self && bus->self->ari_enabled;
 }
 
+/* Single Root I/O Virtualization */
+struct pci_sriov {
+	int pos;		/* capability position */
+	int nres;		/* number of resources */
+	u32 cap;		/* SR-IOV Capabilities */
+	u16 ctrl;		/* SR-IOV Control */
+	u16 total;		/* total VFs associated with the PF */
+	u16 offset;		/* first VF Routing ID offset */
+	u16 stride;		/* following VF stride */
+	u32 pgsz;		/* page size for BAR alignment */
+	u8 link;		/* Function Dependency Link */
+	struct pci_dev *pdev;	/* lowest numbered PF */
+	struct pci_dev *self;	/* this PF */
+	struct mutex lock;	/* lock for VF bus */
+};
+
+#ifdef CONFIG_PCI_IOV
+extern int pci_iov_init(struct pci_dev *dev);
+extern void pci_iov_release(struct pci_dev *dev);
+extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+				enum pci_bar_type *type);
+#else
+static inline int pci_iov_init(struct pci_dev *dev)
+{
+	return -ENODEV;
+}
+static inline void pci_iov_release(struct pci_dev *dev)
+
+{
+}
+static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+				       enum pci_bar_type *type)
+{
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 #endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 55ec44a..03b6f29 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -785,6 +785,7 @@ static int pci_setup_device(struct pci_dev * dev)
 static void pci_release_capabilities(struct pci_dev *dev)
 {
 	pci_vpd_release(dev);
+	pci_iov_release(dev);
 }
 
 /**
@@ -972,6 +973,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
 
 	/* Alternative Routing-ID Forwarding */
 	pci_enable_ari(dev);
+
+	/* Single Root I/O Virtualization */
+	pci_iov_init(dev);
 }
 
 void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 7bd624b..f4d740e 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -93,6 +93,12 @@ enum {
 	/* #6: expansion ROM resource */
 	PCI_ROM_RESOURCE,
 
+	/* device specific resources */
+#ifdef CONFIG_PCI_IOV
+	PCI_SRIOV_RESOURCES,
+	PCI_SRIOV_RESOURCE_END = PCI_SRIOV_RESOURCES + PCI_SRIOV_NUM_BARS - 1,
+#endif
+
 	/* resources assigned to buses behind the bridge */
 #define PCI_BRIDGE_RESOURCE_NUM 4
 
@@ -180,6 +186,7 @@ struct pci_cap_saved_state {
 
 struct pcie_link_state;
 struct pci_vpd;
+struct pci_sriov;
 
 /*
  * The pci_dev structure is used to describe PCI devices.
@@ -270,6 +277,7 @@ struct pci_dev {
 	struct list_head msi_list;
 #endif
 	struct pci_vpd *vpd;
+	struct pci_sriov *sriov;	/* SR-IOV capability related */
 };
 
 extern struct pci_dev *alloc_pci_dev(void);
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 027815b..4ce5eb0 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -375,6 +375,7 @@
 #define  PCI_EXP_TYPE_UPSTREAM	0x5	/* Upstream Port */
 #define  PCI_EXP_TYPE_DOWNSTREAM 0x6	/* Downstream Port */
 #define  PCI_EXP_TYPE_PCI_BRIDGE 0x7	/* PCI/PCI-X Bridge */
+#define  PCI_EXP_TYPE_RC_END	0x9	/* Root Complex Integrated Endpoint */
 #define PCI_EXP_FLAGS_SLOT	0x0100	/* Slot implemented */
 #define PCI_EXP_FLAGS_IRQ	0x3e00	/* Interrupt message number */
 #define PCI_EXP_DEVCAP		4	/* Device capabilities */
@@ -498,6 +499,7 @@
 #define PCI_EXT_CAP_ID_DSN	3
 #define PCI_EXT_CAP_ID_PWR	4
 #define PCI_EXT_CAP_ID_ARI	14
+#define PCI_EXT_CAP_ID_SRIOV	16
 
 /* Advanced Error Reporting */
 #define PCI_ERR_UNCOR_STATUS	4	/* Uncorrectable Error Status */
@@ -615,4 +617,35 @@
 #define  PCI_ARI_CTRL_ACS	0x0002	/* ACS Function Groups Enable */
 #define  PCI_ARI_CTRL_FG(x)	(((x) >> 4) & 7) /* Function Group */
 
+/* Single Root I/O Virtualization */
+#define PCI_SRIOV_CAP		0x04	/* SR-IOV Capabilities */
+#define  PCI_SRIOV_CAP_VFM	0x01	/* VF Migration Capable */
+#define  PCI_SRIOV_CAP_INTR(x)	((x) >> 21) /* Interrupt Message Number */
+#define PCI_SRIOV_CTRL		0x08	/* SR-IOV Control */
+#define  PCI_SRIOV_CTRL_VFE	0x01	/* VF Enable */
+#define  PCI_SRIOV_CTRL_VFM	0x02	/* VF Migration Enable */
+#define  PCI_SRIOV_CTRL_INTR	0x04	/* VF Migration Interrupt Enable */
+#define  PCI_SRIOV_CTRL_MSE	0x08	/* VF Memory Space Enable */
+#define  PCI_SRIOV_CTRL_ARI	0x10	/* ARI Capable Hierarchy */
+#define PCI_SRIOV_STATUS	0x0a	/* SR-IOV Status */
+#define  PCI_SRIOV_STATUS_VFM	0x01	/* VF Migration Status */
+#define PCI_SRIOV_INITIAL_VF	0x0c	/* Initial VFs */
+#define PCI_SRIOV_TOTAL_VF	0x0e	/* Total VFs */
+#define PCI_SRIOV_NUM_VF	0x10	/* Number of VFs */
+#define PCI_SRIOV_FUNC_LINK	0x12	/* Function Dependency Link */
+#define PCI_SRIOV_VF_OFFSET	0x14	/* First VF Offset */
+#define PCI_SRIOV_VF_STRIDE	0x16	/* Following VF Stride */
+#define PCI_SRIOV_VF_DID	0x1a	/* VF Device ID */
+#define PCI_SRIOV_SUP_PGSIZE	0x1c	/* Supported Page Sizes */
+#define PCI_SRIOV_SYS_PGSIZE	0x20	/* System Page Size */
+#define PCI_SRIOV_BAR		0x24	/* VF BAR0 */
+#define  PCI_SRIOV_NUM_BARS	6	/* Number of VF BARs */
+#define PCI_SRIOV_VFM		0x3c	/* VF Migration State Array Offset*/
+#define  PCI_SRIOV_VFM_BIR(x)	((x) & 7)	/* State BIR */
+#define  PCI_SRIOV_VFM_OFFSET(x) ((x) & ~7)	/* State Offset */
+#define  PCI_SRIOV_VFM_UA	0x0	/* Inactive.Unavailable */
+#define  PCI_SRIOV_VFM_MI	0x1	/* Dormant.MigrateIn */
+#define  PCI_SRIOV_VFM_MO	0x2	/* Active.MigrateOut */
+#define  PCI_SRIOV_VFM_AV	0x3	/* Active.Available */
+
 #endif /* LINUX_PCI_REGS_H */
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v10 2/7] PCI: restore saved SR-IOV state
  2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
  2009-02-20  6:54 ` [PATCH v10 1/7] PCI: initialize and release SR-IOV capability Yu Zhao
@ 2009-02-20  6:54 ` Yu Zhao
  2009-03-06 20:09   ` Matthew Wilcox
  2009-02-20  6:54 ` [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device Yu Zhao
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-02-20  6:54 UTC (permalink / raw)
  To: jbarnes; +Cc: linux-pci, kvm, linux-kernel, Yu Zhao

Signed-off-by: Yu Zhao <yu.zhao@intel.com>
---
 drivers/pci/iov.c |   29 +++++++++++++++++++++++++++++
 drivers/pci/pci.c |    1 +
 drivers/pci/pci.h |    4 ++++
 3 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index e6736d4..3bca8f8 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -128,6 +128,25 @@ static void sriov_release(struct pci_dev *dev)
 	dev->sriov = NULL;
 }
 
+static void sriov_restore_state(struct pci_dev *dev)
+{
+	int i;
+	u16 ctrl;
+	struct pci_sriov *iov = dev->sriov;
+
+	pci_read_config_word(dev, iov->pos + PCI_SRIOV_CTRL, &ctrl);
+	if (ctrl & PCI_SRIOV_CTRL_VFE)
+		return;
+
+	for (i = PCI_SRIOV_RESOURCES; i <= PCI_SRIOV_RESOURCE_END; i++)
+		pci_update_resource(dev, i);
+
+	pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+	if (iov->ctrl & PCI_SRIOV_CTRL_VFE)
+		msleep(100);
+}
+
 /**
  * pci_iov_init - initialize the IOV capability
  * @dev: the PCI device
@@ -179,3 +198,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno,
 	return dev->sriov->pos + PCI_SRIOV_BAR +
 		4 * (resno - PCI_SRIOV_RESOURCES);
 }
+
+/**
+ * pci_restore_iov_state - restore the state of the IOV capability
+ * @dev: the PCI device
+ */
+void pci_restore_iov_state(struct pci_dev *dev)
+{
+	if (dev->sriov)
+		sriov_restore_state(dev);
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2eba2a5..8e21912 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev)
 	}
 	pci_restore_pcix_state(dev);
 	pci_restore_msi_state(dev);
+	pci_restore_iov_state(dev);
 
 	return 0;
 }
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 451db74..b24c9e2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
 				enum pci_bar_type *type);
+extern void pci_restore_iov_state(struct pci_dev *dev);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -230,6 +231,9 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
 {
 	return 0;
 }
+static inline void pci_restore_iov_state(struct pci_dev *dev)
+{
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device
  2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
  2009-02-20  6:54 ` [PATCH v10 1/7] PCI: initialize and release SR-IOV capability Yu Zhao
  2009-02-20  6:54 ` [PATCH v10 2/7] PCI: restore saved SR-IOV state Yu Zhao
@ 2009-02-20  6:54 ` Yu Zhao
  2009-03-06 20:20   ` Matthew Wilcox
  2009-02-20  6:54 ` [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver Yu Zhao
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-02-20  6:54 UTC (permalink / raw)
  To: jbarnes; +Cc: linux-pci, kvm, linux-kernel, Yu Zhao

Signed-off-by: Yu Zhao <yu.zhao@intel.com>
---
 drivers/pci/iov.c   |   34 ++++++++++++++++++++++++++++++++++
 drivers/pci/pci.h   |    5 +++++
 drivers/pci/probe.c |    3 +++
 3 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 3bca8f8..0b80437 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -14,6 +14,16 @@
 #include "pci.h"
 
 
+static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn)
+{
+	u16 bdf;
+
+	bdf = (dev->bus->number << 8) + dev->devfn +
+	      dev->sriov->offset + dev->sriov->stride * id;
+	*busnr = bdf >> 8;
+	*devfn = bdf & 0xff;
+}
+
 static int sriov_init(struct pci_dev *dev, int pos)
 {
 	int i;
@@ -208,3 +218,27 @@ void pci_restore_iov_state(struct pci_dev *dev)
 	if (dev->sriov)
 		sriov_restore_state(dev);
 }
+
+/**
+ * pci_iov_bus_range - find bus range used by Virtual Function
+ * @bus: the PCI bus
+ *
+ * Returns max number of buses (exclude current one) used by Virtual
+ * Functions.
+ */
+int pci_iov_bus_range(struct pci_bus *bus)
+{
+	int max = 0;
+	u8 busnr, devfn;
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		if (!dev->sriov)
+			continue;
+		virtfn_bdf(dev, dev->sriov->total - 1, &busnr, &devfn);
+		if (busnr > max)
+			max = busnr;
+	}
+
+	return max ? max - bus->number : 0;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index b24c9e2..2cf32f5 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev);
 extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
 				enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
+extern int pci_iov_bus_range(struct pci_bus *bus);
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
 static inline void pci_restore_iov_state(struct pci_dev *dev)
 {
 }
+static inline int pci_iov_bus_range(struct pci_bus *bus)
+{
+	return 0;
+}
 #endif /* CONFIG_PCI_IOV */
 
 #endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 03b6f29..4c8abd0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
 	for (devfn = 0; devfn < 0x100; devfn += 8)
 		pci_scan_slot(bus, devfn);
 
+	/* Reserve buses for SR-IOV capability. */
+	max += pci_iov_bus_range(bus);
+
 	/*
 	 * After performing arch-dependent fixup of the bus, look behind
 	 * all PCI-to-PCI bridges on this bus.
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
  2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
                   ` (2 preceding siblings ...)
  2009-02-20  6:54 ` [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device Yu Zhao
@ 2009-02-20  6:54 ` Yu Zhao
  2009-03-06 20:37   ` Matthew Wilcox
  2009-02-20  6:54 ` [PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration Yu Zhao
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-02-20  6:54 UTC (permalink / raw)
  To: jbarnes; +Cc: linux-pci, kvm, linux-kernel, Yu Zhao

Signed-off-by: Yu Zhao <yu.zhao@intel.com>
---
 drivers/pci/iov.c   |  348 +++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/pci.h   |    3 +
 include/linux/pci.h |   14 ++
 3 files changed, 365 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0b80437..8096fc9 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -13,6 +13,8 @@
 #include <linux/delay.h>
 #include "pci.h"
 
+#define VIRTFN_ID_LEN	8
+
 
 static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn)
 {
@@ -24,6 +26,319 @@ static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn)
 	*devfn = bdf & 0xff;
 }
 
+static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
+{
+	int rc;
+	struct pci_bus *child;
+
+	if (bus->number == busnr)
+		return bus;
+
+	child = pci_find_bus(pci_domain_nr(bus), busnr);
+	if (child)
+		return child;
+
+	child = pci_add_new_bus(bus, NULL, busnr);
+	if (!child)
+		return NULL;
+
+	child->subordinate = busnr;
+	child->dev.parent = bus->bridge;
+	rc = pci_bus_add_child(child);
+	if (rc) {
+		pci_remove_bus(child);
+		return NULL;
+	}
+
+	return child;
+}
+
+static void virtfn_remove_bus(struct pci_bus *bus, int busnr)
+{
+	struct pci_bus *child;
+
+	if (bus->number == busnr)
+		return;
+
+	child = pci_find_bus(pci_domain_nr(bus), busnr);
+	BUG_ON(!child);
+
+	if (list_empty(&child->devices))
+		pci_remove_bus(child);
+}
+
+static int virtfn_add(struct pci_dev *dev, int id, int reset)
+{
+	int i;
+	int rc;
+	u64 size;
+	u8 busnr, devfn;
+	char buf[VIRTFN_ID_LEN];
+	struct pci_dev *virtfn;
+	struct resource *res;
+	struct pci_sriov *iov = dev->sriov;
+
+	virtfn = alloc_pci_dev();
+	if (!virtfn)
+		return -ENOMEM;
+
+	virtfn_bdf(dev, id, &busnr, &devfn);
+	mutex_lock(&iov->pdev->sriov->lock);
+	virtfn->bus = virtfn_add_bus(dev->bus, busnr);
+	if (!virtfn->bus) {
+		kfree(virtfn);
+		mutex_unlock(&iov->pdev->sriov->lock);
+		return -ENOMEM;
+	}
+
+	virtfn->sysdata = dev->bus->sysdata;
+	virtfn->dev.parent = dev->dev.parent;
+	virtfn->dev.bus = dev->dev.bus;
+	virtfn->devfn = devfn;
+	virtfn->hdr_type = PCI_HEADER_TYPE_NORMAL;
+	virtfn->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
+	virtfn->error_state = pci_channel_io_normal;
+	virtfn->current_state = PCI_UNKNOWN;
+	virtfn->is_pcie = 1;
+	virtfn->pcie_type = PCI_EXP_TYPE_ENDPOINT;
+	virtfn->dma_mask = 0xffffffff;
+	virtfn->vendor = dev->vendor;
+	virtfn->subsystem_vendor = dev->subsystem_vendor;
+	virtfn->class = dev->class;
+	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
+	pci_read_config_byte(virtfn, PCI_REVISION_ID, &virtfn->revision);
+	pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
+			     &virtfn->subsystem_device);
+
+	dev_set_name(&virtfn->dev, "%04x:%02x:%02x.%d",
+		     pci_domain_nr(virtfn->bus), busnr,
+		     PCI_SLOT(devfn), PCI_FUNC(devfn));
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = dev->resource + PCI_SRIOV_RESOURCES + i;
+		if (!res->parent)
+			continue;
+		virtfn->resource[i].name = pci_name(virtfn);
+		virtfn->resource[i].flags = res->flags;
+		size = resource_size(res);
+		do_div(size, iov->total);
+		virtfn->resource[i].start = res->start + size * id;
+		virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
+		rc = request_resource(res, &virtfn->resource[i]);
+		BUG_ON(rc);
+	}
+
+	if (reset)
+		pci_execute_reset_function(virtfn);
+
+	pci_device_add(virtfn, virtfn->bus);
+	mutex_unlock(&iov->pdev->sriov->lock);
+
+	virtfn->physfn = pci_dev_get(dev);
+
+	rc = pci_bus_add_device(virtfn);
+	if (rc)
+		goto failed1;
+	sprintf(buf, "%d", id);
+	rc = sysfs_create_link(&iov->dev.kobj, &virtfn->dev.kobj, buf);
+	if (rc)
+		goto failed1;
+	rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn");
+	if (rc)
+		goto failed2;
+
+	kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE);
+
+	return 0;
+
+failed2:
+	sysfs_remove_link(&iov->dev.kobj, buf);
+failed1:
+	pci_dev_put(dev);
+	mutex_lock(&iov->pdev->sriov->lock);
+	pci_remove_bus_device(virtfn);
+	virtfn_remove_bus(dev->bus, busnr);
+	mutex_unlock(&iov->pdev->sriov->lock);
+
+	return rc;
+}
+
+static void virtfn_remove(struct pci_dev *dev, int id, int reset)
+{
+	u8 busnr, devfn;
+	char buf[VIRTFN_ID_LEN];
+	struct pci_bus *bus;
+	struct pci_dev *virtfn;
+	struct pci_sriov *iov = dev->sriov;
+
+	virtfn_bdf(dev, id, &busnr, &devfn);
+	bus = pci_find_bus(pci_domain_nr(dev->bus), busnr);
+	if (!bus)
+		return;
+
+	virtfn = pci_get_slot(bus, devfn);
+	if (!virtfn)
+		return;
+
+	pci_dev_put(virtfn);
+
+	if (reset) {
+		device_release_driver(&virtfn->dev);
+		pci_execute_reset_function(virtfn);
+	}
+
+	sprintf(buf, "%d", id);
+	sysfs_remove_link(&iov->dev.kobj, buf);
+	sysfs_remove_link(&virtfn->dev.kobj, "physfn");
+
+	mutex_lock(&iov->pdev->sriov->lock);
+	pci_remove_bus_device(virtfn);
+	virtfn_remove_bus(dev->bus, busnr);
+	mutex_unlock(&iov->pdev->sriov->lock);
+
+	pci_dev_put(dev);
+}
+
+static void sriov_release_dev(struct device *dev)
+{
+	struct pci_sriov *iov = container_of(dev, struct pci_sriov, dev);
+
+	iov->nr_virtfn = 0;
+}
+
+static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
+{
+	int rc;
+	int i, j;
+	int nres;
+	u8 busnr, devfn;
+	u16 offset, stride, initial;
+	struct resource *res;
+	struct pci_dev *link;
+	struct pci_sriov *iov = dev->sriov;
+
+	if (!nr_virtfn)
+		return 0;
+
+	if (iov->nr_virtfn)
+		return -EINVAL;
+
+	pci_read_config_word(dev, iov->pos + PCI_SRIOV_INITIAL_VF, &initial);
+	if (initial > iov->total ||
+	    (!(iov->cap & PCI_SRIOV_CAP_VFM) && (initial != iov->total)))
+		return -EIO;
+
+	if (nr_virtfn < 0 || nr_virtfn > iov->total ||
+	    (!(iov->cap & PCI_SRIOV_CAP_VFM) && (nr_virtfn > initial)))
+		return -EINVAL;
+
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
+	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset);
+	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride);
+	if (!offset || (nr_virtfn > 1 && !stride))
+		return -EIO;
+
+	nres = 0;
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = dev->resource + PCI_SRIOV_RESOURCES + i;
+		if (!res->parent)
+			continue;
+		nres++;
+	}
+	if (nres != iov->nres) {
+		dev_err(&dev->dev, "no enough MMIO for SR-IOV\n");
+		return -ENOMEM;
+	}
+
+	iov->offset = offset;
+	iov->stride = stride;
+
+	virtfn_bdf(dev, nr_virtfn - 1, &busnr, &devfn);
+	if (busnr > dev->bus->subordinate) {
+		dev_err(&dev->dev, "no enough bus range for SR-IOV\n");
+		return -ENOMEM;
+	}
+
+	memset(&iov->dev, 0, sizeof(iov->dev));
+	strcpy(iov->dev.bus_id, "virtfn");
+	iov->dev.parent = &dev->dev;
+	iov->dev.release = sriov_release_dev;
+	rc = device_register(&iov->dev);
+	if (rc)
+		return rc;
+
+	if (iov->link != dev->devfn) {
+		rc = -ENODEV;
+		list_for_each_entry(link, &dev->bus->devices, bus_list) {
+			if (link->sriov && link->devfn == iov->link)
+				rc = sysfs_create_link(&iov->dev.kobj,
+						&link->dev.kobj, "dep_link");
+		}
+		if (rc)
+			goto failed1;
+	}
+
+	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
+	pci_block_user_cfg_access(dev);
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+	msleep(100);
+	pci_unblock_user_cfg_access(dev);
+
+	iov->initial = initial;
+	if (nr_virtfn < initial)
+		initial = nr_virtfn;
+
+	for (i = 0; i < initial; i++) {
+		rc = virtfn_add(dev, i, 0);
+		if (rc)
+			goto failed2;
+	}
+
+	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
+	iov->nr_virtfn = nr_virtfn;
+
+	return 0;
+
+failed2:
+	for (j = 0; j < i; j++)
+		virtfn_remove(dev, j, 0);
+
+	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
+	pci_block_user_cfg_access(dev);
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+	ssleep(1);
+	pci_unblock_user_cfg_access(dev);
+
+	if (iov->link != dev->devfn)
+		sysfs_remove_link(&iov->dev.kobj, "dep_link");
+failed1:
+	device_unregister(&iov->dev);
+
+	return rc;
+}
+
+static void sriov_disable(struct pci_dev *dev)
+{
+	int i;
+	struct pci_sriov *iov = dev->sriov;
+
+	if (!iov->nr_virtfn)
+		return;
+
+	for (i = 0; i < iov->nr_virtfn; i++)
+		virtfn_remove(dev, i, 0);
+
+	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
+	pci_block_user_cfg_access(dev);
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+	ssleep(1);
+	pci_unblock_user_cfg_access(dev);
+
+	if (iov->link != dev->devfn)
+		sysfs_remove_link(&iov->dev.kobj, "dep_link");
+	device_unregister(&iov->dev);
+}
+
 static int sriov_init(struct pci_dev *dev, int pos)
 {
 	int i;
@@ -129,6 +444,8 @@ failed:
 
 static void sriov_release(struct pci_dev *dev)
 {
+	BUG_ON(dev->sriov->nr_virtfn);
+
 	if (dev == dev->sriov->pdev)
 		mutex_destroy(&dev->sriov->lock);
 	else
@@ -152,6 +469,7 @@ static void sriov_restore_state(struct pci_dev *dev)
 		pci_update_resource(dev, i);
 
 	pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, iov->nr_virtfn);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
 	if (iov->ctrl & PCI_SRIOV_CTRL_VFE)
 		msleep(100);
@@ -242,3 +560,33 @@ int pci_iov_bus_range(struct pci_bus *bus)
 
 	return max ? max - bus->number : 0;
 }
+
+/**
+ * pci_enable_sriov - enable the SR-IOV capability
+ * @dev: the PCI device
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
+{
+	might_sleep();
+
+	if (!dev->sriov)
+		return -ENODEV;
+
+	return sriov_enable(dev, nr_virtfn);
+}
+EXPORT_SYMBOL_GPL(pci_enable_sriov);
+
+/**
+ * pci_disable_sriov - disable the SR-IOV capability
+ * @dev: the PCI device
+ */
+void pci_disable_sriov(struct pci_dev *dev)
+{
+	might_sleep();
+
+	if (dev->sriov)
+		sriov_disable(dev);
+}
+EXPORT_SYMBOL_GPL(pci_disable_sriov);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 2cf32f5..9bbf868 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -202,6 +202,8 @@ struct pci_sriov {
 	u32 cap;		/* SR-IOV Capabilities */
 	u16 ctrl;		/* SR-IOV Control */
 	u16 total;		/* total VFs associated with the PF */
+	u16 initial;		/* initial VFs associated with the PF */
+	u16 nr_virtfn;		/* number of VFs available */
 	u16 offset;		/* first VF Routing ID offset */
 	u16 stride;		/* following VF stride */
 	u32 pgsz;		/* page size for BAR alignment */
@@ -209,6 +211,7 @@ struct pci_sriov {
 	struct pci_dev *pdev;	/* lowest numbered PF */
 	struct pci_dev *self;	/* this PF */
 	struct mutex lock;	/* lock for VF bus */
+	struct device dev;
 };
 
 #ifdef CONFIG_PCI_IOV
diff --git a/include/linux/pci.h b/include/linux/pci.h
index f4d740e..3a24ff5 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -278,6 +278,7 @@ struct pci_dev {
 #endif
 	struct pci_vpd *vpd;
 	struct pci_sriov *sriov;	/* SR-IOV capability related */
+	struct pci_dev *physfn;	/* Physical Function the device belongs to */
 };
 
 extern struct pci_dev *alloc_pci_dev(void);
@@ -1202,5 +1203,18 @@ int pci_ext_cfg_avail(struct pci_dev *dev);
 
 void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
 
+#ifdef CONFIG_PCI_IOV
+extern int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+extern void pci_disable_sriov(struct pci_dev *dev);
+#else
+static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
+{
+	return -ENODEV;
+}
+static inline void pci_disable_sriov(struct pci_dev *dev)
+{
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* LINUX_PCI_H */
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration
  2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
                   ` (3 preceding siblings ...)
  2009-02-20  6:54 ` [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver Yu Zhao
@ 2009-02-20  6:54 ` Yu Zhao
  2009-03-06 21:13   ` Matthew Wilcox
  2009-02-20  6:54 ` [PATCH v10 6/7] PCI: document SR-IOV sysfs entries Yu Zhao
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-02-20  6:54 UTC (permalink / raw)
  To: jbarnes; +Cc: linux-pci, kvm, linux-kernel, Yu Zhao

Signed-off-by: Yu Zhao <yu.zhao@intel.com>
---
 drivers/pci/iov.c   |  119 +++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/pci.h   |    4 ++
 include/linux/pci.h |    6 +++
 3 files changed, 129 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 8096fc9..063fe74 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -206,6 +206,97 @@ static void sriov_release_dev(struct device *dev)
 	iov->nr_virtfn = 0;
 }
 
+static int sriov_migration(struct pci_dev *dev)
+{
+	u16 status;
+	struct pci_sriov *iov = dev->sriov;
+
+	if (!iov->nr_virtfn)
+		return 0;
+
+	if (!(iov->cap & PCI_SRIOV_CAP_VFM))
+		return 0;
+
+	pci_read_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, &status);
+	if (!(status & PCI_SRIOV_STATUS_VFM))
+		return 0;
+
+	schedule_work(&iov->mtask);
+
+	return 1;
+}
+
+static void sriov_migration_task(struct work_struct *work)
+{
+	int i;
+	u8 state;
+	u16 status;
+	struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask);
+
+	for (i = iov->initial; i < iov->nr_virtfn; i++) {
+		state = readb(iov->mstate + i);
+		if (state == PCI_SRIOV_VFM_MI) {
+			writeb(PCI_SRIOV_VFM_AV, iov->mstate + i);
+			state = readb(iov->mstate + i);
+			if (state == PCI_SRIOV_VFM_AV)
+				virtfn_add(iov->self, i, 1);
+		} else if (state == PCI_SRIOV_VFM_MO) {
+			virtfn_remove(iov->self, i, 1);
+			writeb(PCI_SRIOV_VFM_UA, iov->mstate + i);
+			state = readb(iov->mstate + i);
+			if (state == PCI_SRIOV_VFM_AV)
+				virtfn_add(iov->self, i, 0);
+		}
+	}
+
+	pci_read_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, &status);
+	status &= ~PCI_SRIOV_STATUS_VFM;
+	pci_write_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, status);
+}
+
+static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn)
+{
+	int bir;
+	u32 table;
+	resource_size_t pa;
+	struct pci_sriov *iov = dev->sriov;
+
+	if (nr_virtfn <= iov->initial)
+		return 0;
+
+	pci_read_config_dword(dev, iov->pos + PCI_SRIOV_VFM, &table);
+	bir = PCI_SRIOV_VFM_BIR(table);
+	if (bir > PCI_STD_RESOURCE_END)
+		return -EIO;
+
+	table = PCI_SRIOV_VFM_OFFSET(table);
+	if (table + nr_virtfn > pci_resource_len(dev, bir))
+		return -EIO;
+
+	pa = pci_resource_start(dev, bir) + table;
+	iov->mstate = ioremap(pa, nr_virtfn);
+	if (!iov->mstate)
+		return -ENOMEM;
+
+	INIT_WORK(&iov->mtask, sriov_migration_task);
+
+	iov->ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR;
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+
+	return 0;
+}
+
+static void sriov_disable_migration(struct pci_dev *dev)
+{
+	struct pci_sriov *iov = dev->sriov;
+
+	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR);
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+
+	cancel_work_sync(&iov->mtask);
+	iounmap(iov->mstate);
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
 	int rc;
@@ -294,6 +385,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 			goto failed2;
 	}
 
+	if (iov->cap & PCI_SRIOV_CAP_VFM) {
+		rc = sriov_enable_migration(dev, nr_virtfn);
+		if (rc)
+			goto failed2;
+	}
+
 	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
 	iov->nr_virtfn = nr_virtfn;
 
@@ -325,6 +422,9 @@ static void sriov_disable(struct pci_dev *dev)
 	if (!iov->nr_virtfn)
 		return;
 
+	if (iov->cap & PCI_SRIOV_CAP_VFM)
+		sriov_disable_migration(dev);
+
 	for (i = 0; i < iov->nr_virtfn; i++)
 		virtfn_remove(dev, i, 0);
 
@@ -590,3 +690,22 @@ void pci_disable_sriov(struct pci_dev *dev)
 		sriov_disable(dev);
 }
 EXPORT_SYMBOL_GPL(pci_disable_sriov);
+
+/**
+ * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration
+ * @dev: the PCI device
+ *
+ * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not.
+ *
+ * Physical Function driver is responsible to register IRQ handler using
+ * VF Migration Interrupt Message Number, and call this function when the
+ * interrupt is generated by the hardware.
+ */
+irqreturn_t pci_sriov_migration(struct pci_dev *dev)
+{
+	if (!dev->sriov)
+		return IRQ_NONE;
+
+	return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_migration);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9bbf868..6764f02 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1,6 +1,8 @@
 #ifndef DRIVERS_PCI_H
 #define DRIVERS_PCI_H
 
+#include <linux/workqueue.h>
+
 #define PCI_CFG_SPACE_SIZE	256
 #define PCI_CFG_SPACE_EXP_SIZE	4096
 
@@ -211,6 +213,8 @@ struct pci_sriov {
 	struct pci_dev *pdev;	/* lowest numbered PF */
 	struct pci_dev *self;	/* this PF */
 	struct mutex lock;	/* lock for VF bus */
+	struct work_struct mtask; /* VF Migration task */
+	u8 __iomem *mstate;	/* VF Migration State Array */
 	struct device dev;
 };
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3a24ff5..d16b913 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -52,6 +52,7 @@
 #include <asm/atomic.h>
 #include <linux/device.h>
 #include <linux/io.h>
+#include <linux/irqreturn.h>
 
 /* Include the ID list */
 #include <linux/pci_ids.h>
@@ -1206,6 +1207,7 @@ void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
 #ifdef CONFIG_PCI_IOV
 extern int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 extern void pci_disable_sriov(struct pci_dev *dev);
+extern irqreturn_t pci_sriov_migration(struct pci_dev *dev);
 #else
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 {
@@ -1214,6 +1216,10 @@ static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 static inline void pci_disable_sriov(struct pci_dev *dev)
 {
 }
+static inline irqreturn_t pci_sriov_migration(struct pci_dev *dev)
+{
+	return IRQ_NONE;
+}
 #endif
 
 #endif /* __KERNEL__ */
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v10 6/7] PCI: document SR-IOV sysfs entries
  2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
                   ` (4 preceding siblings ...)
  2009-02-20  6:54 ` [PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration Yu Zhao
@ 2009-02-20  6:54 ` Yu Zhao
  2009-03-06 21:16   ` Matthew Wilcox
  2009-02-20  6:54 ` [PATCH v10 7/7] PCI: manual for SR-IOV user and driver developer Yu Zhao
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-02-20  6:54 UTC (permalink / raw)
  To: jbarnes; +Cc: linux-pci, kvm, linux-kernel, Yu Zhao

Signed-off-by: Yu Zhao <yu.zhao@intel.com>
---
 Documentation/ABI/testing/sysfs-bus-pci |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index ceddcff..84dc100 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -9,3 +9,30 @@ Description:
 		that some devices may have malformatted data.  If the
 		underlying VPD has a writable section then the
 		corresponding section of this file will be writable.
+
+What:		/sys/bus/pci/devices/.../virtfn/N
+Date:		February 2009
+Contact:	Yu Zhao <yu.zhao@intel.com>
+Description:
+		This symbol link appears when hardware supports SR-IOV
+		capability and Physical Function driver has enabled it.
+		The symbol link points to the PCI device sysfs entry of
+		Virtual Function whose index is N (0...MaxVFs-1).
+
+What:		/sys/bus/pci/devices/.../virtfn/dep_link
+Date:		February 2009
+Contact:	Yu Zhao <yu.zhao@intel.com>
+Description:
+		This symbol link appears when hardware supports SR-IOV
+		capability and Physical Function driver has enabled it,
+		and this device has vendor specific dependencies with
+		others. The symbol link points to the PCI device sysfs
+		entry of Physical Function this device depends on.
+
+What:		/sys/bus/pci/devices/.../physfn
+Date:		February 2009
+Contact:	Yu Zhao <yu.zhao@intel.com>
+Description:
+		This symbol link appears when a device is Virtual Function.
+		The symbol link points to the PCI device sysfs entry of
+		Physical Function this device associates with.
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v10 7/7] PCI: manual for SR-IOV user and driver developer
  2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
                   ` (5 preceding siblings ...)
  2009-02-20  6:54 ` [PATCH v10 6/7] PCI: document SR-IOV sysfs entries Yu Zhao
@ 2009-02-20  6:54 ` Yu Zhao
  2009-03-06 21:17   ` Matthew Wilcox
  2009-02-24 10:47 ` [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Avi Kivity
  2009-03-06 19:44 ` Matthew Wilcox
  8 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-02-20  6:54 UTC (permalink / raw)
  To: jbarnes; +Cc: linux-pci, kvm, linux-kernel, Yu Zhao

Signed-off-by: Yu Zhao <yu.zhao@intel.com>
---
 Documentation/DocBook/kernel-api.tmpl |    1 +
 Documentation/PCI/pci-iov-howto.txt   |   99 +++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/PCI/pci-iov-howto.txt

diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index 5818ff7..506e611 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c
 -->
 !Edrivers/pci/probe.c
 !Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
      </sect1>
      <sect1><title>PCI Hotplug Support Library</title>
 !Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 0000000..fc73ef5
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,99 @@
+		PCI Express I/O Virtualization Howto
+		Copyright (C) 2009 Intel Corporation
+		    Yu Zhao <yu.zhao@intel.com>
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as Physical Function (PF)
+while the virtual devices are referred to as Virtual Functions (VF).
+Allocation of the VF can be dynamically controlled by the PF via
+registers encapsulated in the capability. By default, this feature is
+not enabled and the PF behaves as traditional PCIe device. Once it's
+turned on, each VF's PCI configuration space can be accessed by its own
+Bus, Device and Function Number (Routing ID). And each VF also has PCI
+Memory Space, which is used to map its register set. VF device driver
+operates on the register set so it can be functional and appear as a
+real existing PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+The device driver (PF driver) will control the enabling and disabling
+of the capability via API provided by SR-IOV core. If the hardware
+has SR-IOV capability, loading its PF driver would enable it and all
+VFs associated with the PF.
+
+2.2 How can I use the Virtual Functions
+
+The VF is treated as hot-plugged PCI devices in the kernel, so they
+should be able to work in the same way as real PCI devices. The VF
+requires device driver that is same as a normal PCI device's.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+	int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+	'nr_virtfn' is number of VFs to be enabled.
+
+To disable SR-IOV capability:
+	void pci_disable_sriov(struct pci_dev *dev);
+
+To notify SR-IOV core of Virtual Function Migration:
+	irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+
+3.2 Usage example
+
+Following piece of code illustrates the usage of the SR-IOV API.
+
+static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
+{
+	pci_enable_sriov(dev, NR_VIRTFN);
+
+	...
+
+	return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+	pci_disable_sriov(dev);
+
+	...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+	...
+
+	return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+	...
+
+	return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+	...
+}
+
+static struct pci_driver dev_driver = {
+	.name =		"SR-IOV Physical Function driver",
+	.id_table =	dev_id_table,
+	.probe =	dev_probe,
+	.remove =	__devexit_p(dev_remove),
+	.suspend =	dev_suspend,
+	.resume =	dev_resume,
+	.shutdown =	dev_shutdown,
+};
-- 
1.6.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
                   ` (6 preceding siblings ...)
  2009-02-20  6:54 ` [PATCH v10 7/7] PCI: manual for SR-IOV user and driver developer Yu Zhao
@ 2009-02-24 10:47 ` Avi Kivity
  2009-02-25  1:36   ` Yu Zhao
  2009-03-06 19:33   ` Matthew Wilcox
  2009-03-06 19:44 ` Matthew Wilcox
  8 siblings, 2 replies; 45+ messages in thread
From: Avi Kivity @ 2009-02-24 10:47 UTC (permalink / raw)
  To: Yu Zhao; +Cc: jbarnes, linux-pci, kvm, linux-kernel

Yu Zhao wrote:
> Greetings,
>
> Following patches are intended to support SR-IOV capability in the
> Linux kernel. With these patches, people can turn a PCI device with
> the capability into multiple ones from software perspective, which
> will benefit KVM and achieve other purposes such as QoS, security,
> and etc.
>   

Do those patches allow using a VF on the host (in other words, does the 
kernel emulate config space accesses)?



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-02-24 10:47 ` [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Avi Kivity
@ 2009-02-25  1:36   ` Yu Zhao
  2009-03-06 19:33   ` Matthew Wilcox
  1 sibling, 0 replies; 45+ messages in thread
From: Yu Zhao @ 2009-02-25  1:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: jbarnes, linux-pci, kvm, linux-kernel

On Tue, Feb 24, 2009 at 06:47:38PM +0800, Avi Kivity wrote:
> Yu Zhao wrote:
> > Greetings,
> >
> > Following patches are intended to support SR-IOV capability in the
> > Linux kernel. With these patches, people can turn a PCI device with
> > the capability into multiple ones from software perspective, which
> > will benefit KVM and achieve other purposes such as QoS, security,
> > and etc.
> >   
> 
> Do those patches allow using a VF on the host (in other words, does the 
> kernel emulate config space accesses)?

Yes, if a VF's driver is loaded in the host, the VF works the same way
as normal PCI device.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-02-24 10:47 ` [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Avi Kivity
  2009-02-25  1:36   ` Yu Zhao
@ 2009-03-06 19:33   ` Matthew Wilcox
  2009-03-08 14:30     ` Avi Kivity
  1 sibling, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-06 19:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel

On Tue, Feb 24, 2009 at 12:47:38PM +0200, Avi Kivity wrote:
> Do those patches allow using a VF on the host (in other words, does the 
> kernel emulate config space accesses)?

SR-IOV hardware handles config space accesses to virtual functions.  No
kernel changes needed for that aspect of it.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
                   ` (7 preceding siblings ...)
  2009-02-24 10:47 ` [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Avi Kivity
@ 2009-03-06 19:44 ` Matthew Wilcox
  2009-03-07  2:34   ` Greg KH
  8 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-06 19:44 UTC (permalink / raw)
  To: Yu Zhao; +Cc: jbarnes, linux-pci, kvm, linux-kernel

On Fri, Feb 20, 2009 at 02:54:41PM +0800, Yu Zhao wrote:
> Following patches are intended to support SR-IOV capability in the
> Linux kernel. With these patches, people can turn a PCI device with
> the capability into multiple ones from software perspective, which
> will benefit KVM and achieve other purposes such as QoS, security,
> and etc.

I reviewed this round of patches on the plane ... I'll respond to each
patch individually, but in general this all looks much better than the
first round I reviewed.

> Physical Function driver patches for Intel 82576 NIC are available:
>   http://patchwork.kernel.org/patch/8063/
>   http://patchwork.kernel.org/patch/8064/
>   http://patchwork.kernel.org/patch/8065/
>   http://patchwork.kernel.org/patch/8066/

I need to review this driver; I haven't done that yet.  Has anyone else?

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
  2009-02-20  6:54 ` [PATCH v10 1/7] PCI: initialize and release SR-IOV capability Yu Zhao
@ 2009-03-06 20:08   ` Matthew Wilcox
  2009-03-06 22:03     ` Randy Dunlap
                       ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-06 20:08 UTC (permalink / raw)
  To: Yu Zhao; +Cc: jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote:
> +config PCI_IOV
> +	bool "PCI IOV support"
> +	depends on PCI
> +	select PCI_MSI

My understanding is that having 'select' of a config symbol that the
user can choose is bad.  I think we should probably make this 'depends
on PCI_MSI'.

PCI MSI can also be disabled at runtime (and Fedora do by default).
Since SR-IOV really does require MSI, we need to put in a runtime check
to see if pci_msi_enabled() is false.

We don't depend on PCIEPORTBUS (a horribly named symbol).  Should we?
SR-IOV is only supported for PCI Express machines.  I'm not sure of the
right answer here, but I thought I should raise the question.

> +	default n

You don't need this -- the default default is n ;-)

> +	help
> +	  PCI-SIG I/O Virtualization (IOV) Specifications support.
> +	  Single Root IOV: allows the Physical Function driver to enable
> +	  the hardware capability, so the Virtual Function is accessible
> +	  via the PCI Configuration Space using its own Bus, Device and
> +	  Function Numbers. Each Virtual Function also has the PCI Memory
> +	  Space to map the device specific register set.

I'm not convinced this is the most helpful we could be to the user who's
configuring their own kernel.  How about something like this?  (Randy, I
particularly look to you to make my prose less turgid).

	help
	  IO Virtualisation is a PCI feature supported by some devices
	  which allows you to create virtual PCI devices and assign them
	  to guest OSes.  This option needs to be selected in the host
	  or Dom0 kernel, but does not need to be selected in the guest
	  or DomU kernel.  If you don't know whether your hardware supports
	  it, you can check by using lspci to look for the SR-IOV capability.

	  If you have no idea what any of that means, it is safe to
	  answer 'N' here.

> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 3d07ce2..ba99282 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
>  
>  obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o
>  
> +# PCI IOV support
> +obj-$(CONFIG_PCI_IOV) += iov.o

I see you're following the gerneal style in this file, but the comments
really add no value.  I should send a patch to take out the existing ones.

> +	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> +		if (pdev->sriov)
> +			break;
> +	if (list_empty(&dev->bus->devices) || !pdev->sriov)
> +		pdev = NULL;
> +	ctrl = 0;
> +	if (!pdev && pci_ari_enabled(dev->bus))
> +		ctrl |= PCI_SRIOV_CTRL_ARI;
> +

I don't like this loop.  At the end of a list_for_each_entry() loop,
pdev will not be pointing at a pci_device, it'll be pointing to some
offset from &dev->bus->devices.  So checking pdev->sriov at this point
is really, really bad.  I would prefer to see something like this:

        ctrl = 0;
        list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
                if (pdev->sriov)
                        goto ari_enabled;
        }

        if (pci_ari_enabled(dev->bus))
                ctrl = PCI_SRIOV_CTRL_ARI;
 ari_enabled:
        pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);

> +	if (pdev)
> +		iov->pdev = pci_dev_get(pdev);
> +	else {
> +		iov->pdev = dev;
> +		mutex_init(&iov->lock);
> +	}

Now I'm confused.  Why don't we need to init the mutex if there's another
device on the same bus which also has an iov capability?

> +static void sriov_release(struct pci_dev *dev)
> +{
> +	if (dev == dev->sriov->pdev)
> +		mutex_destroy(&dev->sriov->lock);
> +	else
> +		pci_dev_put(dev->sriov->pdev);
> +
> +	kfree(dev->sriov);
> +	dev->sriov = NULL;
> +}

> +void pci_iov_release(struct pci_dev *dev)
> +{
> +	if (dev->sriov)
> +		sriov_release(dev);
> +}

This seems to be a bit of a design pattern with you, and I'm not quite sure why you do it like this instead of just doing:

void pci_iov_release(struct pci_dev *dev)
{
	if (!dev->sriov)
		return;
	[...]
}

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 2/7] PCI: restore saved SR-IOV state
  2009-02-20  6:54 ` [PATCH v10 2/7] PCI: restore saved SR-IOV state Yu Zhao
@ 2009-03-06 20:09   ` Matthew Wilcox
  0 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-06 20:09 UTC (permalink / raw)
  To: Yu Zhao; +Cc: jbarnes, linux-pci, kvm, linux-kernel

On Fri, Feb 20, 2009 at 02:54:43PM +0800, Yu Zhao wrote:
> Signed-off-by: Yu Zhao <yu.zhao@intel.com>

It needs a changelog ...

> +static void sriov_restore_state(struct pci_dev *dev)
> +{
[...]
> +}
> +
[...]
> +/**
> + * pci_restore_iov_state - restore the state of the IOV capability
> + * @dev: the PCI device
> + */
> +void pci_restore_iov_state(struct pci_dev *dev)
> +{
> +	if (dev->sriov)
> +		sriov_restore_state(dev);
> +}

Apart from the design pattern I mentioned in my previous email also
appearing here, I see no problems with this patch.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device
  2009-02-20  6:54 ` [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device Yu Zhao
@ 2009-03-06 20:20   ` Matthew Wilcox
  2009-03-09  8:13     ` Yu Zhao
  0 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-06 20:20 UTC (permalink / raw)
  To: Yu Zhao; +Cc: jbarnes, linux-pci, kvm, linux-kernel

On Fri, Feb 20, 2009 at 02:54:44PM +0800, Yu Zhao wrote:
> +static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn)
> +{
> +	u16 bdf;
> +
> +	bdf = (dev->bus->number << 8) + dev->devfn +
> +	      dev->sriov->offset + dev->sriov->stride * id;
> +	*busnr = bdf >> 8;
> +	*devfn = bdf & 0xff;
> +}

I find the interface here a bit clunky -- a function returning void
while having two OUT parameters.  How about this variation on the theme
(viewers are encouraged to come up with their own preferred
implementations and interfaces):

static inline __pure u16 virtfn_bdf(struct pci_dev *dev, int id)
{
	return (dev->bus->number << 8) + dev->devfn + dev->sriov->offset +
		dev->sriov->stride * id;
}

#define VIRT_BUS(dev, id)	(virtfn_bdf(dev, id) >> 8)
#define VIRT_DEVFN(dev, id)	(virtfn_bdf(dev, id) & 0xff)

We rely on GCC to do CSE and not actually invoke virtfn_bdf more than
once.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
  2009-02-20  6:54 ` [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver Yu Zhao
@ 2009-03-06 20:37   ` Matthew Wilcox
  2009-03-06 21:48     ` Randy Dunlap
                       ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-06 20:37 UTC (permalink / raw)
  To: Yu Zhao; +Cc: jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Fri, Feb 20, 2009 at 02:54:45PM +0800, Yu Zhao wrote:
> +	virtfn->sysdata = dev->bus->sysdata;
> +	virtfn->dev.parent = dev->dev.parent;
> +	virtfn->dev.bus = dev->dev.bus;
> +	virtfn->devfn = devfn;
> +	virtfn->hdr_type = PCI_HEADER_TYPE_NORMAL;
> +	virtfn->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
> +	virtfn->error_state = pci_channel_io_normal;
> +	virtfn->current_state = PCI_UNKNOWN;
> +	virtfn->is_pcie = 1;
> +	virtfn->pcie_type = PCI_EXP_TYPE_ENDPOINT;
> +	virtfn->dma_mask = 0xffffffff;
> +	virtfn->vendor = dev->vendor;
> +	virtfn->subsystem_vendor = dev->subsystem_vendor;
> +	virtfn->class = dev->class;

There seems to be a certain amount of commonality between this and
pci_scan_device().  Have you considered trying to make a common helper
function, or does it not work out well?

> +	pci_device_add(virtfn, virtfn->bus);

Greg is probably going to ding you here for adding the device, then
creating the symlinks.  I believe it's now best practice to create the
symlinks first, so there's no window where userspace can get confused.

> +	mutex_unlock(&iov->pdev->sriov->lock);

I question the existance of this mutex now.  What's it protecting?

Aren't we going to be implicitly protected by virtue of the Physical
Function device driver being the only one calling this function, and the
driver will be calling it from the ->probe routine which is not called
simultaneously for the same device.

> +	virtfn->physfn = pci_dev_get(dev);
> +
> +	rc = pci_bus_add_device(virtfn);
> +	if (rc)
> +		goto failed1;
> +	sprintf(buf, "%d", id);

%u, perhaps?  And maybe 'id' should always be unsigned?  Just a thought.

> +	rc = sysfs_create_link(&iov->dev.kobj, &virtfn->dev.kobj, buf);
> +	if (rc)
> +		goto failed1;
> +	rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn");
> +	if (rc)
> +		goto failed2;

I'm glad to see these symlinks documented in later patches!

> +	nres = 0;
> +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> +		res = dev->resource + PCI_SRIOV_RESOURCES + i;
> +		if (!res->parent)
> +			continue;
> +		nres++;
> +	}

Can't this be written more simply as:

	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
		res = dev->resource + PCI_SRIOV_RESOURCES + i;
		if (res->parent)
			nres++;
	}
?

> +	if (nres != iov->nres) {
> +		dev_err(&dev->dev, "no enough MMIO for SR-IOV\n");
> +		return -ENOMEM;
> +	}

Randy, can you help us out with better wording here?

> +		dev_err(&dev->dev, "no enough bus range for SR-IOV\n");

and here.

> +	if (iov->link != dev->devfn) {
> +		rc = -ENODEV;
> +		list_for_each_entry(link, &dev->bus->devices, bus_list) {
> +			if (link->sriov && link->devfn == iov->link)
> +				rc = sysfs_create_link(&iov->dev.kobj,
> +						&link->dev.kobj, "dep_link");

I skipped to the end and read patch 7/7 and I still don't understand
what dep_link is for.  Can you explain please?  In particular, how is it
different from physfn?

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration
  2009-02-20  6:54 ` [PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration Yu Zhao
@ 2009-03-06 21:13   ` Matthew Wilcox
  2009-03-09  8:28     ` Yu Zhao
  0 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-06 21:13 UTC (permalink / raw)
  To: Yu Zhao; +Cc: jbarnes, linux-pci, kvm, linux-kernel

On Fri, Feb 20, 2009 at 02:54:46PM +0800, Yu Zhao wrote:
> +static int sriov_migration(struct pci_dev *dev)
> +{
> +	u16 status;
> +	struct pci_sriov *iov = dev->sriov;
> +
> +	if (!iov->nr_virtfn)
> +		return 0;
> +
> +	if (!(iov->cap & PCI_SRIOV_CAP_VFM))
> +		return 0;
> +
> +	pci_read_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, &status);

You passed in dev here, you don't need to use iov->self, right?

> +	if (!(status & PCI_SRIOV_STATUS_VFM))
> +		return 0;
> +
> +	schedule_work(&iov->mtask);
> +
> +	return 1;
> +}

> +/**
> + * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration
> + * @dev: the PCI device
> + *
> + * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not.
> + *
> + * Physical Function driver is responsible to register IRQ handler using
> + * VF Migration Interrupt Message Number, and call this function when the
> + * interrupt is generated by the hardware.
> + */
> +irqreturn_t pci_sriov_migration(struct pci_dev *dev)
> +{
> +	if (!dev->sriov)
> +		return IRQ_NONE;
> +
> +	return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
> +}
> +EXPORT_SYMBOL_GPL(pci_sriov_migration);

OK, I think I get it -- you've basically written an interrupt handler
for the driver to call from its interrupt handler.  Am I right in
thinking that the reason the driver needs to do the interrupt handler
here is because we don't currently have an interface that looks like:

int pci_get_msix_interrupt(struct pci_dev *dev, unsigned vector);

?  If so, we should probably add it; I want it for my MSI-X rewrite
anyway.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 6/7] PCI: document SR-IOV sysfs entries
  2009-02-20  6:54 ` [PATCH v10 6/7] PCI: document SR-IOV sysfs entries Yu Zhao
@ 2009-03-06 21:16   ` Matthew Wilcox
  2009-03-06 22:35     ` Randy Dunlap
  0 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-06 21:16 UTC (permalink / raw)
  To: Yu Zhao; +Cc: jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap


Randy, can you wordsmith this one?

I think I'm starting to understand the difference between physfn and
dep_link, but an example would definitely help.  It may or may not be
appropriate to put it in.

On Fri, Feb 20, 2009 at 02:54:47PM +0800, Yu Zhao wrote:
> Signed-off-by: Yu Zhao <yu.zhao@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci |   27 +++++++++++++++++++++++++++
>  1 files changed, 27 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index ceddcff..84dc100 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -9,3 +9,30 @@ Description:
>  		that some devices may have malformatted data.  If the
>  		underlying VPD has a writable section then the
>  		corresponding section of this file will be writable.
> +
> +What:		/sys/bus/pci/devices/.../virtfn/N
> +Date:		February 2009
> +Contact:	Yu Zhao <yu.zhao@intel.com>
> +Description:
> +		This symbol link appears when hardware supports SR-IOV
> +		capability and Physical Function driver has enabled it.
> +		The symbol link points to the PCI device sysfs entry of
> +		Virtual Function whose index is N (0...MaxVFs-1).
> +
> +What:		/sys/bus/pci/devices/.../virtfn/dep_link
> +Date:		February 2009
> +Contact:	Yu Zhao <yu.zhao@intel.com>
> +Description:
> +		This symbol link appears when hardware supports SR-IOV
> +		capability and Physical Function driver has enabled it,
> +		and this device has vendor specific dependencies with
> +		others. The symbol link points to the PCI device sysfs
> +		entry of Physical Function this device depends on.
> +
> +What:		/sys/bus/pci/devices/.../physfn
> +Date:		February 2009
> +Contact:	Yu Zhao <yu.zhao@intel.com>
> +Description:
> +		This symbol link appears when a device is Virtual Function.
> +		The symbol link points to the PCI device sysfs entry of
> +		Physical Function this device associates with.
> -- 
> 1.6.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 7/7] PCI: manual for SR-IOV user and driver developer
  2009-02-20  6:54 ` [PATCH v10 7/7] PCI: manual for SR-IOV user and driver developer Yu Zhao
@ 2009-03-06 21:17   ` Matthew Wilcox
  0 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-06 21:17 UTC (permalink / raw)
  To: Yu Zhao; +Cc: jbarnes, linux-pci, kvm, linux-kernel


I think we'll need to go a few rounds on sorting this one out.  I don't
have time right now, but I think having this document is important.

On Fri, Feb 20, 2009 at 02:54:48PM +0800, Yu Zhao wrote:
> Signed-off-by: Yu Zhao <yu.zhao@intel.com>
> ---
>  Documentation/DocBook/kernel-api.tmpl |    1 +
>  Documentation/PCI/pci-iov-howto.txt   |   99 +++++++++++++++++++++++++++++++++
>  2 files changed, 100 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/PCI/pci-iov-howto.txt
> 
> diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
> index 5818ff7..506e611 100644
> --- a/Documentation/DocBook/kernel-api.tmpl
> +++ b/Documentation/DocBook/kernel-api.tmpl
> @@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c
>  -->
>  !Edrivers/pci/probe.c
>  !Edrivers/pci/rom.c
> +!Edrivers/pci/iov.c
>       </sect1>
>       <sect1><title>PCI Hotplug Support Library</title>
>  !Edrivers/pci/hotplug/pci_hotplug_core.c
> diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
> new file mode 100644
> index 0000000..fc73ef5
> --- /dev/null
> +++ b/Documentation/PCI/pci-iov-howto.txt
> @@ -0,0 +1,99 @@
> +		PCI Express I/O Virtualization Howto
> +		Copyright (C) 2009 Intel Corporation
> +		    Yu Zhao <yu.zhao@intel.com>
> +
> +
> +1. Overview
> +
> +1.1 What is SR-IOV
> +
> +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
> +capability which makes one physical device appear as multiple virtual
> +devices. The physical device is referred to as Physical Function (PF)
> +while the virtual devices are referred to as Virtual Functions (VF).
> +Allocation of the VF can be dynamically controlled by the PF via
> +registers encapsulated in the capability. By default, this feature is
> +not enabled and the PF behaves as traditional PCIe device. Once it's
> +turned on, each VF's PCI configuration space can be accessed by its own
> +Bus, Device and Function Number (Routing ID). And each VF also has PCI
> +Memory Space, which is used to map its register set. VF device driver
> +operates on the register set so it can be functional and appear as a
> +real existing PCI device.
> +
> +2. User Guide
> +
> +2.1 How can I enable SR-IOV capability
> +
> +The device driver (PF driver) will control the enabling and disabling
> +of the capability via API provided by SR-IOV core. If the hardware
> +has SR-IOV capability, loading its PF driver would enable it and all
> +VFs associated with the PF.
> +
> +2.2 How can I use the Virtual Functions
> +
> +The VF is treated as hot-plugged PCI devices in the kernel, so they
> +should be able to work in the same way as real PCI devices. The VF
> +requires device driver that is same as a normal PCI device's.
> +
> +3. Developer Guide
> +
> +3.1 SR-IOV API
> +
> +To enable SR-IOV capability:
> +	int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> +	'nr_virtfn' is number of VFs to be enabled.
> +
> +To disable SR-IOV capability:
> +	void pci_disable_sriov(struct pci_dev *dev);
> +
> +To notify SR-IOV core of Virtual Function Migration:
> +	irqreturn_t pci_sriov_migration(struct pci_dev *dev);
> +
> +3.2 Usage example
> +
> +Following piece of code illustrates the usage of the SR-IOV API.
> +
> +static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
> +{
> +	pci_enable_sriov(dev, NR_VIRTFN);
> +
> +	...
> +
> +	return 0;
> +}
> +
> +static void __devexit dev_remove(struct pci_dev *dev)
> +{
> +	pci_disable_sriov(dev);
> +
> +	...
> +}
> +
> +static int dev_suspend(struct pci_dev *dev, pm_message_t state)
> +{
> +	...
> +
> +	return 0;
> +}
> +
> +static int dev_resume(struct pci_dev *dev)
> +{
> +	...
> +
> +	return 0;
> +}
> +
> +static void dev_shutdown(struct pci_dev *dev)
> +{
> +	...
> +}
> +
> +static struct pci_driver dev_driver = {
> +	.name =		"SR-IOV Physical Function driver",
> +	.id_table =	dev_id_table,
> +	.probe =	dev_probe,
> +	.remove =	__devexit_p(dev_remove),
> +	.suspend =	dev_suspend,
> +	.resume =	dev_resume,
> +	.shutdown =	dev_shutdown,
> +};
> -- 
> 1.6.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
  2009-03-06 20:37   ` Matthew Wilcox
@ 2009-03-06 21:48     ` Randy Dunlap
  2009-03-09  8:29       ` Yu Zhao
  2009-03-07  2:40     ` Greg KH
  2009-03-09  8:25     ` Yu Zhao
  2 siblings, 1 reply; 45+ messages in thread
From: Randy Dunlap @ 2009-03-06 21:48 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel

Matthew Wilcox wrote:
> On Fri, Feb 20, 2009 at 02:54:45PM +0800, Yu Zhao wrote:
> 
>> +	if (nres != iov->nres) {
>> +		dev_err(&dev->dev, "no enough MMIO for SR-IOV\n");
>> +		return -ENOMEM;
>> +	}

		"not enough MMIO BARs for SR-IOV"
	or
		"not enough MMIO resources for SR-IOV"
	or
		"too few MMIO BARs for SR-IOV"
?

> Randy, can you help us out with better wording here?
> 
>> +		dev_err(&dev->dev, "no enough bus range for SR-IOV\n");
> 
> and here.

		"SR-IOV: bus number too large"
	or
		"SR-IOV: bus number out of range"
	or
		"SR-IOV: cannot allocate valid bus number"
?

>> +	if (iov->link != dev->devfn) {
>> +		rc = -ENODEV;
>> +		list_for_each_entry(link, &dev->bus->devices, bus_list) {
>> +			if (link->sriov && link->devfn == iov->link)
>> +				rc = sysfs_create_link(&iov->dev.kobj,
>> +						&link->dev.kobj, "dep_link");


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
  2009-03-06 20:08   ` Matthew Wilcox
@ 2009-03-06 22:03     ` Randy Dunlap
  2009-03-06 23:31       ` Duyck, Alexander H
  2009-03-07  2:38     ` Greg KH
  2009-03-09  8:12     ` Yu Zhao
  2 siblings, 1 reply; 45+ messages in thread
From: Randy Dunlap @ 2009-03-06 22:03 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel

Matthew Wilcox wrote:
> On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote:
>> +config PCI_IOV
>> +	bool "PCI IOV support"
>> +	depends on PCI
>> +	select PCI_MSI
> 
> My understanding is that having 'select' of a config symbol that the
> user can choose is bad.  I think we should probably make this 'depends
> on PCI_MSI'.

Ack.

> PCI MSI can also be disabled at runtime (and Fedora do by default).
> Since SR-IOV really does require MSI, we need to put in a runtime check
> to see if pci_msi_enabled() is false.
> 
> We don't depend on PCIEPORTBUS (a horribly named symbol).  Should we?
> SR-IOV is only supported for PCI Express machines.  I'm not sure of the
> right answer here, but I thought I should raise the question.
> 
>> +	help
>> +	  PCI-SIG I/O Virtualization (IOV) Specifications support.
>> +	  Single Root IOV: allows the Physical Function driver to enable
>> +	  the hardware capability, so the Virtual Function is accessible
>> +	  via the PCI Configuration Space using its own Bus, Device and
>> +	  Function Numbers. Each Virtual Function also has the PCI Memory
>> +	  Space to map the device specific register set.

Too spec. and implementation specific for users.

> I'm not convinced this is the most helpful we could be to the user who's
> configuring their own kernel.  How about something like this?  (Randy, I
> particularly look to you to make my prose less turgid).
> 
> 	help
> 	  IO Virtualisation is a PCI feature supported by some devices
	             z ;)
> 	  which allows you to create virtual PCI devices and assign them
> 	  to guest OSes.  This option needs to be selected in the host
> 	  or Dom0 kernel, but does not need to be selected in the guest
> 	  or DomU kernel.  If you don't know whether your hardware supports
> 	  it, you can check by using lspci to look for the SR-IOV capability.
> 
> 	  If you have no idea what any of that means, it is safe to
> 	  answer 'N' here.

That's certainly more readable and user-friendly.
I don't know what else it needs.  Looks good to me.

~Randy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 6/7] PCI: document SR-IOV sysfs entries
  2009-03-06 21:16   ` Matthew Wilcox
@ 2009-03-06 22:35     ` Randy Dunlap
  0 siblings, 0 replies; 45+ messages in thread
From: Randy Dunlap @ 2009-03-06 22:35 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel

Matthew Wilcox wrote:
> Randy, can you wordsmith this one?

I'll try.

> I think I'm starting to understand the difference between physfn and
> dep_link, but an example would definitely help.  It may or may not be
> appropriate to put it in.
> 
> On Fri, Feb 20, 2009 at 02:54:47PM +0800, Yu Zhao wrote:
>> Signed-off-by: Yu Zhao <yu.zhao@intel.com>
>> ---
>>  Documentation/ABI/testing/sysfs-bus-pci |   27 +++++++++++++++++++++++++++
>>  1 files changed, 27 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
>> index ceddcff..84dc100 100644
>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>> @@ -9,3 +9,30 @@ Description:
>>  		that some devices may have malformatted data.  If the
>>  		underlying VPD has a writable section then the
>>  		corresponding section of this file will be writable.
>> +
>> +What:		/sys/bus/pci/devices/.../virtfn/N
>> +Date:		February 2009
>> +Contact:	Yu Zhao <yu.zhao@intel.com>
>> +Description:
>> +		This symbol link appears when hardware supports SR-IOV

		     symbolic                          supports the SR-IOV

>> +		capability and Physical Function driver has enabled it.

		               ^the | its | a

>> +		The symbol link points to the PCI device sysfs entry of

		    symbolic 

>> +		Virtual Function whose index is N (0...MaxVFs-1).

		the Virtual Function

>> +
>> +What:		/sys/bus/pci/devices/.../virtfn/dep_link
>> +Date:		February 2009
>> +Contact:	Yu Zhao <yu.zhao@intel.com>
>> +Description:
>> +		This symbol link appears when hardware supports SR-IOV

		     symbolic                          supports the SR-IOV

>> +		capability and Physical Function driver has enabled it,

		               ^its | the | a

>> +		and this device has vendor specific dependencies with
>> +		others. The symbol link points to the PCI device sysfs

		            symbolic

>> +		entry of Physical Function this device depends on.
>> +
>> +What:		/sys/bus/pci/devices/.../physfn
>> +Date:		February 2009
>> +Contact:	Yu Zhao <yu.zhao@intel.com>
>> +Description:
>> +		This symbol link appears when a device is Virtual Function.

		     symbolic                          is a Virtual Function.

>> +		The symbol link points to the PCI device sysfs entry of

		    symbolic                                   entry of the

>> +		Physical Function this device associates with.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
  2009-03-06 22:03     ` Randy Dunlap
@ 2009-03-06 23:31       ` Duyck, Alexander H
  0 siblings, 0 replies; 45+ messages in thread
From: Duyck, Alexander H @ 2009-03-06 23:31 UTC (permalink / raw)
  To: Randy Dunlap, Matthew Wilcox
  Cc: Zhao, Yu, jbarnes, linux-pci, kvm, linux-kernel

Randy Dunlap wrote:
> Matthew Wilcox wrote:
>> On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote:
>> PCI MSI can also be disabled at runtime (and Fedora do by default).
>> Since SR-IOV really does require MSI, we need to put in a runtime
>> check to see if pci_msi_enabled() is false.
>> 
>> We don't depend on PCIEPORTBUS (a horribly named symbol).  Should we?
>> SR-IOV is only supported for PCI Express machines.  I'm not sure of
>> the right answer here, but I thought I should raise the question.
>> 
>>> +	help
>>> +	  PCI-SIG I/O Virtualization (IOV) Specifications support.
>>> +	  Single Root IOV: allows the Physical Function driver to enable
>>> +	  the hardware capability, so the Virtual Function is accessible
>>> +	  via the PCI Configuration Space using its own Bus, Device and
>>> +	  Function Numbers. Each Virtual Function also has the PCI Memory
>>> +	  Space to map the device specific register set.
> 
> Too spec. and implementation specific for users.
> 
>> I'm not convinced this is the most helpful we could be to the user
>> who's configuring their own kernel.  How about something like this? 
>> (Randy, I particularly look to you to make my prose less turgid).
>> 
>> 	help
>> 	  IO Virtualisation is a PCI feature supported by some devices 	   
>> 	  z ;) which allows you to create virtual PCI devices and assign
>> 	  them to guest OSes.  This option needs to be selected in the host
>> 	  or Dom0 kernel, but does not need to be selected in the guest
>> 	  or DomU kernel.  If you don't know whether your hardware supports
>> 	  it, you can check by using lspci to look for the SR-IOV
>> capability. 
>> 
>> 	  If you have no idea what any of that means, it is safe to
>> 	  answer 'N' here.
> 
> That's certainly more readable and user-friendly.
> I don't know what else it needs.  Looks good to me.
> 
> ~Randy

I'm not sure about this help text because SR-IOV and direct assignment are two very different features, and based on this text I would think that SR-IOV is all you need to direct assign VFs into guests when all it actually does is generate the devices that can then be assigned.  We already have questions about if VFs can be used on the host OS and this help text doesn't resolve that and would likely lead to similar questions in the future.

I would recommend keeping things simple and just stating "IO Virtualization is a PCI feature, supported by some devices, which allows the creation of virtual PCI devices that contain a subset of the original device's resources.  If you don't know if you hardware supports it, you can check by using lspci to check for the SR-IOV capability".

-Alex


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-03-06 19:44 ` Matthew Wilcox
@ 2009-03-07  2:34   ` Greg KH
  2009-03-10  1:11     ` Yu Zhao
  0 siblings, 1 reply; 45+ messages in thread
From: Greg KH @ 2009-03-07  2:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel

On Fri, Mar 06, 2009 at 12:44:11PM -0700, Matthew Wilcox wrote:
> > Physical Function driver patches for Intel 82576 NIC are available:
> >   http://patchwork.kernel.org/patch/8063/
> >   http://patchwork.kernel.org/patch/8064/
> >   http://patchwork.kernel.org/patch/8065/
> >   http://patchwork.kernel.org/patch/8066/
> 
> I need to review this driver; I haven't done that yet.  Has anyone else?

The driver was rejected by the upstream developers, who said it would
never be accepted.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
  2009-03-06 20:08   ` Matthew Wilcox
  2009-03-06 22:03     ` Randy Dunlap
@ 2009-03-07  2:38     ` Greg KH
  2009-03-10  1:19       ` Yu Zhao
  2009-03-09  8:12     ` Yu Zhao
  2 siblings, 1 reply; 45+ messages in thread
From: Greg KH @ 2009-03-07  2:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Fri, Mar 06, 2009 at 01:08:10PM -0700, Matthew Wilcox wrote:
> On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote:
> > +	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> > +		if (pdev->sriov)
> > +			break;
> > +	if (list_empty(&dev->bus->devices) || !pdev->sriov)
> > +		pdev = NULL;
> > +	ctrl = 0;
> > +	if (!pdev && pci_ari_enabled(dev->bus))
> > +		ctrl |= PCI_SRIOV_CTRL_ARI;
> > +
> 
> I don't like this loop.  At the end of a list_for_each_entry() loop,
> pdev will not be pointing at a pci_device, it'll be pointing to some
> offset from &dev->bus->devices.  So checking pdev->sriov at this point
> is really, really bad.  I would prefer to see something like this:
> 
>         ctrl = 0;
>         list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
>                 if (pdev->sriov)
>                         goto ari_enabled;
>         }
> 
>         if (pci_ari_enabled(dev->bus))
>                 ctrl = PCI_SRIOV_CTRL_ARI;
>  ari_enabled:
>         pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);

No, please use bus_for_each_dev() instead, or bus_find_device(), don't
walk the bus list by hand.  I'm kind of surprised that even builds.  Hm,
in looking at the 2.6.29-rc kernels, I notice it will not even build at
all, you are now forced to use those functions, which is good.

Has anyone even tried to build this patch recently?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
  2009-03-06 20:37   ` Matthew Wilcox
  2009-03-06 21:48     ` Randy Dunlap
@ 2009-03-07  2:40     ` Greg KH
  2009-03-09  8:25     ` Yu Zhao
  2 siblings, 0 replies; 45+ messages in thread
From: Greg KH @ 2009-03-07  2:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Fri, Mar 06, 2009 at 01:37:18PM -0700, Matthew Wilcox wrote:
> On Fri, Feb 20, 2009 at 02:54:45PM +0800, Yu Zhao wrote:
> > +	virtfn->sysdata = dev->bus->sysdata;
> > +	virtfn->dev.parent = dev->dev.parent;
> > +	virtfn->dev.bus = dev->dev.bus;
> > +	virtfn->devfn = devfn;
> > +	virtfn->hdr_type = PCI_HEADER_TYPE_NORMAL;
> > +	virtfn->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
> > +	virtfn->error_state = pci_channel_io_normal;
> > +	virtfn->current_state = PCI_UNKNOWN;
> > +	virtfn->is_pcie = 1;
> > +	virtfn->pcie_type = PCI_EXP_TYPE_ENDPOINT;
> > +	virtfn->dma_mask = 0xffffffff;
> > +	virtfn->vendor = dev->vendor;
> > +	virtfn->subsystem_vendor = dev->subsystem_vendor;
> > +	virtfn->class = dev->class;
> 
> There seems to be a certain amount of commonality between this and
> pci_scan_device().  Have you considered trying to make a common helper
> function, or does it not work out well?
> 
> > +	pci_device_add(virtfn, virtfn->bus);
> 
> Greg is probably going to ding you here for adding the device, then
> creating the symlinks.  I believe it's now best practice to create the
> symlinks first, so there's no window where userspace can get confused.

If the uevent gets sent before the symlinks are created, it's a bug.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-03-06 19:33   ` Matthew Wilcox
@ 2009-03-08 14:30     ` Avi Kivity
  2009-03-08 15:01       ` Matthew Wilcox
  2009-03-09  3:42       ` Yang, Sheng
  0 siblings, 2 replies; 45+ messages in thread
From: Avi Kivity @ 2009-03-08 14:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel, Yang, Sheng

Matthew Wilcox wrote:
> On Tue, Feb 24, 2009 at 12:47:38PM +0200, Avi Kivity wrote:
>   
>> Do those patches allow using a VF on the host (in other words, does the 
>> kernel emulate config space accesses)?
>>     
>
> SR-IOV hardware handles config space accesses to virtual functions.  No
> kernel changes needed for that aspect of it.
>   

Patches 2 and 3 of the patchset that enables SR/IOV in kvm [1] suggest 
that at the config space is only partially implemented.

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/29034

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-03-08 14:30     ` Avi Kivity
@ 2009-03-08 15:01       ` Matthew Wilcox
  2009-03-09  0:45         ` Greg KH
  2009-03-09  3:42       ` Yang, Sheng
  1 sibling, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-08 15:01 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel, Yang, Sheng

On Sun, Mar 08, 2009 at 04:30:16PM +0200, Avi Kivity wrote:
> Matthew Wilcox wrote:
> >On Tue, Feb 24, 2009 at 12:47:38PM +0200, Avi Kivity wrote:
> >>Do those patches allow using a VF on the host (in other words, does the 
> >>kernel emulate config space accesses)?
> >
> >SR-IOV hardware handles config space accesses to virtual functions.  No
> >kernel changes needed for that aspect of it.
> 
> Patches 2 and 3 of the patchset that enables SR/IOV in kvm [1] suggest 
> that at the config space is only partially implemented.
> 
> [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/29034

I believe Red Hat are now members of the PCI SIG, you really should
download the SR-IOV spec that Yu Zhao keeps pointing at.  It's going to
be very hard to have a sensible discussion if you haven't read it.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-03-08 15:01       ` Matthew Wilcox
@ 2009-03-09  0:45         ` Greg KH
  0 siblings, 0 replies; 45+ messages in thread
From: Greg KH @ 2009-03-09  0:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Avi Kivity, Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel, Yang, Sheng

On Sun, Mar 08, 2009 at 09:01:09AM -0600, Matthew Wilcox wrote:
> On Sun, Mar 08, 2009 at 04:30:16PM +0200, Avi Kivity wrote:
> > Matthew Wilcox wrote:
> > >On Tue, Feb 24, 2009 at 12:47:38PM +0200, Avi Kivity wrote:
> > >>Do those patches allow using a VF on the host (in other words, does the 
> > >>kernel emulate config space accesses)?
> > >
> > >SR-IOV hardware handles config space accesses to virtual functions.  No
> > >kernel changes needed for that aspect of it.
> > 
> > Patches 2 and 3 of the patchset that enables SR/IOV in kvm [1] suggest 
> > that at the config space is only partially implemented.
> > 
> > [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/29034
> 
> I believe Red Hat are now members of the PCI SIG, you really should
> download the SR-IOV spec that Yu Zhao keeps pointing at.  It's going to
> be very hard to have a sensible discussion if you haven't read it.

But if any kernel developers are working for companies that are not
members of the PCI SIG, and wish to read the PCI specs, please let me
know as we are working to enable this to happen through the Linux
Foundation.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-03-08 14:30     ` Avi Kivity
  2009-03-08 15:01       ` Matthew Wilcox
@ 2009-03-09  3:42       ` Yang, Sheng
  2009-03-09  4:35         ` Yang, Sheng
  1 sibling, 1 reply; 45+ messages in thread
From: Yang, Sheng @ 2009-03-09  3:42 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Matthew Wilcox, Zhao, Yu, jbarnes, linux-pci, kvm, linux-kernel

On Sunday 08 March 2009 22:30:16 Avi Kivity wrote:
> Matthew Wilcox wrote:
> > On Tue, Feb 24, 2009 at 12:47:38PM +0200, Avi Kivity wrote:
> >> Do those patches allow using a VF on the host (in other words, does the
> >> kernel emulate config space accesses)?
> >
> > SR-IOV hardware handles config space accesses to virtual functions.  No
> > kernel changes needed for that aspect of it.
>
> Patches 2 and 3 of the patchset that enables SR/IOV in kvm [1] suggest
> that at the config space is only partially implemented.
>
> [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/29034

Hi Avi

For kernel side, patch 2 is not necessary. Because kernel would read VID/DID 
directly from pci_dev rather than configuration space, which have been set 
properly already.

And very sorry, for the patch 3. We haven't known exactly what's happened. I 
think the problem is caused by guest driver, but didn't confirm(and I have 
some misunderstandings with ZhaoYu for I thought we are agree on the reason, 
but after confirm with him, he didn't agree). I am doing more investigations 
to find the real cause.

-- 
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-03-09  3:42       ` Yang, Sheng
@ 2009-03-09  4:35         ` Yang, Sheng
  2009-03-09 13:45           ` Avi Kivity
  0 siblings, 1 reply; 45+ messages in thread
From: Yang, Sheng @ 2009-03-09  4:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Matthew Wilcox, Zhao, Yu, jbarnes, linux-pci, kvm, linux-kernel

On Monday 09 March 2009 11:42:05 Yang, Sheng wrote:
> On Sunday 08 March 2009 22:30:16 Avi Kivity wrote:
> > Matthew Wilcox wrote:
> > > On Tue, Feb 24, 2009 at 12:47:38PM +0200, Avi Kivity wrote:
> > >> Do those patches allow using a VF on the host (in other words, does
> > >> the kernel emulate config space accesses)?
> > >
> > > SR-IOV hardware handles config space accesses to virtual functions.  No
> > > kernel changes needed for that aspect of it.
> >
> > Patches 2 and 3 of the patchset that enables SR/IOV in kvm [1] suggest
> > that at the config space is only partially implemented.
> >
> > [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/29034
>
> Hi Avi
>
> For kernel side, patch 2 is not necessary. Because kernel would read
> VID/DID directly from pci_dev rather than configuration space, which have
> been set properly already.
>
> And very sorry, for the patch 3. We haven't known exactly what's happened.
> I think the problem is caused by guest driver, but didn't confirm(and I
> have some misunderstandings with ZhaoYu for I thought we are agree on the
> reason, but after confirm with him, he didn't agree). I am doing more
> investigations to find the real cause.

Found the reason of patch 3.

After insert guest driver module(vf driver), the driver would do a RMW to the 
command register to enable Bus Master bit(bit 2). And before that, MMIO bit 
have been set in the register. But without the patch 3, guest driver won't see 
the MMIO bit(bit 1), then just set 0x4 to the command register, with the side 
effect to unmap MMIO in QEmu. So patch 3 is needed(and what I thought before 
is right).

Unset the bit only affect the QEmu, which would unmap the mapping for MMIO. 
Kernel side don't need this, so it's OK.

-- 
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
  2009-03-06 20:08   ` Matthew Wilcox
  2009-03-06 22:03     ` Randy Dunlap
  2009-03-07  2:38     ` Greg KH
@ 2009-03-09  8:12     ` Yu Zhao
  2 siblings, 0 replies; 45+ messages in thread
From: Yu Zhao @ 2009-03-09  8:12 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Sat, Mar 07, 2009 at 04:08:10AM +0800, Matthew Wilcox wrote:
> On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote:
> > +config PCI_IOV
> > +	bool "PCI IOV support"
> > +	depends on PCI
> > +	select PCI_MSI
> 
> My understanding is that having 'select' of a config symbol that the
> user can choose is bad.  I think we should probably make this 'depends
> on PCI_MSI'.
> 
> PCI MSI can also be disabled at runtime (and Fedora do by default).
> Since SR-IOV really does require MSI, we need to put in a runtime check
> to see if pci_msi_enabled() is false.

Actually the SR-IOV doesn't really depend on the MSI (e.g. hardware doesn't
implement interrupt at all), but in most case the SR-IOV needs the MSI. The
selection is intended to make life easier. Anyway I'll remove it if people
want more flexibility (and possibility to break the PF driver).

> We don't depend on PCIEPORTBUS (a horribly named symbol).  Should we?
> SR-IOV is only supported for PCI Express machines.  I'm not sure of the
> right answer here, but I thought I should raise the question.

I think we don't need PCIe port bus framework. My understanding is it's for
those capabilities that want to share resources of the PCIe capability.

> > +	default n
> 
> You don't need this -- the default default is n ;-)
> 
> > +	help
> > +	  PCI-SIG I/O Virtualization (IOV) Specifications support.
> > +	  Single Root IOV: allows the Physical Function driver to enable
> > +	  the hardware capability, so the Virtual Function is accessible
> > +	  via the PCI Configuration Space using its own Bus, Device and
> > +	  Function Numbers. Each Virtual Function also has the PCI Memory
> > +	  Space to map the device specific register set.
> 
> I'm not convinced this is the most helpful we could be to the user who's
> configuring their own kernel.  How about something like this?  (Randy, I
> particularly look to you to make my prose less turgid).
> 
> 	help
> 	  IO Virtualisation is a PCI feature supported by some devices
> 	  which allows you to create virtual PCI devices and assign them
> 	  to guest OSes.  This option needs to be selected in the host
> 	  or Dom0 kernel, but does not need to be selected in the guest
> 	  or DomU kernel.  If you don't know whether your hardware supports
> 	  it, you can check by using lspci to look for the SR-IOV capability.
> 
> 	  If you have no idea what any of that means, it is safe to
> 	  answer 'N' here.
> 
> > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> > index 3d07ce2..ba99282 100644
> > --- a/drivers/pci/Makefile
> > +++ b/drivers/pci/Makefile
> > @@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
> >  
> >  obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o
> >  
> > +# PCI IOV support
> > +obj-$(CONFIG_PCI_IOV) += iov.o
> 
> I see you're following the gerneal style in this file, but the comments
> really add no value.  I should send a patch to take out the existing ones.
> 
> > +	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> > +		if (pdev->sriov)
> > +			break;
> > +	if (list_empty(&dev->bus->devices) || !pdev->sriov)
> > +		pdev = NULL;
> > +	ctrl = 0;
> > +	if (!pdev && pci_ari_enabled(dev->bus))
> > +		ctrl |= PCI_SRIOV_CTRL_ARI;
> > +
> 
> I don't like this loop.  At the end of a list_for_each_entry() loop,
> pdev will not be pointing at a pci_device, it'll be pointing to some
> offset from &dev->bus->devices.  So checking pdev->sriov at this point
> is really, really bad.  I would prefer to see something like this:
> 
>         ctrl = 0;
>         list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
>                 if (pdev->sriov)
>                         goto ari_enabled;
>         }
> 
>         if (pci_ari_enabled(dev->bus))
>                 ctrl = PCI_SRIOV_CTRL_ARI;
>  ari_enabled:
>         pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);

I guess I should put some comments here. What I want to do is to find the
lowest numbered PF (pdev) if it exists. It has ARI Capable Hierarchy bit,
as you have figured out, and it also keeps the VF bus lock. The lock is
for those VFs who belong to different PFs within a SR-IOV device and reside
on different bus (virtual) than PF's. When the PF driver enables/disables
the SR-IOV of a PF (this may happen anytime, not only at the driver probe
stage), the virtual VF bus will be allocated if it hasn't been allocated
yet. The lock guards the VF bus allocation between different PFs whose VFs
share the VF bus.

> > +	if (pdev)
> > +		iov->pdev = pci_dev_get(pdev);
> > +	else {
> > +		iov->pdev = dev;
> > +		mutex_init(&iov->lock);
> > +	}
> 
> Now I'm confused.  Why don't we need to init the mutex if there's another
> device on the same bus which also has an iov capability?

Yes, that's what it means :-)

> > +static void sriov_release(struct pci_dev *dev)
> > +{
> > +	if (dev == dev->sriov->pdev)
> > +		mutex_destroy(&dev->sriov->lock);
> > +	else
> > +		pci_dev_put(dev->sriov->pdev);
> > +
> > +	kfree(dev->sriov);
> > +	dev->sriov = NULL;
> > +}
> 
> > +void pci_iov_release(struct pci_dev *dev)
> > +{
> > +	if (dev->sriov)
> > +		sriov_release(dev);
> > +}
> 
> This seems to be a bit of a design pattern with you, and I'm not quite sure why you do it like this instead of just doing:
> 
> void pci_iov_release(struct pci_dev *dev)
> {
> 	if (!dev->sriov)
> 		return;
> 	[...]
> }

It's not my design pattern. I just want to leave some space for the MR-IOV,
which would look like:

void pci_iov_release(struct pci_dev *dev)
{
	if (dev->sriov)
		sriov_release(dev);

	if (dev->mriov)
		mriov_release(dev);
}

And that's why I put *_iov_* wrapper on those *_sriov_* functions.

Thank you for the careful review!
Yu

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device
  2009-03-06 20:20   ` Matthew Wilcox
@ 2009-03-09  8:13     ` Yu Zhao
  2009-03-09 18:09       ` Randy Dunlap
  0 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-03-09  8:13 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: jbarnes, linux-pci, kvm, linux-kernel

On Sat, Mar 07, 2009 at 04:20:24AM +0800, Matthew Wilcox wrote:
> On Fri, Feb 20, 2009 at 02:54:44PM +0800, Yu Zhao wrote:
> > +static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn)
> > +{
> > +	u16 bdf;
> > +
> > +	bdf = (dev->bus->number << 8) + dev->devfn +
> > +	      dev->sriov->offset + dev->sriov->stride * id;
> > +	*busnr = bdf >> 8;
> > +	*devfn = bdf & 0xff;
> > +}
> 
> I find the interface here a bit clunky -- a function returning void
> while having two OUT parameters.  How about this variation on the theme
> (viewers are encouraged to come up with their own preferred
> implementations and interfaces):
> 
> static inline __pure u16 virtfn_bdf(struct pci_dev *dev, int id)
> {
> 	return (dev->bus->number << 8) + dev->devfn + dev->sriov->offset +
> 		dev->sriov->stride * id;
> }
> 
> #define VIRT_BUS(dev, id)	(virtfn_bdf(dev, id) >> 8)
> #define VIRT_DEVFN(dev, id)	(virtfn_bdf(dev, id) & 0xff)
> 
> We rely on GCC to do CSE and not actually invoke virtfn_bdf more than
> once.

Yes, that's a good idea. Will replace that function with macros.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
  2009-03-06 20:37   ` Matthew Wilcox
  2009-03-06 21:48     ` Randy Dunlap
  2009-03-07  2:40     ` Greg KH
@ 2009-03-09  8:25     ` Yu Zhao
  2009-03-09 19:39       ` Greg KH
  2 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-03-09  8:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Sat, Mar 07, 2009 at 04:37:18AM +0800, Matthew Wilcox wrote:
> On Fri, Feb 20, 2009 at 02:54:45PM +0800, Yu Zhao wrote:
> > +	virtfn->sysdata = dev->bus->sysdata;
> > +	virtfn->dev.parent = dev->dev.parent;
> > +	virtfn->dev.bus = dev->dev.bus;
> > +	virtfn->devfn = devfn;
> > +	virtfn->hdr_type = PCI_HEADER_TYPE_NORMAL;
> > +	virtfn->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
> > +	virtfn->error_state = pci_channel_io_normal;
> > +	virtfn->current_state = PCI_UNKNOWN;
> > +	virtfn->is_pcie = 1;
> > +	virtfn->pcie_type = PCI_EXP_TYPE_ENDPOINT;
> > +	virtfn->dma_mask = 0xffffffff;
> > +	virtfn->vendor = dev->vendor;
> > +	virtfn->subsystem_vendor = dev->subsystem_vendor;
> > +	virtfn->class = dev->class;
> 
> There seems to be a certain amount of commonality between this and
> pci_scan_device().  Have you considered trying to make a common helper
> function, or does it not work out well?

It's doable. Will enhance the pci_setup_device and use it to setup the VF.

> > +	pci_device_add(virtfn, virtfn->bus);
> 
> Greg is probably going to ding you here for adding the device, then
> creating the symlinks.  I believe it's now best practice to create the
> symlinks first, so there's no window where userspace can get confused.

Yes, but unfortunately we can't create links before adding a device.
I double checked device_add(), there is no place for those links to be
created before it sends uevent. So for now, we have to trigger another
uevent for those links.

> > +	mutex_unlock(&iov->pdev->sriov->lock);
> 
> I question the existance of this mutex now.  What's it protecting?
> 
> Aren't we going to be implicitly protected by virtue of the Physical
> Function device driver being the only one calling this function, and the
> driver will be calling it from the ->probe routine which is not called
> simultaneously for the same device.

The PF driver patches I listed before support dynamical enabling/disabling
of the SR-IOV through sysfs interface. So we have to protect the VF bus
allocation as I explained before.

> > +	virtfn->physfn = pci_dev_get(dev);
> > +
> > +	rc = pci_bus_add_device(virtfn);
> > +	if (rc)
> > +		goto failed1;
> > +	sprintf(buf, "%d", id);
> 
> %u, perhaps?  And maybe 'id' should always be unsigned?  Just a thought.

Yes, will replace %d to %u.

> > +	rc = sysfs_create_link(&iov->dev.kobj, &virtfn->dev.kobj, buf);
> > +	if (rc)
> > +		goto failed1;
> > +	rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn");
> > +	if (rc)
> > +		goto failed2;
> 
> I'm glad to see these symlinks documented in later patches!
> 
> > +	nres = 0;
> > +	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> > +		res = dev->resource + PCI_SRIOV_RESOURCES + i;
> > +		if (!res->parent)
> > +			continue;
> > +		nres++;
> > +	}
> 
> Can't this be written more simply as:
> 
> 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> 		res = dev->resource + PCI_SRIOV_RESOURCES + i;
> 		if (res->parent)
> 			nres++;
> 	}

Yes, will do

> ?
> 
> > +	if (nres != iov->nres) {
> > +		dev_err(&dev->dev, "no enough MMIO for SR-IOV\n");
> > +		return -ENOMEM;
> > +	}
> 
> Randy, can you help us out with better wording here?
> 
> > +		dev_err(&dev->dev, "no enough bus range for SR-IOV\n");
> 
> and here.
> 
> > +	if (iov->link != dev->devfn) {
> > +		rc = -ENODEV;
> > +		list_for_each_entry(link, &dev->bus->devices, bus_list) {
> > +			if (link->sriov && link->devfn == iov->link)
> > +				rc = sysfs_create_link(&iov->dev.kobj,
> > +						&link->dev.kobj, "dep_link");
> 
> I skipped to the end and read patch 7/7 and I still don't understand
> what dep_link is for.  Can you explain please?  In particular, how is it
> different from physfn?

It's defined by spec as:

3.3.8. Function Dependency Link (12h)
The programming model for a Device may have vendor specific dependencies
between sets of Functions. The Function Dependency Link field is used to
describe these dependencies. This field describes dependencies between PFs.
VF dependencies are the same as the dependencies of their associated PFs.
If a PF is independent from other PFs of a Device, this field shall
contain its own Function Number. If a PF is dependent on other PFs of a
Device, this field shall contain the Function Number of the next PF in
the same Function Dependency List. The last PF in a Function Dependency
List shall contain the Function Number of the first PF in the Function
Dependency List. If PF p and PF q are in the same Function Dependency
List, than any SI that is assigned VF p,n shall also be assigned to VF q,n.

Thanks,
Yu

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration
  2009-03-06 21:13   ` Matthew Wilcox
@ 2009-03-09  8:28     ` Yu Zhao
  0 siblings, 0 replies; 45+ messages in thread
From: Yu Zhao @ 2009-03-09  8:28 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: jbarnes, linux-pci, kvm, linux-kernel

On Sat, Mar 07, 2009 at 05:13:41AM +0800, Matthew Wilcox wrote:
> On Fri, Feb 20, 2009 at 02:54:46PM +0800, Yu Zhao wrote:
> > +static int sriov_migration(struct pci_dev *dev)
> > +{
> > +	u16 status;
> > +	struct pci_sriov *iov = dev->sriov;
> > +
> > +	if (!iov->nr_virtfn)
> > +		return 0;
> > +
> > +	if (!(iov->cap & PCI_SRIOV_CAP_VFM))
> > +		return 0;
> > +
> > +	pci_read_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, &status);
> 
> You passed in dev here, you don't need to use iov->self, right?

Will do.

> > +	if (!(status & PCI_SRIOV_STATUS_VFM))
> > +		return 0;
> > +
> > +	schedule_work(&iov->mtask);
> > +
> > +	return 1;
> > +}
> 
> > +/**
> > + * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration
> > + * @dev: the PCI device
> > + *
> > + * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not.
> > + *
> > + * Physical Function driver is responsible to register IRQ handler using
> > + * VF Migration Interrupt Message Number, and call this function when the
> > + * interrupt is generated by the hardware.
> > + */
> > +irqreturn_t pci_sriov_migration(struct pci_dev *dev)
> > +{
> > +	if (!dev->sriov)
> > +		return IRQ_NONE;
> > +
> > +	return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_sriov_migration);
> 
> OK, I think I get it -- you've basically written an interrupt handler
> for the driver to call from its interrupt handler.  Am I right in
> thinking that the reason the driver needs to do the interrupt handler
> here is because we don't currently have an interface that looks like:
> 
> int pci_get_msix_interrupt(struct pci_dev *dev, unsigned vector);
> 
> ?  If so, we should probably add it; I want it for my MSI-X rewrite
> anyway.

Right, we really need this function. But I guess we still have to keep the
handler in case the PF only has MSI, right?

Thanks,
Yu

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
  2009-03-06 21:48     ` Randy Dunlap
@ 2009-03-09  8:29       ` Yu Zhao
  0 siblings, 0 replies; 45+ messages in thread
From: Yu Zhao @ 2009-03-09  8:29 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Matthew Wilcox, jbarnes, linux-pci, kvm, linux-kernel

Thanks a lot, Randy!

On Sat, Mar 07, 2009 at 05:48:33AM +0800, Randy Dunlap wrote:
> Matthew Wilcox wrote:
> > On Fri, Feb 20, 2009 at 02:54:45PM +0800, Yu Zhao wrote:
> > 
> >> +	if (nres != iov->nres) {
> >> +		dev_err(&dev->dev, "no enough MMIO for SR-IOV\n");
> >> +		return -ENOMEM;
> >> +	}
> 
> 		"not enough MMIO BARs for SR-IOV"
> 	or
> 		"not enough MMIO resources for SR-IOV"
> 	or
> 		"too few MMIO BARs for SR-IOV"
> ?
> 
> > Randy, can you help us out with better wording here?
> > 
> >> +		dev_err(&dev->dev, "no enough bus range for SR-IOV\n");
> > 
> > and here.
> 
> 		"SR-IOV: bus number too large"
> 	or
> 		"SR-IOV: bus number out of range"
> 	or
> 		"SR-IOV: cannot allocate valid bus number"
> ?
> 
> >> +	if (iov->link != dev->devfn) {
> >> +		rc = -ENODEV;
> >> +		list_for_each_entry(link, &dev->bus->devices, bus_list) {
> >> +			if (link->sriov && link->devfn == iov->link)
> >> +				rc = sysfs_create_link(&iov->dev.kobj,
> >> +						&link->dev.kobj, "dep_link");

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-03-09  4:35         ` Yang, Sheng
@ 2009-03-09 13:45           ` Avi Kivity
  0 siblings, 0 replies; 45+ messages in thread
From: Avi Kivity @ 2009-03-09 13:45 UTC (permalink / raw)
  To: Yang, Sheng
  Cc: Matthew Wilcox, Zhao, Yu, jbarnes, linux-pci, kvm, linux-kernel

Yang, Sheng wrote:
>>> Patches 2 and 3 of the patchset that enables SR/IOV in kvm [1] suggest
>>> that at the config space is only partially implemented.
>>>
>>> [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/29034
>>>       
>> Hi Avi
>>
>> For kernel side, patch 2 is not necessary. Because kernel would read
>> VID/DID directly from pci_dev rather than configuration space, which have
>> been set properly already.
>>
>> And very sorry, for the patch 3. We haven't known exactly what's happened.
>> I think the problem is caused by guest driver, but didn't confirm(and I
>> have some misunderstandings with ZhaoYu for I thought we are agree on the
>> reason, but after confirm with him, he didn't agree). I am doing more
>> investigations to find the real cause.
>>     
>
> Found the reason of patch 3.
>
> After insert guest driver module(vf driver), the driver would do a RMW to the 
> command register to enable Bus Master bit(bit 2). And before that, MMIO bit 
> have been set in the register. But without the patch 3, guest driver won't see 
> the MMIO bit(bit 1), then just set 0x4 to the command register, with the side 
> effect to unmap MMIO in QEmu. So patch 3 is needed(and what I thought before 
> is right).
>
> Unset the bit only affect the QEmu, which would unmap the mapping for MMIO. 
> Kernel side don't need this, so it's OK.
>   

Thanks for the explanations!

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device
  2009-03-09  8:13     ` Yu Zhao
@ 2009-03-09 18:09       ` Randy Dunlap
  2009-03-09 18:11         ` Matthew Wilcox
  0 siblings, 1 reply; 45+ messages in thread
From: Randy Dunlap @ 2009-03-09 18:09 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Matthew Wilcox, jbarnes, linux-pci, kvm, linux-kernel

Yu Zhao wrote:
> On Sat, Mar 07, 2009 at 04:20:24AM +0800, Matthew Wilcox wrote:
>> On Fri, Feb 20, 2009 at 02:54:44PM +0800, Yu Zhao wrote:
>>> +static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn)
>>> +{
>>> +	u16 bdf;
>>> +
>>> +	bdf = (dev->bus->number << 8) + dev->devfn +
>>> +	      dev->sriov->offset + dev->sriov->stride * id;
>>> +	*busnr = bdf >> 8;
>>> +	*devfn = bdf & 0xff;
>>> +}
>> I find the interface here a bit clunky -- a function returning void
>> while having two OUT parameters.  How about this variation on the theme
>> (viewers are encouraged to come up with their own preferred
>> implementations and interfaces):
>>
>> static inline __pure u16 virtfn_bdf(struct pci_dev *dev, int id)
>> {
>> 	return (dev->bus->number << 8) + dev->devfn + dev->sriov->offset +
>> 		dev->sriov->stride * id;
>> }
>>
>> #define VIRT_BUS(dev, id)	(virtfn_bdf(dev, id) >> 8)
>> #define VIRT_DEVFN(dev, id)	(virtfn_bdf(dev, id) & 0xff)
>>
>> We rely on GCC to do CSE and not actually invoke virtfn_bdf more than
>> once.
> 
> Yes, that's a good idea. Will replace that function with macros.

That's the opposite of most changes lately.  I.e., functions (with
typechecking) are preferred.

-- 
~Randy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device
  2009-03-09 18:09       ` Randy Dunlap
@ 2009-03-09 18:11         ` Matthew Wilcox
  0 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2009-03-09 18:11 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Yu Zhao, jbarnes, linux-pci, kvm, linux-kernel

On Mon, Mar 09, 2009 at 11:09:02AM -0700, Randy Dunlap wrote:
> >> static inline __pure u16 virtfn_bdf(struct pci_dev *dev, int id)
> >> {
> >> 	return (dev->bus->number << 8) + dev->devfn + dev->sriov->offset +
> >> 		dev->sriov->stride * id;
> >> }
> >>
> >> #define VIRT_BUS(dev, id)	(virtfn_bdf(dev, id) >> 8)
> >> #define VIRT_DEVFN(dev, id)	(virtfn_bdf(dev, id) & 0xff)
> >>
> >> We rely on GCC to do CSE and not actually invoke virtfn_bdf more than
> >> once.
> > 
> > Yes, that's a good idea. Will replace that function with macros.
> 
> That's the opposite of most changes lately.  I.e., functions (with
> typechecking) are preferred.

There's every bit as much typechecking with the approach I outlined as
there was with the original.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
  2009-03-09  8:25     ` Yu Zhao
@ 2009-03-09 19:39       ` Greg KH
  2009-03-10  1:37         ` Yu Zhao
  0 siblings, 1 reply; 45+ messages in thread
From: Greg KH @ 2009-03-09 19:39 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Matthew Wilcox, jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Mon, Mar 09, 2009 at 04:25:05PM +0800, Yu Zhao wrote:
> > > +	pci_device_add(virtfn, virtfn->bus);
> > 
> > Greg is probably going to ding you here for adding the device, then
> > creating the symlinks.  I believe it's now best practice to create the
> > symlinks first, so there's no window where userspace can get confused.
> 
> Yes, but unfortunately we can't create links before adding a device.
> I double checked device_add(), there is no place for those links to be
> created before it sends uevent. So for now, we have to trigger another
> uevent for those links.

What exactly are you trying to do with a symlink here that you need to
do it this way?  I vaguely remember you mentioning this in the past, but
I thought you had dropped the symlinks after our conversation about this
very problem.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support
  2009-03-07  2:34   ` Greg KH
@ 2009-03-10  1:11     ` Yu Zhao
  0 siblings, 0 replies; 45+ messages in thread
From: Yu Zhao @ 2009-03-10  1:11 UTC (permalink / raw)
  To: Greg KH; +Cc: Matthew Wilcox, jbarnes, linux-pci, kvm, linux-kernel

On Sat, Mar 07, 2009 at 10:34:54AM +0800, Greg KH wrote:
> On Fri, Mar 06, 2009 at 12:44:11PM -0700, Matthew Wilcox wrote:
> > > Physical Function driver patches for Intel 82576 NIC are available:
> > >   http://patchwork.kernel.org/patch/8063/
> > >   http://patchwork.kernel.org/patch/8064/
> > >   http://patchwork.kernel.org/patch/8065/
> > >   http://patchwork.kernel.org/patch/8066/
> > 
> > I need to review this driver; I haven't done that yet.  Has anyone else?
> 
> The driver was rejected by the upstream developers, who said it would
> never be accepted.

Sorry I didn't make it clear. These Physical Function driver patches are
new ones that have been accepted by David Miller (net-next-2.6).
The old ones I sent last time are for demonstration purpose, and won't
be in any upstream trees.

Thanks,
Yu

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
  2009-03-07  2:38     ` Greg KH
@ 2009-03-10  1:19       ` Yu Zhao
  2009-03-11  4:36         ` Greg KH
  0 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-03-10  1:19 UTC (permalink / raw)
  To: Greg KH
  Cc: Matthew Wilcox, jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Sat, Mar 07, 2009 at 10:38:45AM +0800, Greg KH wrote:
> On Fri, Mar 06, 2009 at 01:08:10PM -0700, Matthew Wilcox wrote:
> > On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote:
> > > +	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> > > +		if (pdev->sriov)
> > > +			break;
> > > +	if (list_empty(&dev->bus->devices) || !pdev->sriov)
> > > +		pdev = NULL;
> > > +	ctrl = 0;
> > > +	if (!pdev && pci_ari_enabled(dev->bus))
> > > +		ctrl |= PCI_SRIOV_CTRL_ARI;
> > > +
> > 
> > I don't like this loop.  At the end of a list_for_each_entry() loop,
> > pdev will not be pointing at a pci_device, it'll be pointing to some
> > offset from &dev->bus->devices.  So checking pdev->sriov at this point
> > is really, really bad.  I would prefer to see something like this:
> > 
> >         ctrl = 0;
> >         list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
> >                 if (pdev->sriov)
> >                         goto ari_enabled;
> >         }
> > 
> >         if (pci_ari_enabled(dev->bus))
> >                 ctrl = PCI_SRIOV_CTRL_ARI;
> >  ari_enabled:
> >         pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
> 
> No, please use bus_for_each_dev() instead, or bus_find_device(), don't
> walk the bus list by hand.  I'm kind of surprised that even builds.  Hm,
> in looking at the 2.6.29-rc kernels, I notice it will not even build at
> all, you are now forced to use those functions, which is good.

The devices haven't been added at this time, so we can't use
bus_for_each_dev(). I guess that's why the `bus->devices' exists, and
actually pci_bus_add_devices() walks the bus list same way to retrieve
the devices and add them.

Thanks,
Yu

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
  2009-03-09 19:39       ` Greg KH
@ 2009-03-10  1:37         ` Yu Zhao
  2009-03-11  4:34           ` Greg KH
  0 siblings, 1 reply; 45+ messages in thread
From: Yu Zhao @ 2009-03-10  1:37 UTC (permalink / raw)
  To: Greg KH
  Cc: Matthew Wilcox, jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Tue, Mar 10, 2009 at 03:39:01AM +0800, Greg KH wrote:
> On Mon, Mar 09, 2009 at 04:25:05PM +0800, Yu Zhao wrote:
> > > > +	pci_device_add(virtfn, virtfn->bus);
> > > 
> > > Greg is probably going to ding you here for adding the device, then
> > > creating the symlinks.  I believe it's now best practice to create the
> > > symlinks first, so there's no window where userspace can get confused.
> > 
> > Yes, but unfortunately we can't create links before adding a device.
> > I double checked device_add(), there is no place for those links to be
> > created before it sends uevent. So for now, we have to trigger another
> > uevent for those links.
> 
> What exactly are you trying to do with a symlink here that you need to
> do it this way?  I vaguely remember you mentioning this in the past, but
> I thought you had dropped the symlinks after our conversation about this
> very problem.

I'd like to create some symlinks to reflect the relationship between
Physical Function and its associated Virtual Functions. The Physical
Function is like a master device that controls the allocation of its
Virtual Functions and owns the device physical resource. The Virtual
Functions are like slave devices of the Physical Function. For example,
if 01:00.0 is a Physical Function and 02:00.0 is a Virtual Function
associated with 01:00.0. Then the symlinks (virtfnN and physfn) would
look like:

  $ ls -l /sys/bus/pci/devices/0000:01:00.0/
  ...
  ...  virtfn0 -> ../0000:02:00.0
  ...  virtfn1 -> ../0000:02:00.1
  ...  virtfn2 -> ../0000:02:00.2
  ...

  $ ls -l /sys/bus/pci/devices/0000:02:00.0/
  ...
  ... physfn -> ../0000:01:00.0
  ...

This is very useful for userspace applications, both KVM and Xen need
to know this kind of relationship so they can request the permission
from a Physical Function before using its associated Virtual Functions.

Thanks,
Yu

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver
  2009-03-10  1:37         ` Yu Zhao
@ 2009-03-11  4:34           ` Greg KH
  0 siblings, 0 replies; 45+ messages in thread
From: Greg KH @ 2009-03-11  4:34 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Matthew Wilcox, jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Tue, Mar 10, 2009 at 09:37:53AM +0800, Yu Zhao wrote:
> On Tue, Mar 10, 2009 at 03:39:01AM +0800, Greg KH wrote:
> > On Mon, Mar 09, 2009 at 04:25:05PM +0800, Yu Zhao wrote:
> > > > > +	pci_device_add(virtfn, virtfn->bus);
> > > > 
> > > > Greg is probably going to ding you here for adding the device, then
> > > > creating the symlinks.  I believe it's now best practice to create the
> > > > symlinks first, so there's no window where userspace can get confused.
> > > 
> > > Yes, but unfortunately we can't create links before adding a device.
> > > I double checked device_add(), there is no place for those links to be
> > > created before it sends uevent. So for now, we have to trigger another
> > > uevent for those links.
> > 
> > What exactly are you trying to do with a symlink here that you need to
> > do it this way?  I vaguely remember you mentioning this in the past, but
> > I thought you had dropped the symlinks after our conversation about this
> > very problem.
> 
> I'd like to create some symlinks to reflect the relationship between
> Physical Function and its associated Virtual Functions. The Physical
> Function is like a master device that controls the allocation of its
> Virtual Functions and owns the device physical resource. The Virtual
> Functions are like slave devices of the Physical Function. For example,
> if 01:00.0 is a Physical Function and 02:00.0 is a Virtual Function
> associated with 01:00.0. Then the symlinks (virtfnN and physfn) would
> look like:
> 
>   $ ls -l /sys/bus/pci/devices/0000:01:00.0/
>   ...
>   ...  virtfn0 -> ../0000:02:00.0
>   ...  virtfn1 -> ../0000:02:00.1
>   ...  virtfn2 -> ../0000:02:00.2
>   ...
> 
>   $ ls -l /sys/bus/pci/devices/0000:02:00.0/
>   ...
>   ... physfn -> ../0000:01:00.0
>   ...
> 
> This is very useful for userspace applications, both KVM and Xen need
> to know this kind of relationship so they can request the permission
> from a Physical Function before using its associated Virtual Functions.

Ok, but then make sure you never rely on a udev rule or notifier to see
these symlinks when the device is added to the kernel, as there will be
a nice race condition there :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v10 1/7] PCI: initialize and release SR-IOV capability
  2009-03-10  1:19       ` Yu Zhao
@ 2009-03-11  4:36         ` Greg KH
  0 siblings, 0 replies; 45+ messages in thread
From: Greg KH @ 2009-03-11  4:36 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Matthew Wilcox, jbarnes, linux-pci, kvm, linux-kernel, Randy.Dunlap

On Tue, Mar 10, 2009 at 09:19:44AM +0800, Yu Zhao wrote:
> On Sat, Mar 07, 2009 at 10:38:45AM +0800, Greg KH wrote:
> > On Fri, Mar 06, 2009 at 01:08:10PM -0700, Matthew Wilcox wrote:
> > > On Fri, Feb 20, 2009 at 02:54:42PM +0800, Yu Zhao wrote:
> > > > +	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> > > > +		if (pdev->sriov)
> > > > +			break;
> > > > +	if (list_empty(&dev->bus->devices) || !pdev->sriov)
> > > > +		pdev = NULL;
> > > > +	ctrl = 0;
> > > > +	if (!pdev && pci_ari_enabled(dev->bus))
> > > > +		ctrl |= PCI_SRIOV_CTRL_ARI;
> > > > +
> > > 
> > > I don't like this loop.  At the end of a list_for_each_entry() loop,
> > > pdev will not be pointing at a pci_device, it'll be pointing to some
> > > offset from &dev->bus->devices.  So checking pdev->sriov at this point
> > > is really, really bad.  I would prefer to see something like this:
> > > 
> > >         ctrl = 0;
> > >         list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
> > >                 if (pdev->sriov)
> > >                         goto ari_enabled;
> > >         }
> > > 
> > >         if (pci_ari_enabled(dev->bus))
> > >                 ctrl = PCI_SRIOV_CTRL_ARI;
> > >  ari_enabled:
> > >         pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
> > 
> > No, please use bus_for_each_dev() instead, or bus_find_device(), don't
> > walk the bus list by hand.  I'm kind of surprised that even builds.  Hm,
> > in looking at the 2.6.29-rc kernels, I notice it will not even build at
> > all, you are now forced to use those functions, which is good.
> 
> The devices haven't been added at this time, so we can't use
> bus_for_each_dev(). I guess that's why the `bus->devices' exists, and
> actually pci_bus_add_devices() walks the bus list same way to retrieve
> the devices and add them.

ah, this is struct pci_bus, not struct bus_type, my mistake.

sorry for the noise,

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2009-03-11  4:38 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-20  6:54 [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Yu Zhao
2009-02-20  6:54 ` [PATCH v10 1/7] PCI: initialize and release SR-IOV capability Yu Zhao
2009-03-06 20:08   ` Matthew Wilcox
2009-03-06 22:03     ` Randy Dunlap
2009-03-06 23:31       ` Duyck, Alexander H
2009-03-07  2:38     ` Greg KH
2009-03-10  1:19       ` Yu Zhao
2009-03-11  4:36         ` Greg KH
2009-03-09  8:12     ` Yu Zhao
2009-02-20  6:54 ` [PATCH v10 2/7] PCI: restore saved SR-IOV state Yu Zhao
2009-03-06 20:09   ` Matthew Wilcox
2009-02-20  6:54 ` [PATCH v10 3/7] PCI: reserve bus range for SR-IOV device Yu Zhao
2009-03-06 20:20   ` Matthew Wilcox
2009-03-09  8:13     ` Yu Zhao
2009-03-09 18:09       ` Randy Dunlap
2009-03-09 18:11         ` Matthew Wilcox
2009-02-20  6:54 ` [PATCH v10 4/7] PCI: add SR-IOV API for Physical Function driver Yu Zhao
2009-03-06 20:37   ` Matthew Wilcox
2009-03-06 21:48     ` Randy Dunlap
2009-03-09  8:29       ` Yu Zhao
2009-03-07  2:40     ` Greg KH
2009-03-09  8:25     ` Yu Zhao
2009-03-09 19:39       ` Greg KH
2009-03-10  1:37         ` Yu Zhao
2009-03-11  4:34           ` Greg KH
2009-02-20  6:54 ` [PATCH v10 5/7] PCI: handle SR-IOV Virtual Function Migration Yu Zhao
2009-03-06 21:13   ` Matthew Wilcox
2009-03-09  8:28     ` Yu Zhao
2009-02-20  6:54 ` [PATCH v10 6/7] PCI: document SR-IOV sysfs entries Yu Zhao
2009-03-06 21:16   ` Matthew Wilcox
2009-03-06 22:35     ` Randy Dunlap
2009-02-20  6:54 ` [PATCH v10 7/7] PCI: manual for SR-IOV user and driver developer Yu Zhao
2009-03-06 21:17   ` Matthew Wilcox
2009-02-24 10:47 ` [PATCH v10 0/7] PCI: Linux kernel SR-IOV support Avi Kivity
2009-02-25  1:36   ` Yu Zhao
2009-03-06 19:33   ` Matthew Wilcox
2009-03-08 14:30     ` Avi Kivity
2009-03-08 15:01       ` Matthew Wilcox
2009-03-09  0:45         ` Greg KH
2009-03-09  3:42       ` Yang, Sheng
2009-03-09  4:35         ` Yang, Sheng
2009-03-09 13:45           ` Avi Kivity
2009-03-06 19:44 ` Matthew Wilcox
2009-03-07  2:34   ` Greg KH
2009-03-10  1:11     ` Yu Zhao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.