All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v4 0/3] pcie: Add support for Single Root I/O Virtualization
@ 2015-09-12 12:36 Knut Omang
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices Knut Omang
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Knut Omang @ 2015-09-12 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Knut Omang,
	Richard W.M. Jones, Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

This patch set implements generic support for SR/IOV as an extension to the
core PCIe functionality, similar to the way other capabilities such as AER
is implemented.

There is no implementation of any device that provides
SR/IOV support included, but I have implemented a test
example which can be found together with this patch set here:

  git://github.com/knuto/qemu.git sriov_patches_v4

Testing with the example device was documented here:

  http://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg05110.html

Changes since v3:
  - Reworked 'pci: Update pci_regs header' to merge kernel version improvements
    with the current qemu version instead of copying from the kernel version.

Changes since v2:
  - Rebased onto 090d0bfd
  - Un-qdev'ified - avoids issues when resetting NUM_VFS
  - Fixed handling of vf_offset/vf_stride

Changes since v1:
  - Rebased on top of latest master, eliminating prereqs.
  - Implement proper support for VF_STRIDE, VF_OFFSET and SUP_PGSIZE
    Time better spent fixing it than explaining what the previous
    limitations were.
    - Added new first patch to fix pci bug related to this
  - Split out patch to pci_default_config_write to a separate patch 2
    to highlight bug fix.
  - Refactored out logic into new source files
    hw/pci/pcie_sriov.c include/hw/pci/pcie_sriov.h
    similar to pcie_aer.c/h.
  - Rename functions and introduce structs to better separate
    pf and vf functionality.
  - Replaced is_vf member with pci_is_vf() function abstraction
  - Fix numerous syntax, whitespace and comment issues
    according to Michael's review.
  - Fix memory leaks.
  - Removed igb example device - a rebased version available
    on github instead.

Knut Omang (3):
  pci: Make use of the devfn property when registering new devices
  pci: Update pci_regs header
  pcie: Add support for Single Root I/O Virtualization (SR/IOV)

 hw/pci/Makefile.objs                      |   2 +-
 hw/pci/pci.c                              | 101 ++++++++---
 hw/pci/pcie.c                             |   9 +-
 hw/pci/pcie_sriov.c                       | 271 ++++++++++++++++++++++++++++++
 include/hw/pci/pci.h                      |  11 +-
 include/hw/pci/pcie.h                     |   6 +
 include/hw/pci/pcie_sriov.h               |  55 ++++++
 include/qemu/typedefs.h                   |   2 +
 include/standard-headers/linux/pci_regs.h | 142 +++++++++++++---
 9 files changed, 545 insertions(+), 54 deletions(-)
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

--
2.4.3

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices
  2015-09-12 12:36 [Qemu-devel] [PATCH v4 0/3] pcie: Add support for Single Root I/O Virtualization Knut Omang
@ 2015-09-12 12:36 ` Knut Omang
  2015-09-17 11:11   ` Marcel Apfelbaum
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 2/3] pci: Update pci_regs header Knut Omang
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Knut Omang
  2 siblings, 1 reply; 16+ messages in thread
From: Knut Omang @ 2015-09-12 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Knut Omang,
	Richard W.M. Jones, Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

Without this, the devfn argument to pci_create_*()
does not affect the assigned devfn.

Needed to support (VF_STRIDE,VF_OFFSET) values other than (1,1)
for SR/IOV.

Signed-off-by: Knut Omang <knut.omang@oracle.com>
---
 hw/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index ccea628..a5cc015 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1840,7 +1840,7 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
     bus = PCI_BUS(qdev_get_parent_bus(qdev));
     pci_dev = do_pci_register_device(pci_dev, bus,
                                      object_get_typename(OBJECT(qdev)),
-                                     pci_dev->devfn, errp);
+                                     object_property_get_int(OBJECT(qdev), "addr", NULL), errp);
     if (pci_dev == NULL)
         return;
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 2/3] pci: Update pci_regs header
  2015-09-12 12:36 [Qemu-devel] [PATCH v4 0/3] pcie: Add support for Single Root I/O Virtualization Knut Omang
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices Knut Omang
@ 2015-09-12 12:36 ` Knut Omang
  2015-10-07 13:32   ` Marcel Apfelbaum
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Knut Omang
  2 siblings, 1 reply; 16+ messages in thread
From: Knut Omang @ 2015-09-12 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Knut Omang,
	Richard W.M. Jones, Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

Merge in new definitions from kernel v4.2
Adds definition necessary to support emulated SR/IOV.

Signed-off-by: Knut Omang <knut.omang@oracle.com>
---
 include/standard-headers/linux/pci_regs.h | 142 +++++++++++++++++++++++++-----
 1 file changed, 118 insertions(+), 24 deletions(-)

diff --git a/include/standard-headers/linux/pci_regs.h b/include/standard-headers/linux/pci_regs.h
index 57e8c80..3d56c1e 100644
--- a/include/standard-headers/linux/pci_regs.h
+++ b/include/standard-headers/linux/pci_regs.h
@@ -304,13 +304,14 @@
 #define PCI_MSI_PENDING_64	20	/* Pending bits register for 32-bit devices */
 
 /* MSI-X registers */
-#define PCI_MSIX_FLAGS		2
-#define  PCI_MSIX_FLAGS_QSIZE	0x7FF
-#define  PCI_MSIX_FLAGS_ENABLE	(1 << 15)
-#define  PCI_MSIX_FLAGS_MASKALL	(1 << 14)
-#define PCI_MSIX_TABLE		4
-#define PCI_MSIX_PBA		8
+#define PCI_MSIX_FLAGS		2	/* Message Control */
+#define  PCI_MSIX_FLAGS_QSIZE	0x7FF	/* Table size */
+#define  PCI_MSIX_FLAGS_ENABLE	(1 << 15) /* MSI-X enable */
+#define  PCI_MSIX_FLAGS_MASKALL	(1 << 14) /* Mask all vectors for this function */
+#define PCI_MSIX_TABLE		4	/* Table offset */
+#define PCI_MSIX_PBA		8	/* Pending Bit Array offset */
 #define  PCI_MSIX_FLAGS_BIRMASK	(7 << 0)
+#define PCI_CAP_MSIX_SIZEOF	12	/* size of MSIX capability structure */
 
 /* MSI-X entry's format */
 #define PCI_MSIX_ENTRY_SIZE		16
@@ -517,31 +518,63 @@
 #define  PCI_EXP_OBFF_MSG	0x40000 /* New message signaling */
 #define  PCI_EXP_OBFF_WAKE	0x80000 /* Re-use WAKE# for OBFF */
 #define PCI_EXP_DEVCTL2		40	/* Device Control 2 */
-#define  PCI_EXP_DEVCTL2_ARI	0x20	/* Alternative Routing-ID */
-#define  PCI_EXP_IDO_REQ_EN	0x100	/* ID-based ordering request enable */
-#define  PCI_EXP_IDO_CMP_EN	0x200	/* ID-based ordering completion enable */
-#define  PCI_EXP_LTR_EN		0x400	/* Latency tolerance reporting */
-#define  PCI_EXP_OBFF_MSGA_EN	0x2000	/* OBFF enable with Message type A */
-#define  PCI_EXP_OBFF_MSGB_EN	0x4000	/* OBFF enable with Message type B */
-#define  PCI_EXP_OBFF_WAKE_EN	0x6000	/* OBFF using WAKE# signaling */
+#define  PCI_EXP_DEVCTL2_COMP_TIMEOUT	0x000f	/* Completion Timeout Value */
+#define  PCI_EXP_DEVCTL2_ARI		0x0020	/* Alternative Routing-ID */
+#define  PCI_EXP_DEVCTL2_IDO_REQ_EN	0x0100	/* Allow IDO for requests */
+#define  PCI_EXP_DEVCTL2_IDO_CMP_EN	0x0200	/* Allow IDO for completions */
+#define  PCI_EXP_DEVCTL2_LTR_EN		0x0400	/* Enable LTR mechanism */
+#define  PCI_EXP_DEVCTL2_OBFF_MSGA_EN	0x2000	/* Enable OBFF Message type A */
+#define  PCI_EXP_DEVCTL2_OBFF_MSGB_EN	0x4000	/* Enable OBFF Message type B */
+#define  PCI_EXP_DEVCTL2_OBFF_WAKE_EN	0x6000	/* OBFF using WAKE# signaling */
+#define PCI_EXP_DEVSTA2		42	/* Device Status 2 */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	44	/* v2 endpoints end here */
+#define PCI_EXP_LNKCAP2		44	/* Link Capabilities 2 */
+#define  PCI_EXP_LNKCAP2_SLS_2_5GB	0x00000002 /* Supported Speed 2.5GT/s */
+#define  PCI_EXP_LNKCAP2_SLS_5_0GB	0x00000004 /* Supported Speed 5.0GT/s */
+#define  PCI_EXP_LNKCAP2_SLS_8_0GB	0x00000008 /* Supported Speed 8.0GT/s */
+#define  PCI_EXP_LNKCAP2_CROSSLINK	0x00000100 /* Crosslink supported */
 #define PCI_EXP_LNKCTL2		48	/* Link Control 2 */
-#define PCI_EXP_SLTCTL2		56	/* Slot Control 2 */
+#define PCI_EXP_LNKSTA2		50	/* Link Status 2 */
+#define PCI_EXP_SLTCAP2		52	/* Slot Capabilities 2 */
+#define PCI_EXP_SLTCTL2 	56	/* Slot Control 2 */
+#define PCI_EXP_SLTSTA2		58	/* Slot Status 2 */
 
 /* Extended Capabilities (PCI-X 2.0 and Express) */
 #define PCI_EXT_CAP_ID(header)		(header & 0x0000ffff)
 #define PCI_EXT_CAP_VER(header)		((header >> 16) & 0xf)
 #define PCI_EXT_CAP_NEXT(header)	((header >> 20) & 0xffc)
 
-#define PCI_EXT_CAP_ID_ERR	1
-#define PCI_EXT_CAP_ID_VC	2
-#define PCI_EXT_CAP_ID_DSN	3
-#define PCI_EXT_CAP_ID_PWR	4
-#define PCI_EXT_CAP_ID_VNDR	11
-#define PCI_EXT_CAP_ID_ACS	13
-#define PCI_EXT_CAP_ID_ARI	14
-#define PCI_EXT_CAP_ID_ATS	15
-#define PCI_EXT_CAP_ID_SRIOV	16
-#define PCI_EXT_CAP_ID_LTR	24
+#define PCI_EXT_CAP_ID_ERR	0x01	/* Advanced Error Reporting */
+#define PCI_EXT_CAP_ID_VC	0x02	/* Virtual Channel Capability */
+#define PCI_EXT_CAP_ID_DSN	0x03	/* Device Serial Number */
+#define PCI_EXT_CAP_ID_PWR	0x04	/* Power Budgeting */
+#define PCI_EXT_CAP_ID_RCLD	0x05	/* Root Complex Link Declaration */
+#define PCI_EXT_CAP_ID_RCILC	0x06	/* Root Complex Internal Link Control */
+#define PCI_EXT_CAP_ID_RCEC	0x07	/* Root Complex Event Collector */
+#define PCI_EXT_CAP_ID_MFVC	0x08	/* Multi-Function VC Capability */
+#define PCI_EXT_CAP_ID_VC9	0x09	/* same as _VC */
+#define PCI_EXT_CAP_ID_RCRB	0x0A	/* Root Complex RB? */
+#define PCI_EXT_CAP_ID_VNDR	0x0B	/* Vendor-Specific */
+#define PCI_EXT_CAP_ID_CAC	0x0C	/* Config Access - obsolete */
+#define PCI_EXT_CAP_ID_ACS	0x0D	/* Access Control Services */
+#define PCI_EXT_CAP_ID_ARI	0x0E	/* Alternate Routing ID */
+#define PCI_EXT_CAP_ID_ATS	0x0F	/* Address Translation Services */
+#define PCI_EXT_CAP_ID_SRIOV	0x10	/* Single Root I/O Virtualization */
+#define PCI_EXT_CAP_ID_MRIOV	0x11	/* Multi Root I/O Virtualization */
+#define PCI_EXT_CAP_ID_MCAST	0x12	/* Multicast */
+#define PCI_EXT_CAP_ID_PRI	0x13	/* Page Request Interface */
+#define PCI_EXT_CAP_ID_AMD_XXX	0x14	/* Reserved for AMD */
+#define PCI_EXT_CAP_ID_REBAR	0x15	/* Resizable BAR */
+#define PCI_EXT_CAP_ID_DPA	0x16	/* Dynamic Power Allocation */
+#define PCI_EXT_CAP_ID_TPH	0x17	/* TPH Requester */
+#define PCI_EXT_CAP_ID_LTR	0x18	/* Latency Tolerance Reporting */
+#define PCI_EXT_CAP_ID_SECPCI	0x19	/* Secondary PCIe Capability */
+#define PCI_EXT_CAP_ID_PMUX	0x1A	/* Protocol Multiplexing */
+#define PCI_EXT_CAP_ID_PASID	0x1B	/* Process Address Space ID */
+#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PASID
+
+#define PCI_EXT_CAP_DSN_SIZEOF	12
+#define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
 
 /* Advanced Error Reporting */
 #define PCI_ERR_UNCOR_STATUS	4	/* Uncorrectable Error Status */
@@ -556,6 +589,11 @@
 #define  PCI_ERR_UNC_MALF_TLP	0x00040000	/* Malformed TLP */
 #define  PCI_ERR_UNC_ECRC	0x00080000	/* ECRC Error Status */
 #define  PCI_ERR_UNC_UNSUP	0x00100000	/* Unsupported Request */
+#define  PCI_ERR_UNC_ACSV	0x00200000	/* ACS Violation */
+#define  PCI_ERR_UNC_INTN	0x00400000	/* internal error */
+#define  PCI_ERR_UNC_MCBTLP	0x00800000	/* MC blocked TLP */
+#define  PCI_ERR_UNC_ATOMEG	0x01000000	/* Atomic egress blocked */
+#define  PCI_ERR_UNC_TLPPRE	0x02000000	/* TLP prefix blocked */
 #define PCI_ERR_UNCOR_MASK	8	/* Uncorrectable Error Mask */
 	/* Same bits as above */
 #define PCI_ERR_UNCOR_SEVER	12	/* Uncorrectable Error Severity */
@@ -657,6 +695,7 @@
 #define  PCI_ARI_CTRL_MFVC	0x0001	/* MFVC Function Groups Enable */
 #define  PCI_ARI_CTRL_ACS	0x0002	/* ACS Function Groups Enable */
 #define  PCI_ARI_CTRL_FG(x)	(((x) >> 4) & 7) /* Function Group */
+#define PCI_EXT_CAP_ARI_SIZEOF	8
 
 /* Address Translation Service */
 #define PCI_ATS_CAP		0x04	/* ATS Capability Register */
@@ -666,6 +705,29 @@
 #define  PCI_ATS_CTRL_ENABLE	0x8000	/* ATS Enable */
 #define  PCI_ATS_CTRL_STU(x)	((x) & 0x1f)	/* Smallest Translation Unit */
 #define  PCI_ATS_MIN_STU	12	/* shift of minimum STU block */
+#define PCI_EXT_CAP_ATS_SIZEOF	8
+
+/* Page Request Interface */
+#define PCI_PRI_CTRL		0x04	/* PRI control register */
+#define  PCI_PRI_CTRL_ENABLE	0x01	/* Enable */
+#define  PCI_PRI_CTRL_RESET	0x02	/* Reset */
+#define PCI_PRI_STATUS		0x06	/* PRI status register */
+#define  PCI_PRI_STATUS_RF	0x001	/* Response Failure */
+#define  PCI_PRI_STATUS_UPRGI	0x002	/* Unexpected PRG index */
+#define  PCI_PRI_STATUS_STOPPED	0x100	/* PRI Stopped */
+#define PCI_PRI_MAX_REQ		0x08	/* PRI max reqs supported */
+#define PCI_PRI_ALLOC_REQ	0x0c	/* PRI max reqs allowed */
+#define PCI_EXT_CAP_PRI_SIZEOF	16
+
+/* Process Address Space ID */
+#define PCI_PASID_CAP		0x04    /* PASID feature register */
+#define  PCI_PASID_CAP_EXEC	0x02	/* Exec permissions Supported */
+#define  PCI_PASID_CAP_PRIV	0x04	/* Privilege Mode Supported */
+#define PCI_PASID_CTRL		0x06    /* PASID control register */
+#define  PCI_PASID_CTRL_ENABLE	0x01	/* Enable bit */
+#define  PCI_PASID_CTRL_EXEC	0x02	/* Exec permissions Enable */
+#define  PCI_PASID_CTRL_PRIV	0x04	/* Privilege Mode Enable */
+#define PCI_EXT_CAP_PASID_SIZEOF	8
 
 /* Single Root I/O Virtualization */
 #define PCI_SRIOV_CAP		0x04	/* SR-IOV Capabilities */
@@ -697,6 +759,7 @@
 #define  PCI_SRIOV_VFM_MI	0x1	/* Dormant.MigrateIn */
 #define  PCI_SRIOV_VFM_MO	0x2	/* Active.MigrateOut */
 #define  PCI_SRIOV_VFM_AV	0x3	/* Active.Available */
+#define PCI_EXT_CAP_SRIOV_SIZEOF 64
 
 #define PCI_LTR_MAX_SNOOP_LAT	0x4
 #define PCI_LTR_MAX_NOSNOOP_LAT	0x6
@@ -713,7 +776,38 @@
 #define  PCI_ACS_UF		0x10	/* Upstream Forwarding */
 #define  PCI_ACS_EC		0x20	/* P2P Egress Control */
 #define  PCI_ACS_DT		0x40	/* Direct Translated P2P */
+#define PCI_ACS_EGRESS_BITS	0x05	/* ACS Egress Control Vector Size */
 #define PCI_ACS_CTRL		0x06	/* ACS Control Register */
 #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress Control Vector */
 
+#define PCI_VSEC_HDR		4	/* extended cap - vendor-specific */
+#define  PCI_VSEC_HDR_LEN_SHIFT	20	/* shift for length field */
+
+/* SATA capability */
+#define PCI_SATA_REGS		4	/* SATA REGs specifier */
+#define  PCI_SATA_REGS_MASK	0xF	/* location - BAR#/inline */
+#define  PCI_SATA_REGS_INLINE	0xF	/* REGS in config space */
+#define PCI_SATA_SIZEOF_SHORT	8
+#define PCI_SATA_SIZEOF_LONG	16
+
+/* Resizable BARs */
+#define PCI_REBAR_CTRL		8	/* control register */
+#define  PCI_REBAR_CTRL_NBAR_MASK	(7 << 5)	/* mask for # bars */
+#define  PCI_REBAR_CTRL_NBAR_SHIFT	5	/* shift for # bars */
+
+/* Dynamic Power Allocation */
+#define PCI_DPA_CAP		4	/* capability register */
+#define  PCI_DPA_CAP_SUBSTATE_MASK	0x1F	/* # substates - 1 */
+#define PCI_DPA_BASE_SIZEOF	16	/* size with 0 substates */
+
+/* TPH Requester */
+#define PCI_TPH_CAP		4	/* capability register */
+#define  PCI_TPH_CAP_LOC_MASK	0x600	/* location mask */
+#define   PCI_TPH_LOC_NONE	0x000	/* no location */
+#define   PCI_TPH_LOC_CAP	0x200	/* in capability */
+#define   PCI_TPH_LOC_MSIX	0x400	/* in MSI-X */
+#define PCI_TPH_CAP_ST_MASK	0x07FF0000	/* st table mask */
+#define PCI_TPH_CAP_ST_SHIFT	16	/* st table shift */
+#define PCI_TPH_BASE_SIZEOF	12	/* size with no st table */
+
 #endif /* LINUX_PCI_REGS_H */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2015-09-12 12:36 [Qemu-devel] [PATCH v4 0/3] pcie: Add support for Single Root I/O Virtualization Knut Omang
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices Knut Omang
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 2/3] pci: Update pci_regs header Knut Omang
@ 2015-09-12 12:36 ` Knut Omang
  2015-09-17 11:49   ` Marcel Apfelbaum
  2015-10-07 15:06   ` Marcel Apfelbaum
  2 siblings, 2 replies; 16+ messages in thread
From: Knut Omang @ 2015-09-12 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Knut Omang,
	Richard W.M. Jones, Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

This patch provides the building blocks for creating an SR/IOV
PCIe Extended Capability header and register/unregister
SR/IOV Virtual Functions.

Signed-off-by: Knut Omang <knut.omang@oracle.com>
---
 hw/pci/Makefile.objs        |   2 +-
 hw/pci/pci.c                |  99 ++++++++++++----
 hw/pci/pcie.c               |   9 +-
 hw/pci/pcie_sriov.c         | 271 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h        |  11 +-
 include/hw/pci/pcie.h       |   6 +
 include/hw/pci/pcie_sriov.h |  55 +++++++++
 include/qemu/typedefs.h     |   2 +
 8 files changed, 426 insertions(+), 29 deletions(-)
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
index 9f905e6..2226980 100644
--- a/hw/pci/Makefile.objs
+++ b/hw/pci/Makefile.objs
@@ -3,7 +3,7 @@ common-obj-$(CONFIG_PCI) += msix.o msi.o
 common-obj-$(CONFIG_PCI) += shpc.o
 common-obj-$(CONFIG_PCI) += slotid_cap.o
 common-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
-common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
+common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o pcie_sriov.o
 
 common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
 common-obj-$(CONFIG_ALL) += pci-stub.o
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index a5cc015..9c0eba1 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -153,6 +153,9 @@ int pci_bar(PCIDevice *d, int reg)
 {
     uint8_t type;
 
+    /* PCIe virtual functions do not have their own BARs */
+    assert(!pci_is_vf(d));
+
     if (reg != PCI_ROM_SLOT)
         return PCI_BASE_ADDRESS_0 + reg * 4;
 
@@ -211,22 +214,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
     }
 }
 
-static void pci_do_device_reset(PCIDevice *dev)
+static void pci_reset_regions(PCIDevice *dev)
 {
     int r;
+    if (pci_is_vf(dev)) {
+        return;
+    }
 
-    pci_device_deassert_intx(dev);
-    assert(dev->irq_state == 0);
-
-    /* Clear all writable bits */
-    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
-                                 pci_get_word(dev->wmask + PCI_COMMAND) |
-                                 pci_get_word(dev->w1cmask + PCI_COMMAND));
-    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
-                                 pci_get_word(dev->wmask + PCI_STATUS) |
-                                 pci_get_word(dev->w1cmask + PCI_STATUS));
-    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
-    dev->config[PCI_INTERRUPT_LINE] = 0x0;
     for (r = 0; r < PCI_NUM_REGIONS; ++r) {
         PCIIORegion *region = &dev->io_regions[r];
         if (!region->size) {
@@ -240,6 +234,27 @@ static void pci_do_device_reset(PCIDevice *dev)
             pci_set_long(dev->config + pci_bar(dev, r), region->type);
         }
     }
+}
+
+static void pci_do_device_reset(PCIDevice *dev)
+{
+    qdev_reset_all(&dev->qdev);
+
+    dev->irq_state = 0;
+    pci_update_irq_status(dev);
+    pci_device_deassert_intx(dev);
+    assert(dev->irq_state == 0);
+
+    /* Clear all writable bits */
+    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
+                                 pci_get_word(dev->wmask + PCI_COMMAND) |
+                                 pci_get_word(dev->w1cmask + PCI_COMMAND));
+    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
+                                 pci_get_word(dev->wmask + PCI_STATUS) |
+                                 pci_get_word(dev->w1cmask + PCI_STATUS));
+    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
+    dev->config[PCI_INTERRUPT_LINE] = 0x0;
+    pci_reset_regions(dev);
     pci_update_mappings(dev);
 
     msi_reset(dev);
@@ -771,6 +786,15 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp)
         dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
     }
 
+    /* With SR/IOV and ARI, a device at function 0 need not be a multifunction
+     * device, as it may just be a VF that ended up with function 0 in
+     * the legacy PCI interpretation. Avoid failing in such cases:
+     */
+    if (pci_is_vf(dev) &&
+        dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+        return;
+    }
+
     /*
      * multifunction bit is interpreted in two ways as follows.
      *   - all functions must set the bit to 1.
@@ -962,6 +986,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
     uint64_t wmask;
     pcibus_t size = memory_region_size(memory);
 
+    assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */
     assert(region_num >= 0);
     assert(region_num < PCI_NUM_REGIONS);
     if (size & (size-1)) {
@@ -1060,11 +1085,44 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
     return pci_dev->io_regions[region_num].addr;
 }
 
-static pcibus_t pci_bar_address(PCIDevice *d,
-				int reg, uint8_t type, pcibus_t size)
+
+static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
+                                        uint8_t type, pcibus_t size)
+{
+    pcibus_t new_addr;
+    if (!pci_is_vf(d)) {
+        int bar = pci_bar(d, reg);
+        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+            new_addr = pci_get_quad(d->config + bar);
+        } else {
+            new_addr = pci_get_long(d->config + bar);
+        }
+    } else {
+        PCIDevice *pf = d->exp.sriov_vf.pf;
+        uint16_t sriov_cap = pf->exp.sriov_cap;
+        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
+        uint16_t vf_offset = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
+        uint16_t vf_stride = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
+        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) / vf_stride;
+
+        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+            new_addr = pci_get_quad(pf->config + bar);
+        } else {
+            new_addr = pci_get_long(pf->config + bar);
+        }
+        new_addr += vf_num * size;
+    }
+    if (reg != PCI_ROM_SLOT) {
+        /* Preserve the rom enable bit */
+        new_addr &= ~(size - 1);
+    }
+    return new_addr;
+}
+
+pcibus_t pci_bar_address(PCIDevice *d,
+                         int reg, uint8_t type, pcibus_t size)
 {
     pcibus_t new_addr, last_addr;
-    int bar = pci_bar(d, reg);
     uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
     Object *machine = qdev_get_machine();
     ObjectClass *oc = object_get_class(machine);
@@ -1075,7 +1133,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
         if (!(cmd & PCI_COMMAND_IO)) {
             return PCI_BAR_UNMAPPED;
         }
-        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
+        new_addr = pci_config_get_bar_addr(d, reg, type, size);
         last_addr = new_addr + size - 1;
         /* Check if 32 bit BAR wraps around explicitly.
          * TODO: make priorities correct and remove this work around.
@@ -1090,11 +1148,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
     if (!(cmd & PCI_COMMAND_MEMORY)) {
         return PCI_BAR_UNMAPPED;
     }
-    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
-        new_addr = pci_get_quad(d->config + bar);
-    } else {
-        new_addr = pci_get_long(d->config + bar);
-    }
+    new_addr = pci_config_get_bar_addr(d, reg, type, size);
     /* the ROM slot has a specific enable bit */
     if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
         return PCI_BAR_UNMAPPED;
@@ -1228,6 +1282,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
 
     msi_write_config(d, addr, val_in, l);
     msix_write_config(d, addr, val_in, l);
+    pcie_sriov_config_write(d, addr, val_in, l);
 }
 
 /***********************************************************/
diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 6e28985..ba49c0f 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -253,7 +253,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
      * Right now, only a device of function = 0 is allowed to be
      * hot plugged/unplugged.
      */
-    assert(PCI_FUNC(pci_dev->devfn) == 0);
+    assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_is_vf(pci_dev));
 
     pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
                                PCI_EXP_SLTSTA_PDS);
@@ -265,10 +265,11 @@ void pcie_cap_slot_hot_unplug_request_cb(HotplugHandler *hotplug_dev,
                                          DeviceState *dev, Error **errp)
 {
     uint8_t *exp_cap;
+    PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
 
-    pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev, &exp_cap, errp);
+    pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
 
-    pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
+    pcie_cap_slot_push_attention_button(pdev);
 }
 
 /* pci express slot for pci express root/downstream port
@@ -408,7 +409,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
     }
 
     /*
-     * If the slot is polulated, power indicator is off and power
+     * If the slot is populated, power indicator is off and power
      * controller is off, it is safe to detach the devices.
      */
     if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
new file mode 100644
index 0000000..14f31cc
--- /dev/null
+++ b/hw/pci/pcie_sriov.c
@@ -0,0 +1,271 @@
+/*
+ * pcie_sriov.c:
+ *
+ * Implementation of SR/IOV emulation support.
+ *
+ * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "hw/pci/pci.h"
+#include "hw/pci/pcie.h"
+#include "hw/pci/pci_bus.h"
+#include "qemu/range.h"
+
+//#define DEBUG_SRIOV
+#ifdef DEBUG_SRIOV
+# define SRIOV_DPRINTF(fmt, ...)                                         \
+    fprintf(stderr, "%s:%d " fmt, __func__, __LINE__, ## __VA_ARGS__)
+#else
+# define SRIOV_DPRINTF(fmt, ...) do {} while (0)
+#endif
+#define SRIOV_DEV_PRINTF(dev, fmt, ...)                                  \
+    SRIOV_DPRINTF("%s:%x "fmt, (dev)->name, (dev)->devfn, ## __VA_ARGS__)
+
+static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name);
+static void unregister_vfs(PCIDevice *dev);
+
+void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+                        const char *vfname, uint16_t vf_dev_id,
+                        uint16_t init_vfs, uint16_t total_vfs,
+                        uint16_t vf_offset, uint16_t vf_stride)
+{
+    uint8_t *cfg = dev->config + offset;
+    uint8_t *wmask;
+
+    pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
+                        offset, PCI_EXT_CAP_SRIOV_SIZEOF);
+    dev->exp.sriov_cap = offset;
+    dev->exp.sriov_pf.num_vfs = 0;
+    dev->exp.sriov_pf.vfname = g_strdup(vfname);
+    dev->exp.sriov_pf.vf = NULL;
+
+    pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
+    pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
+
+    /* Minimal page size support values required by the SR/IOV standard:
+     * 0x553 << 12 = 0x553000 = 4K + 8K + 64K + 256K + 1M + 4M
+     * Hard coded for simplicity in this version
+     */
+    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, 0x553);
+
+    /* Default is to use 4K pages, software can modify it
+     * to any of the supported bits
+     */
+    pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
+
+    /* Set up device ID and initial/total number of VFs available */
+    pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
+    pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
+    pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
+    pci_set_word(cfg + PCI_SRIOV_NUM_VF, 0);
+
+    /* Write enable control bits */
+    wmask = dev->wmask + offset;
+    pci_set_word(wmask + PCI_SRIOV_CTRL,
+                 PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
+    pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
+    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
+
+    qdev_prop_set_bit(&dev->qdev, "multifunction", true);
+}
+
+void pcie_sriov_pf_exit(PCIDevice *dev)
+{
+    SRIOV_DPRINTF("\n");
+    unregister_vfs(dev);
+    g_free((char *)dev->exp.sriov_pf.vfname);
+    dev->exp.sriov_pf.vfname = NULL;
+}
+
+void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
+                               uint8_t type, dma_addr_t size)
+{
+    uint32_t addr;
+    uint64_t wmask;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+
+    assert(sriov_cap > 0);
+    assert(region_num >= 0);
+    assert(region_num < PCI_NUM_REGIONS);
+    assert(region_num != PCI_ROM_SLOT);
+
+    wmask = ~(size - 1);
+    addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
+
+    pci_set_long(dev->config + addr, type);
+    if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
+        type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+        pci_set_quad(dev->wmask + addr, wmask);
+        pci_set_quad(dev->cmask + addr, ~0ULL);
+    } else {
+        pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
+        pci_set_long(dev->cmask + addr, 0xffffffff);
+    }
+    dev->exp.sriov_pf.vf_bar_type[region_num] = type;
+}
+
+void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
+                                MemoryRegion *memory)
+{
+    PCIIORegion *r;
+    uint8_t type;
+    pcibus_t size = memory_region_size(memory);
+
+    assert(pci_is_vf(dev)); /* PFs must use pci_register_bar */
+    assert(region_num >= 0);
+    assert(region_num < PCI_NUM_REGIONS);
+    type = dev->exp.sriov_vf.pf->exp.sriov_pf.vf_bar_type[region_num];
+
+    if (size & (size-1)) {
+        fprintf(stderr, "ERROR: PCI region size must be a power"
+                " of two - type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size);
+        exit(1);
+    }
+
+    r = &dev->io_regions[region_num];
+    r->memory = memory;
+    r->address_space =
+        type & PCI_BASE_ADDRESS_SPACE_IO
+        ? dev->bus->address_space_io
+        : dev->bus->address_space_mem;
+    r->size = size;
+    r->type = type;
+
+    r->addr = pci_bar_address(dev, region_num, r->type, r->size);
+    if (r->addr != PCI_BAR_UNMAPPED) {
+        memory_region_add_subregion_overlap(r->address_space,
+                                            r->addr, r->memory, 1);
+    }
+}
+
+static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name)
+{
+    PCIDevice *dev = pci_create(pf->bus, devfn, name);
+    dev->exp.sriov_vf.pf = pf;
+    Error *local_err = NULL;
+
+    object_property_set_bool(OBJECT(&dev->qdev), true, "realized", &local_err);
+    if (local_err) {
+        fprintf(stderr, "Failed to enable: %s\n",
+                error_get_pretty(local_err));
+        error_free(local_err);
+        return NULL;
+    }
+
+    /* set vid/did according to sr/iov spec - they are not used */
+    pci_config_set_vendor_id(dev->config, 0xffff);
+    pci_config_set_device_id(dev->config, 0xffff);
+    return dev;
+}
+
+static void register_vfs(PCIDevice *dev)
+{
+    uint16_t num_vfs;
+    uint16_t i;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+    uint16_t vf_offset = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
+    uint16_t vf_stride = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
+    int32_t devfn = dev->devfn + vf_offset;
+
+    assert(sriov_cap > 0);
+    num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
+
+    dev->exp.sriov_pf.vf = g_malloc(sizeof(PCIDevice *) * num_vfs);
+    assert(dev->exp.sriov_pf.vf);
+
+    SRIOV_DEV_PRINTF(dev, "creating %d vf devs\n", num_vfs);
+    for (i = 0; i < num_vfs; i++) {
+        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname);
+        if (!dev->exp.sriov_pf.vf[i]) {
+            SRIOV_DEV_PRINTF(dev, "Failed to create VF %d at devfn %x.%x\n", i,
+                             PCI_SLOT(devfn), PCI_FUNC(devfn));
+            num_vfs = i;
+            break;
+        }
+        devfn += vf_stride;
+    }
+    dev->exp.sriov_pf.num_vfs = num_vfs;
+}
+
+static void unregister_vfs(PCIDevice *dev)
+{
+    Error *local_err = NULL;
+    uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
+    uint16_t i;
+    SRIOV_DEV_PRINTF(dev, "Unregistering %d vf devs\n", num_vfs);
+    for (i = 0; i < num_vfs; i++) {
+        object_property_set_bool(OBJECT(&dev->exp.sriov_pf.vf[i]->qdev), false, "realized", &local_err);
+        if (local_err) {
+            fprintf(stderr, "Failed to unplug: %s\n",
+                    error_get_pretty(local_err));
+            error_free(local_err);
+        }
+    }
+    g_free(dev->exp.sriov_pf.vf);
+    dev->exp.sriov_pf.vf = NULL;
+    dev->exp.sriov_pf.num_vfs = 0;
+}
+
+void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
+{
+    uint32_t off;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+
+    if (!sriov_cap || address < sriov_cap) {
+        return;
+    }
+    off = address - sriov_cap;
+    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
+        return;
+    }
+
+    SRIOV_DEV_PRINTF(dev, "cap at 0x%x sriov offset 0x%x val 0x%x len %d\n",
+                 sriov_cap, off, val, len);
+
+    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
+        if (dev->exp.sriov_pf.num_vfs) {
+            if (!(val & PCI_SRIOV_CTRL_VFE)) {
+                unregister_vfs(dev);
+            }
+        } else {
+            if (val & PCI_SRIOV_CTRL_VFE) {
+                register_vfs(dev);
+            }
+        }
+    }
+}
+
+
+/* Reset SR/IOV VF Enable bit to trigger an unregister of all VFs */
+void pcie_sriov_pf_disable_vfs(PCIDevice *dev)
+{
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+    if (sriov_cap) {
+        uint32_t val = pci_get_byte(dev->config + sriov_cap + PCI_SRIOV_CTRL);
+        if (val & PCI_SRIOV_CTRL_VFE) {
+            val &= ~PCI_SRIOV_CTRL_VFE;
+            pcie_sriov_config_write(dev, sriov_cap + PCI_SRIOV_CTRL, val, 1);
+        }
+    }
+}
+
+/* Add optional supported page sizes to the mask of supported page sizes */
+void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize)
+{
+    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
+    uint8_t *wmask = dev->wmask + dev->exp.sriov_cap;
+
+    uint16_t sup_pgsize = pci_get_word(cfg + PCI_SRIOV_SUP_PGSIZE);
+
+    sup_pgsize |= opt_sup_pgsize;
+
+    /* Make sure the new bits are set, and that system page size
+     * also can be set to any of the new values according to spec:
+     */
+    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, sup_pgsize);
+    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, sup_pgsize);
+}
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 551cb3d..2e9d8ba 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -11,8 +11,6 @@
 /* PCI includes legacy ISA access.  */
 #include "hw/isa/isa.h"
 
-#include "hw/pci/pcie.h"
-
 /* PCI bus */
 
 #define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) | ((func) & 0x07))
@@ -132,6 +130,7 @@ enum {
 #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
 
 #include "hw/pci/pci_regs.h"
+#include "hw/pci/pcie.h"
 
 /* PCI HEADER_TYPE */
 #define  PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
@@ -421,6 +420,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
+pcibus_t pci_bar_address(PCIDevice *d,
+                         int reg, uint8_t type, pcibus_t size);
+
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
 {
@@ -672,6 +674,11 @@ static inline int pci_is_express(const PCIDevice *d)
     return d->cap_present & QEMU_PCI_CAP_EXPRESS;
 }
 
+static inline int pci_is_vf(const PCIDevice *d)
+{
+    return d->exp.sriov_vf.pf != NULL;
+}
+
 static inline uint32_t pci_config_size(const PCIDevice *d)
 {
     return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE : PCI_CONFIG_SPACE_SIZE;
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index b48a7a2..b09f79a 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -25,6 +25,7 @@
 #include "hw/pci/pci_regs.h"
 #include "hw/pci/pcie_regs.h"
 #include "hw/pci/pcie_aer.h"
+#include "hw/pci/pcie_sriov.h"
 #include "hw/hotplug.h"
 
 typedef enum {
@@ -74,6 +75,11 @@ struct PCIExpressDevice {
     /* AER */
     uint16_t aer_cap;
     PCIEAERLog aer_log;
+
+    /* SR/IOV */
+    uint16_t sriov_cap;
+    PCIESriovPF sriov_pf;
+    PCIESriovVF sriov_vf;
 };
 
 #define COMPAT_PROP_PCP "power_controller_present"
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
new file mode 100644
index 0000000..061f5cf
--- /dev/null
+++ b/include/hw/pci/pcie_sriov.h
@@ -0,0 +1,55 @@
+/*
+ * pcie_sriov.h:
+ *
+ * Implementation of SR/IOV emulation support.
+ *
+ * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_PCIE_SRIOV_H
+#define QEMU_PCIE_SRIOV_H
+
+
+
+struct PCIESriovPF {
+    uint16_t num_vfs;           /* Number of virtual functions created */
+    uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
+    const char *vfname;         /* Reference to the device type used for the VFs */
+    PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
+};
+
+struct PCIESriovVF {
+    PCIDevice *pf;              /* Pointer back to owner physical function */
+};
+
+void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+                        const char *vfname, uint16_t vf_dev_id,
+                        uint16_t init_vfs, uint16_t total_vfs,
+                        uint16_t vf_offset, uint16_t vf_stride);
+void pcie_sriov_pf_exit(PCIDevice *dev);
+
+/* Set up a VF bar in the SR/IOV bar area */
+void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
+                               uint8_t type, dma_addr_t size);
+
+/* Instantiate a bar for a VF */
+void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
+                                MemoryRegion *memory);
+
+/* Add optional supported page sizes to the mask of supported page sizes
+ * Page size values are interpreted as opt_sup_pgsize << 12.
+ */
+void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize);
+
+/* SR/IOV capability config write handler */
+void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
+                             uint32_t val, int len);
+
+/* Reset SR/IOV VF Enable bit to unregister all VFs */
+void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
+
+#endif /* QEMU_PCIE_SRIOV_H */
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index f8a9dd6..fa36b8c 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -53,6 +53,8 @@ typedef struct PCIDevice PCIDevice;
 typedef struct PCIEAERErr PCIEAERErr;
 typedef struct PCIEAERLog PCIEAERLog;
 typedef struct PCIEAERMsg PCIEAERMsg;
+typedef struct PCIESriovPF PCIESriovPF;
+typedef struct PCIESriovVF PCIESriovVF;
 typedef struct PCIEPort PCIEPort;
 typedef struct PCIESlot PCIESlot;
 typedef struct PCIExpressDevice PCIExpressDevice;
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices Knut Omang
@ 2015-09-17 11:11   ` Marcel Apfelbaum
  2015-09-17 12:12     ` Knut Omang
  0 siblings, 1 reply; 16+ messages in thread
From: Marcel Apfelbaum @ 2015-09-17 11:11 UTC (permalink / raw)
  To: Knut Omang, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On 09/12/2015 03:36 PM, Knut Omang wrote:
> Without this, the devfn argument to pci_create_*()
> does not affect the assigned devfn.
>
> Needed to support (VF_STRIDE,VF_OFFSET) values other than (1,1)
> for SR/IOV.
>
> Signed-off-by: Knut Omang <knut.omang@oracle.com>
> ---
>   hw/pci/pci.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index ccea628..a5cc015 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -1840,7 +1840,7 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
>       bus = PCI_BUS(qdev_get_parent_bus(qdev));
>       pci_dev = do_pci_register_device(pci_dev, bus,
>                                        object_get_typename(OBJECT(qdev)),
> -                                     pci_dev->devfn, errp);
> +                                     object_property_get_int(OBJECT(qdev), "addr", NULL), errp);
Hi,

I don't get this, using object_property_get_int on "addr" should return the value of pci_dev->devfn,
can you please tell what exactly is not working?

Thanks,
Marcel



>       if (pci_dev == NULL)
>           return;
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Knut Omang
@ 2015-09-17 11:49   ` Marcel Apfelbaum
  2015-09-17 14:10     ` Knut Omang
  2015-10-07 15:06   ` Marcel Apfelbaum
  1 sibling, 1 reply; 16+ messages in thread
From: Marcel Apfelbaum @ 2015-09-17 11:49 UTC (permalink / raw)
  To: Knut Omang, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On 09/12/2015 03:36 PM, Knut Omang wrote:
> This patch provides the building blocks for creating an SR/IOV
> PCIe Extended Capability header and register/unregister
> SR/IOV Virtual Functions.
>
> Signed-off-by: Knut Omang <knut.omang@oracle.com>
> ---
>   hw/pci/Makefile.objs        |   2 +-
>   hw/pci/pci.c                |  99 ++++++++++++----
>   hw/pci/pcie.c               |   9 +-
>   hw/pci/pcie_sriov.c         | 271 ++++++++++++++++++++++++++++++++++++++++++++
>   include/hw/pci/pci.h        |  11 +-
>   include/hw/pci/pcie.h       |   6 +
>   include/hw/pci/pcie_sriov.h |  55 +++++++++
>   include/qemu/typedefs.h     |   2 +
>   8 files changed, 426 insertions(+), 29 deletions(-)
>   create mode 100644 hw/pci/pcie_sriov.c
>   create mode 100644 include/hw/pci/pcie_sriov.h
>
> diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
> index 9f905e6..2226980 100644
> --- a/hw/pci/Makefile.objs
> +++ b/hw/pci/Makefile.objs
> @@ -3,7 +3,7 @@ common-obj-$(CONFIG_PCI) += msix.o msi.o
>   common-obj-$(CONFIG_PCI) += shpc.o
>   common-obj-$(CONFIG_PCI) += slotid_cap.o
>   common-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
> -common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
> +common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o pcie_sriov.o
>
>   common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
>   common-obj-$(CONFIG_ALL) += pci-stub.o
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index a5cc015..9c0eba1 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -153,6 +153,9 @@ int pci_bar(PCIDevice *d, int reg)
>   {
>       uint8_t type;
>
> +    /* PCIe virtual functions do not have their own BARs */
> +    assert(!pci_is_vf(d));
> +
>       if (reg != PCI_ROM_SLOT)
>           return PCI_BASE_ADDRESS_0 + reg * 4;
>
> @@ -211,22 +214,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
>       }
>   }
>
> -static void pci_do_device_reset(PCIDevice *dev)
> +static void pci_reset_regions(PCIDevice *dev)
>   {
>       int r;
> +    if (pci_is_vf(dev)) {
> +        return;
> +    }
>
> -    pci_device_deassert_intx(dev);
> -    assert(dev->irq_state == 0);
> -
> -    /* Clear all writable bits */
> -    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> -                                 pci_get_word(dev->wmask + PCI_COMMAND) |
> -                                 pci_get_word(dev->w1cmask + PCI_COMMAND));
> -    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> -                                 pci_get_word(dev->wmask + PCI_STATUS) |
> -                                 pci_get_word(dev->w1cmask + PCI_STATUS));
> -    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> -    dev->config[PCI_INTERRUPT_LINE] = 0x0;
>       for (r = 0; r < PCI_NUM_REGIONS; ++r) {
>           PCIIORegion *region = &dev->io_regions[r];
>           if (!region->size) {
> @@ -240,6 +234,27 @@ static void pci_do_device_reset(PCIDevice *dev)
>               pci_set_long(dev->config + pci_bar(dev, r), region->type);
>           }
>       }
> +}
> +
> +static void pci_do_device_reset(PCIDevice *dev)
> +{
> +    qdev_reset_all(&dev->qdev);

Hi,
Thank you for resubmitting this series!

This is only a quick look, I hope I'll have more time next week to go over it again.


> +
> +    dev->irq_state = 0;

Are you sure we need the assignment above? It seems that the irq_state
should be modified only by the intx wrappers as  pci_device_deassert_intx and so.


> +    pci_update_irq_status(dev);

Why do we have to update the irq status on reset?


> +    pci_device_deassert_intx(dev);
> +    assert(dev->irq_state == 0);
> +
> +    /* Clear all writable bits */
> +    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> +                                 pci_get_word(dev->wmask + PCI_COMMAND) |
> +                                 pci_get_word(dev->w1cmask + PCI_COMMAND));
> +    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> +                                 pci_get_word(dev->wmask + PCI_STATUS) |
> +                                 pci_get_word(dev->w1cmask + PCI_STATUS));
> +    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> +    dev->config[PCI_INTERRUPT_LINE] = 0x0;
> +    pci_reset_regions(dev);
>       pci_update_mappings(dev);
>
>       msi_reset(dev);
> @@ -771,6 +786,15 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp)
>           dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
>       }
>
> +    /* With SR/IOV and ARI, a device at function 0 need not be a multifunction
> +     * device, as it may just be a VF that ended up with function 0 in
> +     * the legacy PCI interpretation. Avoid failing in such cases:
> +     */
> +    if (pci_is_vf(dev) &&
> +        dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +        return;
> +    }
> +
>       /*
>        * multifunction bit is interpreted in two ways as follows.
>        *   - all functions must set the bit to 1.
> @@ -962,6 +986,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
>       uint64_t wmask;
>       pcibus_t size = memory_region_size(memory);
>
> +    assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */
>       assert(region_num >= 0);
>       assert(region_num < PCI_NUM_REGIONS);
>       if (size & (size-1)) {
> @@ -1060,11 +1085,44 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
>       return pci_dev->io_regions[region_num].addr;
>   }
>
> -static pcibus_t pci_bar_address(PCIDevice *d,
> -				int reg, uint8_t type, pcibus_t size)
> +
> +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> +                                        uint8_t type, pcibus_t size)
> +{
> +    pcibus_t new_addr;
> +    if (!pci_is_vf(d)) {
> +        int bar = pci_bar(d, reg);
> +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +            new_addr = pci_get_quad(d->config + bar);
> +        } else {
> +            new_addr = pci_get_long(d->config + bar);
> +        }
> +    } else {
> +        PCIDevice *pf = d->exp.sriov_vf.pf;
> +        uint16_t sriov_cap = pf->exp.sriov_cap;
> +        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
> +        uint16_t vf_offset = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
> +        uint16_t vf_stride = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
> +        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) / vf_stride;
> +
> +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +            new_addr = pci_get_quad(pf->config + bar);
> +        } else {
> +            new_addr = pci_get_long(pf->config + bar);
> +        }
> +        new_addr += vf_num * size;
> +    }
> +    if (reg != PCI_ROM_SLOT) {
> +        /* Preserve the rom enable bit */
> +        new_addr &= ~(size - 1);
> +    }
> +    return new_addr;
> +}
> +
> +pcibus_t pci_bar_address(PCIDevice *d,
> +                         int reg, uint8_t type, pcibus_t size)
>   {
>       pcibus_t new_addr, last_addr;
> -    int bar = pci_bar(d, reg);
>       uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
>       Object *machine = qdev_get_machine();
>       ObjectClass *oc = object_get_class(machine);
> @@ -1075,7 +1133,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>           if (!(cmd & PCI_COMMAND_IO)) {
>               return PCI_BAR_UNMAPPED;
>           }
> -        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> +        new_addr = pci_config_get_bar_addr(d, reg, type, size);
>           last_addr = new_addr + size - 1;
>           /* Check if 32 bit BAR wraps around explicitly.
>            * TODO: make priorities correct and remove this work around.
> @@ -1090,11 +1148,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>       if (!(cmd & PCI_COMMAND_MEMORY)) {
>           return PCI_BAR_UNMAPPED;
>       }
> -    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> -        new_addr = pci_get_quad(d->config + bar);
> -    } else {
> -        new_addr = pci_get_long(d->config + bar);
> -    }
> +    new_addr = pci_config_get_bar_addr(d, reg, type, size);
>       /* the ROM slot has a specific enable bit */
>       if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
>           return PCI_BAR_UNMAPPED;
> @@ -1228,6 +1282,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
>
>       msi_write_config(d, addr, val_in, l);
>       msix_write_config(d, addr, val_in, l);
> +    pcie_sriov_config_write(d, addr, val_in, l);
>   }
>
>   /***********************************************************/
> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> index 6e28985..ba49c0f 100644
> --- a/hw/pci/pcie.c
> +++ b/hw/pci/pcie.c
> @@ -253,7 +253,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
>        * Right now, only a device of function = 0 is allowed to be
>        * hot plugged/unplugged.
>        */
> -    assert(PCI_FUNC(pci_dev->devfn) == 0);
> +    assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_is_vf(pci_dev));
>
>       pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
>                                  PCI_EXP_SLTSTA_PDS);
> @@ -265,10 +265,11 @@ void pcie_cap_slot_hot_unplug_request_cb(HotplugHandler *hotplug_dev,
>                                            DeviceState *dev, Error **errp)
>   {
>       uint8_t *exp_cap;
> +    PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
>
> -    pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev, &exp_cap, errp);
> +    pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
>
> -    pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
> +    pcie_cap_slot_push_attention_button(pdev);
>   }

This chunk is not directly liked to the patch, I would put it in a different patch.

>
>   /* pci express slot for pci express root/downstream port
> @@ -408,7 +409,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
>       }
>
>       /*
> -     * If the slot is polulated, power indicator is off and power
> +     * If the slot is populated, power indicator is off and power
>        * controller is off, it is safe to detach the devices.
>        */
>       if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&


Same here. I am always happy to have this kind of types taken care of,
but I think a separate patch would be cleaner.

> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> new file mode 100644
> index 0000000..14f31cc
> --- /dev/null
> +++ b/hw/pci/pcie_sriov.c
> @@ -0,0 +1,271 @@
> +/*
> + * pcie_sriov.c:
> + *
> + * Implementation of SR/IOV emulation support.
> + *
> + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>

2015 :)

> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "hw/pci/pci.h"
> +#include "hw/pci/pcie.h"
> +#include "hw/pci/pci_bus.h"
> +#include "qemu/range.h"
> +
> +//#define DEBUG_SRIOV
> +#ifdef DEBUG_SRIOV
> +# define SRIOV_DPRINTF(fmt, ...)                                         \
> +    fprintf(stderr, "%s:%d " fmt, __func__, __LINE__, ## __VA_ARGS__)
> +#else
> +# define SRIOV_DPRINTF(fmt, ...) do {} while (0)
> +#endif
> +#define SRIOV_DEV_PRINTF(dev, fmt, ...)                                  \
> +    SRIOV_DPRINTF("%s:%x "fmt, (dev)->name, (dev)->devfn, ## __VA_ARGS__)
> +
> +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name);
> +static void unregister_vfs(PCIDevice *dev);
> +
> +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> +                        const char *vfname, uint16_t vf_dev_id,
> +                        uint16_t init_vfs, uint16_t total_vfs,
> +                        uint16_t vf_offset, uint16_t vf_stride)
> +{
> +    uint8_t *cfg = dev->config + offset;
> +    uint8_t *wmask;
> +
> +    pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
> +                        offset, PCI_EXT_CAP_SRIOV_SIZEOF);
> +    dev->exp.sriov_cap = offset;
> +    dev->exp.sriov_pf.num_vfs = 0;
> +    dev->exp.sriov_pf.vfname = g_strdup(vfname);
> +    dev->exp.sriov_pf.vf = NULL;
> +
> +    pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> +    pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> +
> +    /* Minimal page size support values required by the SR/IOV standard:
> +     * 0x553 << 12 = 0x553000 = 4K + 8K + 64K + 256K + 1M + 4M
> +     * Hard coded for simplicity in this version
> +     */

A define (with the above very good explanation) would be preferred.

> +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, 0x553);
> +
> +    /* Default is to use 4K pages, software can modify it
> +     * to any of the supported bits
> +     */
> +    pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
> +
> +    /* Set up device ID and initial/total number of VFs available */
> +    pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
> +    pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
> +    pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
> +    pci_set_word(cfg + PCI_SRIOV_NUM_VF, 0);
> +
> +    /* Write enable control bits */
> +    wmask = dev->wmask + offset;
> +    pci_set_word(wmask + PCI_SRIOV_CTRL,
> +                 PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
> +    pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
> +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
> +
> +    qdev_prop_set_bit(&dev->qdev, "multifunction", true);
> +}
> +
> +void pcie_sriov_pf_exit(PCIDevice *dev)
> +{
> +    SRIOV_DPRINTF("\n");
> +    unregister_vfs(dev);
> +    g_free((char *)dev->exp.sriov_pf.vfname);
> +    dev->exp.sriov_pf.vfname = NULL;
> +}
> +
> +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> +                               uint8_t type, dma_addr_t size)
> +{
> +    uint32_t addr;
> +    uint64_t wmask;
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +
> +    assert(sriov_cap > 0);
> +    assert(region_num >= 0);
> +    assert(region_num < PCI_NUM_REGIONS);
> +    assert(region_num != PCI_ROM_SLOT);
> +
> +    wmask = ~(size - 1);
> +    addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
> +
> +    pci_set_long(dev->config + addr, type);
> +    if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
> +        type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +        pci_set_quad(dev->wmask + addr, wmask);
> +        pci_set_quad(dev->cmask + addr, ~0ULL);
> +    } else {
> +        pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
> +        pci_set_long(dev->cmask + addr, 0xffffffff);
> +    }
> +    dev->exp.sriov_pf.vf_bar_type[region_num] = type;
> +}
> +
> +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> +                                MemoryRegion *memory)
> +{
> +    PCIIORegion *r;
> +    uint8_t type;
> +    pcibus_t size = memory_region_size(memory);
> +
> +    assert(pci_is_vf(dev)); /* PFs must use pci_register_bar */
> +    assert(region_num >= 0);
> +    assert(region_num < PCI_NUM_REGIONS);
> +    type = dev->exp.sriov_vf.pf->exp.sriov_pf.vf_bar_type[region_num];
> +
> +    if (size & (size-1)) {

We already have a  is_power_of_2 func defined in include/qemu/host-utils.h

> +        fprintf(stderr, "ERROR: PCI region size must be a power"
> +                " of two - type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size);

Maybe we should error_report instead of fprintf

> +        exit(1);
> +    }
> +
> +    r = &dev->io_regions[region_num];
> +    r->memory = memory;
> +    r->address_space =
> +        type & PCI_BASE_ADDRESS_SPACE_IO
> +        ? dev->bus->address_space_io
> +        : dev->bus->address_space_mem;
> +    r->size = size;
> +    r->type = type;
> +
> +    r->addr = pci_bar_address(dev, region_num, r->type, r->size);
> +    if (r->addr != PCI_BAR_UNMAPPED) {
> +        memory_region_add_subregion_overlap(r->address_space,
> +                                            r->addr, r->memory, 1);
> +    }
> +}
> +
> +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name)
> +{
> +    PCIDevice *dev = pci_create(pf->bus, devfn, name);
> +    dev->exp.sriov_vf.pf = pf;
> +    Error *local_err = NULL;
> +
> +    object_property_set_bool(OBJECT(&dev->qdev), true, "realized", &local_err);
> +    if (local_err) {
> +        fprintf(stderr, "Failed to enable: %s\n",
> +                error_get_pretty(local_err));
> +        error_free(local_err);

and error_report_err here

> +        return NULL;
> +    }
> +
> +    /* set vid/did according to sr/iov spec - they are not used */
> +    pci_config_set_vendor_id(dev->config, 0xffff);
> +    pci_config_set_device_id(dev->config, 0xffff);
> +    return dev;
> +}
> +
> +static void register_vfs(PCIDevice *dev)
> +{
> +    uint16_t num_vfs;
> +    uint16_t i;
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +    uint16_t vf_offset = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
> +    uint16_t vf_stride = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
> +    int32_t devfn = dev->devfn + vf_offset;
> +
> +    assert(sriov_cap > 0);
> +    num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> +
> +    dev->exp.sriov_pf.vf = g_malloc(sizeof(PCIDevice *) * num_vfs);
> +    assert(dev->exp.sriov_pf.vf);
> +
> +    SRIOV_DEV_PRINTF(dev, "creating %d vf devs\n", num_vfs);
> +    for (i = 0; i < num_vfs; i++) {
> +        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname);
> +        if (!dev->exp.sriov_pf.vf[i]) {
> +            SRIOV_DEV_PRINTF(dev, "Failed to create VF %d at devfn %x.%x\n", i,
> +                             PCI_SLOT(devfn), PCI_FUNC(devfn));
> +            num_vfs = i;
> +            break;
> +        }
> +        devfn += vf_stride;
> +    }
> +    dev->exp.sriov_pf.num_vfs = num_vfs;
> +}
> +
> +static void unregister_vfs(PCIDevice *dev)
> +{
> +    Error *local_err = NULL;
> +    uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
> +    uint16_t i;
> +    SRIOV_DEV_PRINTF(dev, "Unregistering %d vf devs\n", num_vfs);
> +    for (i = 0; i < num_vfs; i++) {
> +        object_property_set_bool(OBJECT(&dev->exp.sriov_pf.vf[i]->qdev), false, "realized", &local_err);
> +        if (local_err) {
> +            fprintf(stderr, "Failed to unplug: %s\n",
> +                    error_get_pretty(local_err));
> +            error_free(local_err);

error_report_err

> +        }
> +    }
> +    g_free(dev->exp.sriov_pf.vf);
> +    dev->exp.sriov_pf.vf = NULL;
> +    dev->exp.sriov_pf.num_vfs = 0;
> +}
> +
> +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> +{
> +    uint32_t off;
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +
> +    if (!sriov_cap || address < sriov_cap) {
> +        return;
> +    }
> +    off = address - sriov_cap;
> +    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> +        return;
> +    }
> +
> +    SRIOV_DEV_PRINTF(dev, "cap at 0x%x sriov offset 0x%x val 0x%x len %d\n",
> +                 sriov_cap, off, val, len);
> +
> +    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> +        if (dev->exp.sriov_pf.num_vfs) {
> +            if (!(val & PCI_SRIOV_CTRL_VFE)) {
> +                unregister_vfs(dev);
> +            }
> +        } else {
> +            if (val & PCI_SRIOV_CTRL_VFE) {
> +                register_vfs(dev);
> +            }
> +        }
> +    }
> +}
> +
> +
> +/* Reset SR/IOV VF Enable bit to trigger an unregister of all VFs */
> +void pcie_sriov_pf_disable_vfs(PCIDevice *dev)
> +{
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +    if (sriov_cap) {
> +        uint32_t val = pci_get_byte(dev->config + sriov_cap + PCI_SRIOV_CTRL);
> +        if (val & PCI_SRIOV_CTRL_VFE) {
> +            val &= ~PCI_SRIOV_CTRL_VFE;
> +            pcie_sriov_config_write(dev, sriov_cap + PCI_SRIOV_CTRL, val, 1);
> +        }
> +    }
> +}
> +
> +/* Add optional supported page sizes to the mask of supported page sizes */
> +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize)
> +{
> +    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
> +    uint8_t *wmask = dev->wmask + dev->exp.sriov_cap;
> +
> +    uint16_t sup_pgsize = pci_get_word(cfg + PCI_SRIOV_SUP_PGSIZE);
> +
> +    sup_pgsize |= opt_sup_pgsize;
> +
> +    /* Make sure the new bits are set, and that system page size
> +     * also can be set to any of the new values according to spec:
> +     */
> +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, sup_pgsize);
> +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, sup_pgsize);
> +}
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 551cb3d..2e9d8ba 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -11,8 +11,6 @@
>   /* PCI includes legacy ISA access.  */
>   #include "hw/isa/isa.h"
>
> -#include "hw/pci/pcie.h"
> -
>   /* PCI bus */
>
>   #define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) | ((func) & 0x07))
> @@ -132,6 +130,7 @@ enum {
>   #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
>
>   #include "hw/pci/pci_regs.h"
> +#include "hw/pci/pcie.h"

Why is it moved here?


>
>   /* PCI HEADER_TYPE */
>   #define  PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
> @@ -421,6 +420,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>
> +pcibus_t pci_bar_address(PCIDevice *d,
> +                         int reg, uint8_t type, pcibus_t size);
> +
>   static inline void
>   pci_set_byte(uint8_t *config, uint8_t val)
>   {
> @@ -672,6 +674,11 @@ static inline int pci_is_express(const PCIDevice *d)
>       return d->cap_present & QEMU_PCI_CAP_EXPRESS;
>   }
>
> +static inline int pci_is_vf(const PCIDevice *d)
> +{
> +    return d->exp.sriov_vf.pf != NULL;
> +}
> +
>   static inline uint32_t pci_config_size(const PCIDevice *d)
>   {
>       return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE : PCI_CONFIG_SPACE_SIZE;
> diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> index b48a7a2..b09f79a 100644
> --- a/include/hw/pci/pcie.h
> +++ b/include/hw/pci/pcie.h
> @@ -25,6 +25,7 @@
>   #include "hw/pci/pci_regs.h"
>   #include "hw/pci/pcie_regs.h"
>   #include "hw/pci/pcie_aer.h"
> +#include "hw/pci/pcie_sriov.h"
>   #include "hw/hotplug.h"
>
>   typedef enum {
> @@ -74,6 +75,11 @@ struct PCIExpressDevice {
>       /* AER */
>       uint16_t aer_cap;
>       PCIEAERLog aer_log;
> +
> +    /* SR/IOV */
> +    uint16_t sriov_cap;
> +    PCIESriovPF sriov_pf;
> +    PCIESriovVF sriov_vf;
>   };
>
>   #define COMPAT_PROP_PCP "power_controller_present"
> diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> new file mode 100644
> index 0000000..061f5cf
> --- /dev/null
> +++ b/include/hw/pci/pcie_sriov.h
> @@ -0,0 +1,55 @@
> +/*
> + * pcie_sriov.h:
> + *
> + * Implementation of SR/IOV emulation support.
> + *
> + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>

2015

> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef QEMU_PCIE_SRIOV_H
> +#define QEMU_PCIE_SRIOV_H
> +
> +
> +
> +struct PCIESriovPF {
> +    uint16_t num_vfs;           /* Number of virtual functions created */
> +    uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
> +    const char *vfname;         /* Reference to the device type used for the VFs */
> +    PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
> +};
> +
> +struct PCIESriovVF {
> +    PCIDevice *pf;              /* Pointer back to owner physical function */
> +};
> +
> +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> +                        const char *vfname, uint16_t vf_dev_id,
> +                        uint16_t init_vfs, uint16_t total_vfs,
> +                        uint16_t vf_offset, uint16_t vf_stride);
> +void pcie_sriov_pf_exit(PCIDevice *dev);
> +
> +/* Set up a VF bar in the SR/IOV bar area */
> +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> +                               uint8_t type, dma_addr_t size);
> +
> +/* Instantiate a bar for a VF */
> +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> +                                MemoryRegion *memory);
> +
> +/* Add optional supported page sizes to the mask of supported page sizes
> + * Page size values are interpreted as opt_sup_pgsize << 12.
> + */
> +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize);
> +
> +/* SR/IOV capability config write handler */
> +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> +                             uint32_t val, int len);
> +
> +/* Reset SR/IOV VF Enable bit to unregister all VFs */
> +void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
> +
> +#endif /* QEMU_PCIE_SRIOV_H */
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index f8a9dd6..fa36b8c 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -53,6 +53,8 @@ typedef struct PCIDevice PCIDevice;
>   typedef struct PCIEAERErr PCIEAERErr;
>   typedef struct PCIEAERLog PCIEAERLog;
>   typedef struct PCIEAERMsg PCIEAERMsg;
> +typedef struct PCIESriovPF PCIESriovPF;
> +typedef struct PCIESriovVF PCIESriovVF;
>   typedef struct PCIEPort PCIEPort;
>   typedef struct PCIESlot PCIESlot;
>   typedef struct PCIExpressDevice PCIExpressDevice;
>


I hope I helped,

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices
  2015-09-17 11:11   ` Marcel Apfelbaum
@ 2015-09-17 12:12     ` Knut Omang
  2015-09-17 12:41       ` Marcel Apfelbaum
  0 siblings, 1 reply; 16+ messages in thread
From: Knut Omang @ 2015-09-17 12:12 UTC (permalink / raw)
  To: marcel, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On Thu, 2015-09-17 at 14:11 +0300, Marcel Apfelbaum wrote:
> On 09/12/2015 03:36 PM, Knut Omang wrote:
> > Without this, the devfn argument to pci_create_*()
> > does not affect the assigned devfn.
> > 
> > Needed to support (VF_STRIDE,VF_OFFSET) values other than (1,1)
> > for SR/IOV.
> > 
> > Signed-off-by: Knut Omang <knut.omang@oracle.com>
> > ---
> >   hw/pci/pci.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index ccea628..a5cc015 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -1840,7 +1840,7 @@ static void pci_qdev_realize(DeviceState
> > *qdev, Error **errp)
> >       bus = PCI_BUS(qdev_get_parent_bus(qdev));
> >       pci_dev = do_pci_register_device(pci_dev, bus,
> >                                       
> >  object_get_typename(OBJECT(qdev)),
> > -                                     pci_dev->devfn, errp);
> > +                                    
> >  object_property_get_int(OBJECT(qdev), "addr", NULL), errp);
> Hi,
> 
> I don't get this, using object_property_get_int on "addr" should
> return the value of pci_dev->devfn,
> can you please tell what exactly is not working?

The problem is that at that point pci_dev->devfn has not been set yet -
have commented on this before somewhere..

Knut

> Thanks,
> Marcel
> 
> 
> 
> >       if (pci_dev == NULL)
> >           return;
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices
  2015-09-17 12:12     ` Knut Omang
@ 2015-09-17 12:41       ` Marcel Apfelbaum
  2015-10-14 10:27         ` Knut Omang
  0 siblings, 1 reply; 16+ messages in thread
From: Marcel Apfelbaum @ 2015-09-17 12:41 UTC (permalink / raw)
  To: Knut Omang, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On 09/17/2015 03:12 PM, Knut Omang wrote:
> On Thu, 2015-09-17 at 14:11 +0300, Marcel Apfelbaum wrote:
>> On 09/12/2015 03:36 PM, Knut Omang wrote:
>>> Without this, the devfn argument to pci_create_*()
>>> does not affect the assigned devfn.
>>>
>>> Needed to support (VF_STRIDE,VF_OFFSET) values other than (1,1)
>>> for SR/IOV.
>>>
>>> Signed-off-by: Knut Omang <knut.omang@oracle.com>
>>> ---
>>>    hw/pci/pci.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index ccea628..a5cc015 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -1840,7 +1840,7 @@ static void pci_qdev_realize(DeviceState
>>> *qdev, Error **errp)
>>>        bus = PCI_BUS(qdev_get_parent_bus(qdev));
>>>        pci_dev = do_pci_register_device(pci_dev, bus,
>>>
>>>   object_get_typename(OBJECT(qdev)),
>>> -                                     pci_dev->devfn, errp);
>>> +
>>>   object_property_get_int(OBJECT(qdev), "addr", NULL), errp);
>> Hi,
>>
>> I don't get this, using object_property_get_int on "addr" should
>> return the value of pci_dev->devfn,
>> can you please tell what exactly is not working?
>
> The problem is that at that point pci_dev->devfn has not been set yet -
> have commented on this before somewhere..
>

But "addr" property has the right value? Is indeed strange because it should
get the value from pci_dev->devfn.

Don't get me wrong, this patch is OK.
I just want to understand if we have a hidden bug somewhere.

Anyway,
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>

> Knut
>
>> Thanks,
>> Marcel
>>
>>
>>
>>>        if (pci_dev == NULL)
>>>            return;
>>>
>>>
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2015-09-17 11:49   ` Marcel Apfelbaum
@ 2015-09-17 14:10     ` Knut Omang
  2015-10-07 13:45       ` Marcel Apfelbaum
  2015-10-14 12:25       ` Knut Omang
  0 siblings, 2 replies; 16+ messages in thread
From: Knut Omang @ 2015-09-17 14:10 UTC (permalink / raw)
  To: marcel, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On Thu, 2015-09-17 at 14:49 +0300, Marcel Apfelbaum wrote:
> On 09/12/2015 03:36 PM, Knut Omang wrote:
> > This patch provides the building blocks for creating an SR/IOV
> > PCIe Extended Capability header and register/unregister
> > SR/IOV Virtual Functions.
> > 
> > Signed-off-by: Knut Omang <knut.omang@oracle.com>
> > ---
> >   hw/pci/Makefile.objs        |   2 +-
> >   hw/pci/pci.c                |  99 ++++++++++++----
> >   hw/pci/pcie.c               |   9 +-
> >   hw/pci/pcie_sriov.c         | 271
> > ++++++++++++++++++++++++++++++++++++++++++++
> >   include/hw/pci/pci.h        |  11 +-
> >   include/hw/pci/pcie.h       |   6 +
> >   include/hw/pci/pcie_sriov.h |  55 +++++++++
> >   include/qemu/typedefs.h     |   2 +
> >   8 files changed, 426 insertions(+), 29 deletions(-)
> >   create mode 100644 hw/pci/pcie_sriov.c
> >   create mode 100644 include/hw/pci/pcie_sriov.h
> > 
> > diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
> > index 9f905e6..2226980 100644
> > --- a/hw/pci/Makefile.objs
> > +++ b/hw/pci/Makefile.objs
> > @@ -3,7 +3,7 @@ common-obj-$(CONFIG_PCI) += msix.o msi.o
> >   common-obj-$(CONFIG_PCI) += shpc.o
> >   common-obj-$(CONFIG_PCI) += slotid_cap.o
> >   common-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
> > -common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
> > +common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
> > pcie_sriov.o
> > 
> >   common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
> >   common-obj-$(CONFIG_ALL) += pci-stub.o
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index a5cc015..9c0eba1 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -153,6 +153,9 @@ int pci_bar(PCIDevice *d, int reg)
> >   {
> >       uint8_t type;
> > 
> > +    /* PCIe virtual functions do not have their own BARs */
> > +    assert(!pci_is_vf(d));
> > +
> >       if (reg != PCI_ROM_SLOT)
> >           return PCI_BASE_ADDRESS_0 + reg * 4;
> > 
> > @@ -211,22 +214,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
> >       }
> >   }
> > 
> > -static void pci_do_device_reset(PCIDevice *dev)
> > +static void pci_reset_regions(PCIDevice *dev)
> >   {
> >       int r;
> > +    if (pci_is_vf(dev)) {
> > +        return;
> > +    }
> > 
> > -    pci_device_deassert_intx(dev);
> > -    assert(dev->irq_state == 0);
> > -
> > -    /* Clear all writable bits */
> > -    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> > -                                 pci_get_word(dev->wmask +
> > PCI_COMMAND) |
> > -                                 pci_get_word(dev->w1cmask +
> > PCI_COMMAND));
> > -    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> > -                                 pci_get_word(dev->wmask +
> > PCI_STATUS) |
> > -                                 pci_get_word(dev->w1cmask +
> > PCI_STATUS));
> > -    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> > -    dev->config[PCI_INTERRUPT_LINE] = 0x0;
> >       for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> >           PCIIORegion *region = &dev->io_regions[r];
> >           if (!region->size) {
> > @@ -240,6 +234,27 @@ static void pci_do_device_reset(PCIDevice
> > *dev)
> >               pci_set_long(dev->config + pci_bar(dev, r), region
> > ->type);
> >           }
> >       }
> > +}
> > +
> > +static void pci_do_device_reset(PCIDevice *dev)
> > +{
> > +    qdev_reset_all(&dev->qdev);
> 
> Hi,
> Thank you for resubmitting this series!
> 
> This is only a quick look, I hope I'll have more time next week to go
> over it again.
> 
> 
> > +
> > +    dev->irq_state = 0;
> 
> Are you sure we need the assignment above? It seems that the
> irq_state
> should be modified only by the intx wrappers as 
>  pci_device_deassert_intx and so.
> 
> 
> > +    pci_update_irq_status(dev);
> 
> Why do we have to update the irq status on reset?

I struggled a lot with interrupts and PF disapperance in earlier
versions, this may be unintended leftovers from rebase.

The intention was to avoid the qdev removal which caused problems with
unintended "hot unplug of the PF" at VF removal earlier, but after some
of the more recent QOM'ification all those problems disappeared.

I tried without these lines and the do_device_reset refactor and it
seems to work just fine, will fix that in v5, thanks!

> > +    pci_device_deassert_intx(dev);
> > +    assert(dev->irq_state == 0);
> > +
> > +    /* Clear all writable bits */
> > +    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> > +                                 pci_get_word(dev->wmask +
> > PCI_COMMAND) |
> > +                                 pci_get_word(dev->w1cmask +
> > PCI_COMMAND));
> > +    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> > +                                 pci_get_word(dev->wmask +
> > PCI_STATUS) |
> > +                                 pci_get_word(dev->w1cmask +
> > PCI_STATUS));
> > +    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> > +    dev->config[PCI_INTERRUPT_LINE] = 0x0;
> > +    pci_reset_regions(dev);
> >       pci_update_mappings(dev);
> > 
> >       msi_reset(dev);
> > @@ -771,6 +786,15 @@ static void pci_init_multifunction(PCIBus
> > *bus, PCIDevice *dev, Error **errp)
> >           dev->config[PCI_HEADER_TYPE] |=
> > PCI_HEADER_TYPE_MULTI_FUNCTION;
> >       }
> > 
> > +    /* With SR/IOV and ARI, a device at function 0 need not be a
> > multifunction
> > +     * device, as it may just be a VF that ended up with function
> > 0 in
> > +     * the legacy PCI interpretation. Avoid failing in such cases:
> > +     */
> > +    if (pci_is_vf(dev) &&
> > +        dev->exp.sriov_vf.pf->cap_present &
> > QEMU_PCI_CAP_MULTIFUNCTION) {
> > +        return;
> > +    }
> > +
> >       /*
> >        * multifunction bit is interpreted in two ways as follows.
> >        *   - all functions must set the bit to 1.
> > @@ -962,6 +986,7 @@ void pci_register_bar(PCIDevice *pci_dev, int
> > region_num,
> >       uint64_t wmask;
> >       pcibus_t size = memory_region_size(memory);
> > 
> > +    assert(!pci_is_vf(pci_dev)); /* VFs must use
> > pcie_sriov_vf_register_bar */
> >       assert(region_num >= 0);
> >       assert(region_num < PCI_NUM_REGIONS);
> >       if (size & (size-1)) {
> > @@ -1060,11 +1085,44 @@ pcibus_t pci_get_bar_addr(PCIDevice
> > *pci_dev, int region_num)
> >       return pci_dev->io_regions[region_num].addr;
> >   }
> > 
> > -static pcibus_t pci_bar_address(PCIDevice *d,
> > -				int reg, uint8_t type, pcibus_t
> > size)
> > +
> > +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> > +                                        uint8_t type, pcibus_t
> > size)
> > +{
> > +    pcibus_t new_addr;
> > +    if (!pci_is_vf(d)) {
> > +        int bar = pci_bar(d, reg);
> > +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > +            new_addr = pci_get_quad(d->config + bar);
> > +        } else {
> > +            new_addr = pci_get_long(d->config + bar);
> > +        }
> > +    } else {
> > +        PCIDevice *pf = d->exp.sriov_vf.pf;
> > +        uint16_t sriov_cap = pf->exp.sriov_cap;
> > +        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
> > +        uint16_t vf_offset = pci_get_word(pf->config + sriov_cap +
> > PCI_SRIOV_VF_OFFSET);
> > +        uint16_t vf_stride = pci_get_word(pf->config + sriov_cap +
> > PCI_SRIOV_VF_STRIDE);
> > +        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) /
> > vf_stride;
> > +
> > +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > +            new_addr = pci_get_quad(pf->config + bar);
> > +        } else {
> > +            new_addr = pci_get_long(pf->config + bar);
> > +        }
> > +        new_addr += vf_num * size;
> > +    }
> > +    if (reg != PCI_ROM_SLOT) {
> > +        /* Preserve the rom enable bit */
> > +        new_addr &= ~(size - 1);
> > +    }
> > +    return new_addr;
> > +}
> > +
> > +pcibus_t pci_bar_address(PCIDevice *d,
> > +                         int reg, uint8_t type, pcibus_t size)
> >   {
> >       pcibus_t new_addr, last_addr;
> > -    int bar = pci_bar(d, reg);
> >       uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
> >       Object *machine = qdev_get_machine();
> >       ObjectClass *oc = object_get_class(machine);
> > @@ -1075,7 +1133,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> >           if (!(cmd & PCI_COMMAND_IO)) {
> >               return PCI_BAR_UNMAPPED;
> >           }
> > -        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> > +        new_addr = pci_config_get_bar_addr(d, reg, type, size);
> >           last_addr = new_addr + size - 1;
> >           /* Check if 32 bit BAR wraps around explicitly.
> >            * TODO: make priorities correct and remove this work
> > around.
> > @@ -1090,11 +1148,7 @@ static pcibus_t pci_bar_address(PCIDevice
> > *d,
> >       if (!(cmd & PCI_COMMAND_MEMORY)) {
> >           return PCI_BAR_UNMAPPED;
> >       }
> > -    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > -        new_addr = pci_get_quad(d->config + bar);
> > -    } else {
> > -        new_addr = pci_get_long(d->config + bar);
> > -    }
> > +    new_addr = pci_config_get_bar_addr(d, reg, type, size);
> >       /* the ROM slot has a specific enable bit */
> >       if (reg == PCI_ROM_SLOT && !(new_addr &
> > PCI_ROM_ADDRESS_ENABLE)) {
> >           return PCI_BAR_UNMAPPED;
> > @@ -1228,6 +1282,7 @@ void pci_default_write_config(PCIDevice *d,
> > uint32_t addr, uint32_t val_in, int
> > 
> >       msi_write_config(d, addr, val_in, l);
> >       msix_write_config(d, addr, val_in, l);
> > +    pcie_sriov_config_write(d, addr, val_in, l);
> >   }
> > 
> >   /***********************************************************/
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 6e28985..ba49c0f 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -253,7 +253,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler
> > *hotplug_dev, DeviceState *dev,
> >        * Right now, only a device of function = 0 is allowed to be
> >        * hot plugged/unplugged.
> >        */
> > -    assert(PCI_FUNC(pci_dev->devfn) == 0);
> > +    assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_is_vf(pci_dev));
> > 
> >       pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
> >                                  PCI_EXP_SLTSTA_PDS);
> > @@ -265,10 +265,11 @@ void
> > pcie_cap_slot_hot_unplug_request_cb(HotplugHandler *hotplug_dev,
> >                                            DeviceState *dev, Error
> > **errp)
> >   {
> >       uint8_t *exp_cap;
> > +    PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
> > 
> > -    pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev,
> > &exp_cap, errp);
> > +    pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
> > 
> > -    pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
> > +    pcie_cap_slot_push_attention_button(pdev);
> >   }
> 
> This chunk is not directly liked to the patch, I would put it in a
> different patch.

ok

> >  /* pci express slot for pci express root/downstream port
> > @@ -408,7 +409,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
> >       }
> > 
> >       /*
> > -     * If the slot is polulated, power indicator is off and power
> > +     * If the slot is populated, power indicator is off and power
> >        * controller is off, it is safe to detach the devices.
> >        */
> >       if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val &
> > PCI_EXP_SLTCTL_PCC) &&
> 
> 
> Same here. I am always happy to have this kind of types taken care
> of,
> but I think a separate patch would be cleaner.

Would you like it in the patch set as a separate patch or should I
schedule it for a trivial patches set instead, maybe together with the
 PCI_DEVICE thing?

> > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > new file mode 100644
> > index 0000000..14f31cc
> > --- /dev/null
> > +++ b/hw/pci/pcie_sriov.c
> > @@ -0,0 +1,271 @@
> > +/*
> > + * pcie_sriov.c:
> > + *
> > + * Implementation of SR/IOV emulation support.
> > + *
> > + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
> 
> 2015 :)

thanks,

> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2
> > or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include "hw/pci/pci.h"
> > +#include "hw/pci/pcie.h"
> > +#include "hw/pci/pci_bus.h"
> > +#include "qemu/range.h"
> > +
> > +//#define DEBUG_SRIOV
> > +#ifdef DEBUG_SRIOV
> > +# define SRIOV_DPRINTF(fmt, ...)                                  
> >        \
> > +    fprintf(stderr, "%s:%d " fmt, __func__, __LINE__, ##
> > __VA_ARGS__)
> > +#else
> > +# define SRIOV_DPRINTF(fmt, ...) do {} while (0)
> > +#endif
> > +#define SRIOV_DEV_PRINTF(dev, fmt, ...)                           
> >        \
> > +    SRIOV_DPRINTF("%s:%x "fmt, (dev)->name, (dev)->devfn, ##
> > __VA_ARGS__)
> > +
> > +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char
> > *name);
> > +static void unregister_vfs(PCIDevice *dev);
> > +
> > +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > +                        const char *vfname, uint16_t vf_dev_id,
> > +                        uint16_t init_vfs, uint16_t total_vfs,
> > +                        uint16_t vf_offset, uint16_t vf_stride)
> > +{
> > +    uint8_t *cfg = dev->config + offset;
> > +    uint8_t *wmask;
> > +
> > +    pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
> > +                        offset, PCI_EXT_CAP_SRIOV_SIZEOF);
> > +    dev->exp.sriov_cap = offset;
> > +    dev->exp.sriov_pf.num_vfs = 0;
> > +    dev->exp.sriov_pf.vfname = g_strdup(vfname);
> > +    dev->exp.sriov_pf.vf = NULL;
> > +
> > +    pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> > +    pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> > +
> > +    /* Minimal page size support values required by the SR/IOV
> > standard:
> > +     * 0x553 << 12 = 0x553000 = 4K + 8K + 64K + 256K + 1M + 4M
> > +     * Hard coded for simplicity in this version
> > +     */
> 
> A define (with the above very good explanation) would be preferred.

Will do

> > +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, 0x553);
> > +
> > +    /* Default is to use 4K pages, software can modify it
> > +     * to any of the supported bits
> > +     */
> > +    pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
> > +
> > +    /* Set up device ID and initial/total number of VFs available
> > */
> > +    pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
> > +    pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
> > +    pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
> > +    pci_set_word(cfg + PCI_SRIOV_NUM_VF, 0);
> > +
> > +    /* Write enable control bits */
> > +    wmask = dev->wmask + offset;
> > +    pci_set_word(wmask + PCI_SRIOV_CTRL,
> > +                 PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE |
> > PCI_SRIOV_CTRL_ARI);
> > +    pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
> > +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
> > +
> > +    qdev_prop_set_bit(&dev->qdev, "multifunction", true);
> > +}
> > +
> > +void pcie_sriov_pf_exit(PCIDevice *dev)
> > +{
> > +    SRIOV_DPRINTF("\n");
> > +    unregister_vfs(dev);
> > +    g_free((char *)dev->exp.sriov_pf.vfname);
> > +    dev->exp.sriov_pf.vfname = NULL;
> > +}
> > +
> > +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> > +                               uint8_t type, dma_addr_t size)
> > +{
> > +    uint32_t addr;
> > +    uint64_t wmask;
> > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > +
> > +    assert(sriov_cap > 0);
> > +    assert(region_num >= 0);
> > +    assert(region_num < PCI_NUM_REGIONS);
> > +    assert(region_num != PCI_ROM_SLOT);
> > +
> > +    wmask = ~(size - 1);
> > +    addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
> > +
> > +    pci_set_long(dev->config + addr, type);
> > +    if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
> > +        type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > +        pci_set_quad(dev->wmask + addr, wmask);
> > +        pci_set_quad(dev->cmask + addr, ~0ULL);
> > +    } else {
> > +        pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
> > +        pci_set_long(dev->cmask + addr, 0xffffffff);
> > +    }
> > +    dev->exp.sriov_pf.vf_bar_type[region_num] = type;
> > +}
> > +
> > +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> > +                                MemoryRegion *memory)
> > +{
> > +    PCIIORegion *r;
> > +    uint8_t type;
> > +    pcibus_t size = memory_region_size(memory);
> > +
> > +    assert(pci_is_vf(dev)); /* PFs must use pci_register_bar */
> > +    assert(region_num >= 0);
> > +    assert(region_num < PCI_NUM_REGIONS);
> > +    type = dev->exp.sriov_vf.pf
> > ->exp.sriov_pf.vf_bar_type[region_num];
> > +
> > +    if (size & (size-1)) {
> 
> We already have a  is_power_of_2 func defined in include/qemu/host
> -utils.h

will fix,

> > +        fprintf(stderr, "ERROR: PCI region size must be a power"
> > +                " of two - type=0x%x, size=0x%"FMT_PCIBUS"\n",
> > type, size);
> 
> Maybe we should error_report instead of fprintf

Leftover from far too many rebases, will fix,

> > +        exit(1);
> > +    }
> > +
> > +    r = &dev->io_regions[region_num];
> > +    r->memory = memory;
> > +    r->address_space =
> > +        type & PCI_BASE_ADDRESS_SPACE_IO
> > +        ? dev->bus->address_space_io
> > +        : dev->bus->address_space_mem;
> > +    r->size = size;
> > +    r->type = type;
> > +
> > +    r->addr = pci_bar_address(dev, region_num, r->type, r->size);
> > +    if (r->addr != PCI_BAR_UNMAPPED) {
> > +        memory_region_add_subregion_overlap(r->address_space,
> > +                                            r->addr, r->memory,
> > 1);
> > +    }
> > +}
> > +
> > +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char
> > *name)
> > +{
> > +    PCIDevice *dev = pci_create(pf->bus, devfn, name);
> > +    dev->exp.sriov_vf.pf = pf;
> > +    Error *local_err = NULL;
> > +
> > +    object_property_set_bool(OBJECT(&dev->qdev), true, "realized",
> > &local_err);
> > +    if (local_err) {
> > +        fprintf(stderr, "Failed to enable: %s\n",
> > +                error_get_pretty(local_err));
> > +        error_free(local_err);
> 
> and error_report_err here

ok

> > +        return NULL;
> > +    }
> > +
> > +    /* set vid/did according to sr/iov spec - they are not used */
> > +    pci_config_set_vendor_id(dev->config, 0xffff);
> > +    pci_config_set_device_id(dev->config, 0xffff);
> > +    return dev;
> > +}
> > +
> > +static void register_vfs(PCIDevice *dev)
> > +{
> > +    uint16_t num_vfs;
> > +    uint16_t i;
> > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > +    uint16_t vf_offset = pci_get_word(dev->config + sriov_cap +
> > PCI_SRIOV_VF_OFFSET);
> > +    uint16_t vf_stride = pci_get_word(dev->config + sriov_cap +
> > PCI_SRIOV_VF_STRIDE);
> > +    int32_t devfn = dev->devfn + vf_offset;
> > +
> > +    assert(sriov_cap > 0);
> > +    num_vfs = pci_get_word(dev->config + sriov_cap +
> > PCI_SRIOV_NUM_VF);
> > +
> > +    dev->exp.sriov_pf.vf = g_malloc(sizeof(PCIDevice *) *
> > num_vfs);
> > +    assert(dev->exp.sriov_pf.vf);
> > +
> > +    SRIOV_DEV_PRINTF(dev, "creating %d vf devs\n", num_vfs);
> > +    for (i = 0; i < num_vfs; i++) {
> > +        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev
> > ->exp.sriov_pf.vfname);
> > +        if (!dev->exp.sriov_pf.vf[i]) {
> > +            SRIOV_DEV_PRINTF(dev, "Failed to create VF %d at devfn
> > %x.%x\n", i,
> > +                             PCI_SLOT(devfn), PCI_FUNC(devfn));
> > +            num_vfs = i;
> > +            break;
> > +        }
> > +        devfn += vf_stride;
> > +    }
> > +    dev->exp.sriov_pf.num_vfs = num_vfs;
> > +}
> > +
> > +static void unregister_vfs(PCIDevice *dev)
> > +{
> > +    Error *local_err = NULL;
> > +    uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
> > +    uint16_t i;
> > +    SRIOV_DEV_PRINTF(dev, "Unregistering %d vf devs\n", num_vfs);
> > +    for (i = 0; i < num_vfs; i++) {
> > +        object_property_set_bool(OBJECT(&dev->exp.sriov_pf.vf[i]
> > ->qdev), false, "realized", &local_err);
> > +        if (local_err) {
> > +            fprintf(stderr, "Failed to unplug: %s\n",
> > +                    error_get_pretty(local_err));
> > +            error_free(local_err);
> 
> error_report_err
> 
> > +        }
> > +    }
> > +    g_free(dev->exp.sriov_pf.vf);
> > +    dev->exp.sriov_pf.vf = NULL;
> > +    dev->exp.sriov_pf.num_vfs = 0;
> > +}
> > +
> > +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> > uint32_t val, int len)
> > +{
> > +    uint32_t off;
> > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > +
> > +    if (!sriov_cap || address < sriov_cap) {
> > +        return;
> > +    }
> > +    off = address - sriov_cap;
> > +    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> > +        return;
> > +    }
> > +
> > +    SRIOV_DEV_PRINTF(dev, "cap at 0x%x sriov offset 0x%x val 0x%x
> > len %d\n",
> > +                 sriov_cap, off, val, len);
> > +
> > +    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> > +        if (dev->exp.sriov_pf.num_vfs) {
> > +            if (!(val & PCI_SRIOV_CTRL_VFE)) {
> > +                unregister_vfs(dev);
> > +            }
> > +        } else {
> > +            if (val & PCI_SRIOV_CTRL_VFE) {
> > +                register_vfs(dev);
> > +            }
> > +        }
> > +    }
> > +}
> > +
> > +
> > +/* Reset SR/IOV VF Enable bit to trigger an unregister of all VFs
> > */
> > +void pcie_sriov_pf_disable_vfs(PCIDevice *dev)
> > +{
> > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > +    if (sriov_cap) {
> > +        uint32_t val = pci_get_byte(dev->config + sriov_cap +
> > PCI_SRIOV_CTRL);
> > +        if (val & PCI_SRIOV_CTRL_VFE) {
> > +            val &= ~PCI_SRIOV_CTRL_VFE;
> > +            pcie_sriov_config_write(dev, sriov_cap +
> > PCI_SRIOV_CTRL, val, 1);
> > +        }
> > +    }
> > +}
> > +
> > +/* Add optional supported page sizes to the mask of supported page
> > sizes */
> > +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t
> > opt_sup_pgsize)
> > +{
> > +    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
> > +    uint8_t *wmask = dev->wmask + dev->exp.sriov_cap;
> > +
> > +    uint16_t sup_pgsize = pci_get_word(cfg +
> > PCI_SRIOV_SUP_PGSIZE);
> > +
> > +    sup_pgsize |= opt_sup_pgsize;
> > +
> > +    /* Make sure the new bits are set, and that system page size
> > +     * also can be set to any of the new values according to spec:
> > +     */
> > +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, sup_pgsize);
> > +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, sup_pgsize);
> > +}
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index 551cb3d..2e9d8ba 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -11,8 +11,6 @@
> >   /* PCI includes legacy ISA access.  */
> >   #include "hw/isa/isa.h"
> > 
> > -#include "hw/pci/pcie.h"
> > -
> >   /* PCI bus */
> > 
> >   #define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) | ((func)
> > & 0x07))
> > @@ -132,6 +130,7 @@ enum {
> >   #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
> > 
> >   #include "hw/pci/pci_regs.h"
> > +#include "hw/pci/pcie.h"
> 
> Why is it moved here?

Probably an unintended side effect of rebasing..

> >   /* PCI HEADER_TYPE */
> >   #define  PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
> > @@ -421,6 +420,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *,
> > void *, int);
> >   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> >   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
> > 
> > +pcibus_t pci_bar_address(PCIDevice *d,
> > +                         int reg, uint8_t type, pcibus_t size);
> > +
> >   static inline void
> >   pci_set_byte(uint8_t *config, uint8_t val)
> >   {
> > @@ -672,6 +674,11 @@ static inline int pci_is_express(const
> > PCIDevice *d)
> >       return d->cap_present & QEMU_PCI_CAP_EXPRESS;
> >   }
> > 
> > +static inline int pci_is_vf(const PCIDevice *d)
> > +{
> > +    return d->exp.sriov_vf.pf != NULL;
> > +}
> > +
> >   static inline uint32_t pci_config_size(const PCIDevice *d)
> >   {
> >       return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE :
> > PCI_CONFIG_SPACE_SIZE;
> > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > index b48a7a2..b09f79a 100644
> > --- a/include/hw/pci/pcie.h
> > +++ b/include/hw/pci/pcie.h
> > @@ -25,6 +25,7 @@
> >   #include "hw/pci/pci_regs.h"
> >   #include "hw/pci/pcie_regs.h"
> >   #include "hw/pci/pcie_aer.h"
> > +#include "hw/pci/pcie_sriov.h"
> >   #include "hw/hotplug.h"
> > 
> >   typedef enum {
> > @@ -74,6 +75,11 @@ struct PCIExpressDevice {
> >       /* AER */
> >       uint16_t aer_cap;
> >       PCIEAERLog aer_log;
> > +
> > +    /* SR/IOV */
> > +    uint16_t sriov_cap;
> > +    PCIESriovPF sriov_pf;
> > +    PCIESriovVF sriov_vf;
> >   };
> > 
> >   #define COMPAT_PROP_PCP "power_controller_present"
> > diff --git a/include/hw/pci/pcie_sriov.h
> > b/include/hw/pci/pcie_sriov.h
> > new file mode 100644
> > index 0000000..061f5cf
> > --- /dev/null
> > +++ b/include/hw/pci/pcie_sriov.h
> > @@ -0,0 +1,55 @@
> > +/*
> > + * pcie_sriov.h:
> > + *
> > + * Implementation of SR/IOV emulation support.
> > + *
> > + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
> 
> 2015
> 
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2
> > or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#ifndef QEMU_PCIE_SRIOV_H
> > +#define QEMU_PCIE_SRIOV_H
> > +
> > +
> > +
> > +struct PCIESriovPF {
> > +    uint16_t num_vfs;           /* Number of virtual functions
> > created */
> > +    uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each
> > VF bar */
> > +    const char *vfname;         /* Reference to the device type
> > used for the VFs */
> > +    PCIDevice **vf;             /* Pointer to an array of num_vfs
> > VF devices */
> > +};
> > +
> > +struct PCIESriovVF {
> > +    PCIDevice *pf;              /* Pointer back to owner physical
> > function */
> > +};
> > +
> > +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > +                        const char *vfname, uint16_t vf_dev_id,
> > +                        uint16_t init_vfs, uint16_t total_vfs,
> > +                        uint16_t vf_offset, uint16_t vf_stride);
> > +void pcie_sriov_pf_exit(PCIDevice *dev);
> > +
> > +/* Set up a VF bar in the SR/IOV bar area */
> > +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> > +                               uint8_t type, dma_addr_t size);
> > +
> > +/* Instantiate a bar for a VF */
> > +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> > +                                MemoryRegion *memory);
> > +
> > +/* Add optional supported page sizes to the mask of supported page
> > sizes
> > + * Page size values are interpreted as opt_sup_pgsize << 12.
> > + */
> > +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t
> > opt_sup_pgsize);
> > +
> > +/* SR/IOV capability config write handler */
> > +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> > +                             uint32_t val, int len);
> > +
> > +/* Reset SR/IOV VF Enable bit to unregister all VFs */
> > +void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
> > +
> > +#endif /* QEMU_PCIE_SRIOV_H */
> > diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> > index f8a9dd6..fa36b8c 100644
> > --- a/include/qemu/typedefs.h
> > +++ b/include/qemu/typedefs.h
> > @@ -53,6 +53,8 @@ typedef struct PCIDevice PCIDevice;
> >   typedef struct PCIEAERErr PCIEAERErr;
> >   typedef struct PCIEAERLog PCIEAERLog;
> >   typedef struct PCIEAERMsg PCIEAERMsg;
> > +typedef struct PCIESriovPF PCIESriovPF;
> > +typedef struct PCIESriovVF PCIESriovVF;
> >   typedef struct PCIEPort PCIEPort;
> >   typedef struct PCIESlot PCIESlot;
> >   typedef struct PCIExpressDevice PCIExpressDevice;
> > 
> 
> 
> I hope I helped,

Definitely, 
thanks for the review!

Knut

> Thanks,
> Marcel
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/3] pci: Update pci_regs header
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 2/3] pci: Update pci_regs header Knut Omang
@ 2015-10-07 13:32   ` Marcel Apfelbaum
  2015-10-10 10:46     ` Knut Omang
  0 siblings, 1 reply; 16+ messages in thread
From: Marcel Apfelbaum @ 2015-10-07 13:32 UTC (permalink / raw)
  To: Knut Omang, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On 09/12/2015 03:36 PM, Knut Omang wrote:
> Merge in new definitions from kernel v4.2
> Adds definition necessary to support emulated SR/IOV.

In the meantime Paolo updated this header to 4.3-RC1, is this patch still needed?

Thanks,
Marcel

>
> Signed-off-by: Knut Omang <knut.omang@oracle.com>
> ---
>   include/standard-headers/linux/pci_regs.h | 142 +++++++++++++++++++++++++-----
>   1 file changed, 118 insertions(+), 24 deletions(-)
>
> diff --git a/include/standard-headers/linux/pci_regs.h b/include/standard-headers/linux/pci_regs.h
> index 57e8c80..3d56c1e 100644
> --- a/include/standard-headers/linux/pci_regs.h
> +++ b/include/standard-headers/linux/pci_regs.h
> @@ -304,13 +304,14 @@
>   #define PCI_MSI_PENDING_64	20	/* Pending bits register for 32-bit devices */
>
>   /* MSI-X registers */
> -#define PCI_MSIX_FLAGS		2
> -#define  PCI_MSIX_FLAGS_QSIZE	0x7FF
> -#define  PCI_MSIX_FLAGS_ENABLE	(1 << 15)
> -#define  PCI_MSIX_FLAGS_MASKALL	(1 << 14)
> -#define PCI_MSIX_TABLE		4
> -#define PCI_MSIX_PBA		8
> +#define PCI_MSIX_FLAGS		2	/* Message Control */
> +#define  PCI_MSIX_FLAGS_QSIZE	0x7FF	/* Table size */
> +#define  PCI_MSIX_FLAGS_ENABLE	(1 << 15) /* MSI-X enable */
> +#define  PCI_MSIX_FLAGS_MASKALL	(1 << 14) /* Mask all vectors for this function */
> +#define PCI_MSIX_TABLE		4	/* Table offset */
> +#define PCI_MSIX_PBA		8	/* Pending Bit Array offset */
>   #define  PCI_MSIX_FLAGS_BIRMASK	(7 << 0)
> +#define PCI_CAP_MSIX_SIZEOF	12	/* size of MSIX capability structure */
>
>   /* MSI-X entry's format */
>   #define PCI_MSIX_ENTRY_SIZE		16
> @@ -517,31 +518,63 @@
>   #define  PCI_EXP_OBFF_MSG	0x40000 /* New message signaling */
>   #define  PCI_EXP_OBFF_WAKE	0x80000 /* Re-use WAKE# for OBFF */
>   #define PCI_EXP_DEVCTL2		40	/* Device Control 2 */
> -#define  PCI_EXP_DEVCTL2_ARI	0x20	/* Alternative Routing-ID */
> -#define  PCI_EXP_IDO_REQ_EN	0x100	/* ID-based ordering request enable */
> -#define  PCI_EXP_IDO_CMP_EN	0x200	/* ID-based ordering completion enable */
> -#define  PCI_EXP_LTR_EN		0x400	/* Latency tolerance reporting */
> -#define  PCI_EXP_OBFF_MSGA_EN	0x2000	/* OBFF enable with Message type A */
> -#define  PCI_EXP_OBFF_MSGB_EN	0x4000	/* OBFF enable with Message type B */
> -#define  PCI_EXP_OBFF_WAKE_EN	0x6000	/* OBFF using WAKE# signaling */
> +#define  PCI_EXP_DEVCTL2_COMP_TIMEOUT	0x000f	/* Completion Timeout Value */
> +#define  PCI_EXP_DEVCTL2_ARI		0x0020	/* Alternative Routing-ID */
> +#define  PCI_EXP_DEVCTL2_IDO_REQ_EN	0x0100	/* Allow IDO for requests */
> +#define  PCI_EXP_DEVCTL2_IDO_CMP_EN	0x0200	/* Allow IDO for completions */
> +#define  PCI_EXP_DEVCTL2_LTR_EN		0x0400	/* Enable LTR mechanism */
> +#define  PCI_EXP_DEVCTL2_OBFF_MSGA_EN	0x2000	/* Enable OBFF Message type A */
> +#define  PCI_EXP_DEVCTL2_OBFF_MSGB_EN	0x4000	/* Enable OBFF Message type B */
> +#define  PCI_EXP_DEVCTL2_OBFF_WAKE_EN	0x6000	/* OBFF using WAKE# signaling */
> +#define PCI_EXP_DEVSTA2		42	/* Device Status 2 */
> +#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	44	/* v2 endpoints end here */
> +#define PCI_EXP_LNKCAP2		44	/* Link Capabilities 2 */
> +#define  PCI_EXP_LNKCAP2_SLS_2_5GB	0x00000002 /* Supported Speed 2.5GT/s */
> +#define  PCI_EXP_LNKCAP2_SLS_5_0GB	0x00000004 /* Supported Speed 5.0GT/s */
> +#define  PCI_EXP_LNKCAP2_SLS_8_0GB	0x00000008 /* Supported Speed 8.0GT/s */
> +#define  PCI_EXP_LNKCAP2_CROSSLINK	0x00000100 /* Crosslink supported */
>   #define PCI_EXP_LNKCTL2		48	/* Link Control 2 */
> -#define PCI_EXP_SLTCTL2		56	/* Slot Control 2 */
> +#define PCI_EXP_LNKSTA2		50	/* Link Status 2 */
> +#define PCI_EXP_SLTCAP2		52	/* Slot Capabilities 2 */
> +#define PCI_EXP_SLTCTL2 	56	/* Slot Control 2 */
> +#define PCI_EXP_SLTSTA2		58	/* Slot Status 2 */
>
>   /* Extended Capabilities (PCI-X 2.0 and Express) */
>   #define PCI_EXT_CAP_ID(header)		(header & 0x0000ffff)
>   #define PCI_EXT_CAP_VER(header)		((header >> 16) & 0xf)
>   #define PCI_EXT_CAP_NEXT(header)	((header >> 20) & 0xffc)
>
> -#define PCI_EXT_CAP_ID_ERR	1
> -#define PCI_EXT_CAP_ID_VC	2
> -#define PCI_EXT_CAP_ID_DSN	3
> -#define PCI_EXT_CAP_ID_PWR	4
> -#define PCI_EXT_CAP_ID_VNDR	11
> -#define PCI_EXT_CAP_ID_ACS	13
> -#define PCI_EXT_CAP_ID_ARI	14
> -#define PCI_EXT_CAP_ID_ATS	15
> -#define PCI_EXT_CAP_ID_SRIOV	16
> -#define PCI_EXT_CAP_ID_LTR	24
> +#define PCI_EXT_CAP_ID_ERR	0x01	/* Advanced Error Reporting */
> +#define PCI_EXT_CAP_ID_VC	0x02	/* Virtual Channel Capability */
> +#define PCI_EXT_CAP_ID_DSN	0x03	/* Device Serial Number */
> +#define PCI_EXT_CAP_ID_PWR	0x04	/* Power Budgeting */
> +#define PCI_EXT_CAP_ID_RCLD	0x05	/* Root Complex Link Declaration */
> +#define PCI_EXT_CAP_ID_RCILC	0x06	/* Root Complex Internal Link Control */
> +#define PCI_EXT_CAP_ID_RCEC	0x07	/* Root Complex Event Collector */
> +#define PCI_EXT_CAP_ID_MFVC	0x08	/* Multi-Function VC Capability */
> +#define PCI_EXT_CAP_ID_VC9	0x09	/* same as _VC */
> +#define PCI_EXT_CAP_ID_RCRB	0x0A	/* Root Complex RB? */
> +#define PCI_EXT_CAP_ID_VNDR	0x0B	/* Vendor-Specific */
> +#define PCI_EXT_CAP_ID_CAC	0x0C	/* Config Access - obsolete */
> +#define PCI_EXT_CAP_ID_ACS	0x0D	/* Access Control Services */
> +#define PCI_EXT_CAP_ID_ARI	0x0E	/* Alternate Routing ID */
> +#define PCI_EXT_CAP_ID_ATS	0x0F	/* Address Translation Services */
> +#define PCI_EXT_CAP_ID_SRIOV	0x10	/* Single Root I/O Virtualization */
> +#define PCI_EXT_CAP_ID_MRIOV	0x11	/* Multi Root I/O Virtualization */
> +#define PCI_EXT_CAP_ID_MCAST	0x12	/* Multicast */
> +#define PCI_EXT_CAP_ID_PRI	0x13	/* Page Request Interface */
> +#define PCI_EXT_CAP_ID_AMD_XXX	0x14	/* Reserved for AMD */
> +#define PCI_EXT_CAP_ID_REBAR	0x15	/* Resizable BAR */
> +#define PCI_EXT_CAP_ID_DPA	0x16	/* Dynamic Power Allocation */
> +#define PCI_EXT_CAP_ID_TPH	0x17	/* TPH Requester */
> +#define PCI_EXT_CAP_ID_LTR	0x18	/* Latency Tolerance Reporting */
> +#define PCI_EXT_CAP_ID_SECPCI	0x19	/* Secondary PCIe Capability */
> +#define PCI_EXT_CAP_ID_PMUX	0x1A	/* Protocol Multiplexing */
> +#define PCI_EXT_CAP_ID_PASID	0x1B	/* Process Address Space ID */
> +#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PASID
> +
> +#define PCI_EXT_CAP_DSN_SIZEOF	12
> +#define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
>
>   /* Advanced Error Reporting */
>   #define PCI_ERR_UNCOR_STATUS	4	/* Uncorrectable Error Status */
> @@ -556,6 +589,11 @@
>   #define  PCI_ERR_UNC_MALF_TLP	0x00040000	/* Malformed TLP */
>   #define  PCI_ERR_UNC_ECRC	0x00080000	/* ECRC Error Status */
>   #define  PCI_ERR_UNC_UNSUP	0x00100000	/* Unsupported Request */
> +#define  PCI_ERR_UNC_ACSV	0x00200000	/* ACS Violation */
> +#define  PCI_ERR_UNC_INTN	0x00400000	/* internal error */
> +#define  PCI_ERR_UNC_MCBTLP	0x00800000	/* MC blocked TLP */
> +#define  PCI_ERR_UNC_ATOMEG	0x01000000	/* Atomic egress blocked */
> +#define  PCI_ERR_UNC_TLPPRE	0x02000000	/* TLP prefix blocked */
>   #define PCI_ERR_UNCOR_MASK	8	/* Uncorrectable Error Mask */
>   	/* Same bits as above */
>   #define PCI_ERR_UNCOR_SEVER	12	/* Uncorrectable Error Severity */
> @@ -657,6 +695,7 @@
>   #define  PCI_ARI_CTRL_MFVC	0x0001	/* MFVC Function Groups Enable */
>   #define  PCI_ARI_CTRL_ACS	0x0002	/* ACS Function Groups Enable */
>   #define  PCI_ARI_CTRL_FG(x)	(((x) >> 4) & 7) /* Function Group */
> +#define PCI_EXT_CAP_ARI_SIZEOF	8
>
>   /* Address Translation Service */
>   #define PCI_ATS_CAP		0x04	/* ATS Capability Register */
> @@ -666,6 +705,29 @@
>   #define  PCI_ATS_CTRL_ENABLE	0x8000	/* ATS Enable */
>   #define  PCI_ATS_CTRL_STU(x)	((x) & 0x1f)	/* Smallest Translation Unit */
>   #define  PCI_ATS_MIN_STU	12	/* shift of minimum STU block */
> +#define PCI_EXT_CAP_ATS_SIZEOF	8
> +
> +/* Page Request Interface */
> +#define PCI_PRI_CTRL		0x04	/* PRI control register */
> +#define  PCI_PRI_CTRL_ENABLE	0x01	/* Enable */
> +#define  PCI_PRI_CTRL_RESET	0x02	/* Reset */
> +#define PCI_PRI_STATUS		0x06	/* PRI status register */
> +#define  PCI_PRI_STATUS_RF	0x001	/* Response Failure */
> +#define  PCI_PRI_STATUS_UPRGI	0x002	/* Unexpected PRG index */
> +#define  PCI_PRI_STATUS_STOPPED	0x100	/* PRI Stopped */
> +#define PCI_PRI_MAX_REQ		0x08	/* PRI max reqs supported */
> +#define PCI_PRI_ALLOC_REQ	0x0c	/* PRI max reqs allowed */
> +#define PCI_EXT_CAP_PRI_SIZEOF	16
> +
> +/* Process Address Space ID */
> +#define PCI_PASID_CAP		0x04    /* PASID feature register */
> +#define  PCI_PASID_CAP_EXEC	0x02	/* Exec permissions Supported */
> +#define  PCI_PASID_CAP_PRIV	0x04	/* Privilege Mode Supported */
> +#define PCI_PASID_CTRL		0x06    /* PASID control register */
> +#define  PCI_PASID_CTRL_ENABLE	0x01	/* Enable bit */
> +#define  PCI_PASID_CTRL_EXEC	0x02	/* Exec permissions Enable */
> +#define  PCI_PASID_CTRL_PRIV	0x04	/* Privilege Mode Enable */
> +#define PCI_EXT_CAP_PASID_SIZEOF	8
>
>   /* Single Root I/O Virtualization */
>   #define PCI_SRIOV_CAP		0x04	/* SR-IOV Capabilities */
> @@ -697,6 +759,7 @@
>   #define  PCI_SRIOV_VFM_MI	0x1	/* Dormant.MigrateIn */
>   #define  PCI_SRIOV_VFM_MO	0x2	/* Active.MigrateOut */
>   #define  PCI_SRIOV_VFM_AV	0x3	/* Active.Available */
> +#define PCI_EXT_CAP_SRIOV_SIZEOF 64
>
>   #define PCI_LTR_MAX_SNOOP_LAT	0x4
>   #define PCI_LTR_MAX_NOSNOOP_LAT	0x6
> @@ -713,7 +776,38 @@
>   #define  PCI_ACS_UF		0x10	/* Upstream Forwarding */
>   #define  PCI_ACS_EC		0x20	/* P2P Egress Control */
>   #define  PCI_ACS_DT		0x40	/* Direct Translated P2P */
> +#define PCI_ACS_EGRESS_BITS	0x05	/* ACS Egress Control Vector Size */
>   #define PCI_ACS_CTRL		0x06	/* ACS Control Register */
>   #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress Control Vector */
>
> +#define PCI_VSEC_HDR		4	/* extended cap - vendor-specific */
> +#define  PCI_VSEC_HDR_LEN_SHIFT	20	/* shift for length field */
> +
> +/* SATA capability */
> +#define PCI_SATA_REGS		4	/* SATA REGs specifier */
> +#define  PCI_SATA_REGS_MASK	0xF	/* location - BAR#/inline */
> +#define  PCI_SATA_REGS_INLINE	0xF	/* REGS in config space */
> +#define PCI_SATA_SIZEOF_SHORT	8
> +#define PCI_SATA_SIZEOF_LONG	16
> +
> +/* Resizable BARs */
> +#define PCI_REBAR_CTRL		8	/* control register */
> +#define  PCI_REBAR_CTRL_NBAR_MASK	(7 << 5)	/* mask for # bars */
> +#define  PCI_REBAR_CTRL_NBAR_SHIFT	5	/* shift for # bars */
> +
> +/* Dynamic Power Allocation */
> +#define PCI_DPA_CAP		4	/* capability register */
> +#define  PCI_DPA_CAP_SUBSTATE_MASK	0x1F	/* # substates - 1 */
> +#define PCI_DPA_BASE_SIZEOF	16	/* size with 0 substates */
> +
> +/* TPH Requester */
> +#define PCI_TPH_CAP		4	/* capability register */
> +#define  PCI_TPH_CAP_LOC_MASK	0x600	/* location mask */
> +#define   PCI_TPH_LOC_NONE	0x000	/* no location */
> +#define   PCI_TPH_LOC_CAP	0x200	/* in capability */
> +#define   PCI_TPH_LOC_MSIX	0x400	/* in MSI-X */
> +#define PCI_TPH_CAP_ST_MASK	0x07FF0000	/* st table mask */
> +#define PCI_TPH_CAP_ST_SHIFT	16	/* st table shift */
> +#define PCI_TPH_BASE_SIZEOF	12	/* size with no st table */
> +
>   #endif /* LINUX_PCI_REGS_H */
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2015-09-17 14:10     ` Knut Omang
@ 2015-10-07 13:45       ` Marcel Apfelbaum
  2015-10-14 12:25       ` Knut Omang
  1 sibling, 0 replies; 16+ messages in thread
From: Marcel Apfelbaum @ 2015-10-07 13:45 UTC (permalink / raw)
  To: Knut Omang, marcel, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On 09/17/2015 05:10 PM, Knut Omang wrote:
> On Thu, 2015-09-17 at 14:49 +0300, Marcel Apfelbaum wrote:
>> On 09/12/2015 03:36 PM, Knut Omang wrote:
>>> This patch provides the building blocks for creating an SR/IOV
>>> PCIe Extended Capability header and register/unregister
>>> SR/IOV Virtual Functions.
>>>
>>> Signed-off-by: Knut Omang <knut.omang@oracle.com>
>>> ---
>>>    hw/pci/Makefile.objs        |   2 +-
>>>    hw/pci/pci.c                |  99 ++++++++++++----
>>>    hw/pci/pcie.c               |   9 +-
>>>    hw/pci/pcie_sriov.c         | 271
>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>    include/hw/pci/pci.h        |  11 +-
>>>    include/hw/pci/pcie.h       |   6 +
>>>    include/hw/pci/pcie_sriov.h |  55 +++++++++
>>>    include/qemu/typedefs.h     |   2 +
>>>    8 files changed, 426 insertions(+), 29 deletions(-)
>>>    create mode 100644 hw/pci/pcie_sriov.c
>>>    create mode 100644 include/hw/pci/pcie_sriov.h
>>>
>>> diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
>>> index 9f905e6..2226980 100644
>>> --- a/hw/pci/Makefile.objs
>>> +++ b/hw/pci/Makefile.objs
>>> @@ -3,7 +3,7 @@ common-obj-$(CONFIG_PCI) += msix.o msi.o
>>>    common-obj-$(CONFIG_PCI) += shpc.o
>>>    common-obj-$(CONFIG_PCI) += slotid_cap.o
>>>    common-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
>>> -common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
>>> +common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
>>> pcie_sriov.o
>>>
>>>    common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
>>>    common-obj-$(CONFIG_ALL) += pci-stub.o
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index a5cc015..9c0eba1 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -153,6 +153,9 @@ int pci_bar(PCIDevice *d, int reg)
>>>    {
>>>        uint8_t type;
>>>
>>> +    /* PCIe virtual functions do not have their own BARs */
>>> +    assert(!pci_is_vf(d));
>>> +
>>>        if (reg != PCI_ROM_SLOT)
>>>            return PCI_BASE_ADDRESS_0 + reg * 4;
>>>
>>> @@ -211,22 +214,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
>>>        }
>>>    }
>>>
>>> -static void pci_do_device_reset(PCIDevice *dev)
>>> +static void pci_reset_regions(PCIDevice *dev)
>>>    {
>>>        int r;
>>> +    if (pci_is_vf(dev)) {
>>> +        return;
>>> +    }
>>>
>>> -    pci_device_deassert_intx(dev);
>>> -    assert(dev->irq_state == 0);
>>> -
>>> -    /* Clear all writable bits */
>>> -    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
>>> -                                 pci_get_word(dev->wmask +
>>> PCI_COMMAND) |
>>> -                                 pci_get_word(dev->w1cmask +
>>> PCI_COMMAND));
>>> -    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
>>> -                                 pci_get_word(dev->wmask +
>>> PCI_STATUS) |
>>> -                                 pci_get_word(dev->w1cmask +
>>> PCI_STATUS));
>>> -    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
>>> -    dev->config[PCI_INTERRUPT_LINE] = 0x0;
>>>        for (r = 0; r < PCI_NUM_REGIONS; ++r) {
>>>            PCIIORegion *region = &dev->io_regions[r];
>>>            if (!region->size) {
>>> @@ -240,6 +234,27 @@ static void pci_do_device_reset(PCIDevice
>>> *dev)
>>>                pci_set_long(dev->config + pci_bar(dev, r), region
>>> ->type);
>>>            }
>>>        }
>>> +}
>>> +
>>> +static void pci_do_device_reset(PCIDevice *dev)
>>> +{
>>> +    qdev_reset_all(&dev->qdev);
>>
>> Hi,
>> Thank you for resubmitting this series!
>>
>> This is only a quick look, I hope I'll have more time next week to go
>> over it again.
>>
>>
>>> +
>>> +    dev->irq_state = 0;
>>
>> Are you sure we need the assignment above? It seems that the
>> irq_state
>> should be modified only by the intx wrappers as
>>   pci_device_deassert_intx and so.
>>
>>
>>> +    pci_update_irq_status(dev);
>>
>> Why do we have to update the irq status on reset?
>
> I struggled a lot with interrupts and PF disapperance in earlier
> versions, this may be unintended leftovers from rebase.
>
> The intention was to avoid the qdev removal which caused problems with
> unintended "hot unplug of the PF" at VF removal earlier, but after some
> of the more recent QOM'ification all those problems disappeared.
>
> I tried without these lines and the do_device_reset refactor and it
> seems to work just fine, will fix that in v5, thanks!
>
>>> +    pci_device_deassert_intx(dev);
>>> +    assert(dev->irq_state == 0);
>>> +
>>> +    /* Clear all writable bits */
>>> +    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
>>> +                                 pci_get_word(dev->wmask +
>>> PCI_COMMAND) |
>>> +                                 pci_get_word(dev->w1cmask +
>>> PCI_COMMAND));
>>> +    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
>>> +                                 pci_get_word(dev->wmask +
>>> PCI_STATUS) |
>>> +                                 pci_get_word(dev->w1cmask +
>>> PCI_STATUS));
>>> +    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
>>> +    dev->config[PCI_INTERRUPT_LINE] = 0x0;
>>> +    pci_reset_regions(dev);
>>>        pci_update_mappings(dev);
>>>
>>>        msi_reset(dev);
>>> @@ -771,6 +786,15 @@ static void pci_init_multifunction(PCIBus
>>> *bus, PCIDevice *dev, Error **errp)
>>>            dev->config[PCI_HEADER_TYPE] |=
>>> PCI_HEADER_TYPE_MULTI_FUNCTION;
>>>        }
>>>
>>> +    /* With SR/IOV and ARI, a device at function 0 need not be a
>>> multifunction
>>> +     * device, as it may just be a VF that ended up with function
>>> 0 in
>>> +     * the legacy PCI interpretation. Avoid failing in such cases:
>>> +     */
>>> +    if (pci_is_vf(dev) &&
>>> +        dev->exp.sriov_vf.pf->cap_present &
>>> QEMU_PCI_CAP_MULTIFUNCTION) {
>>> +        return;
>>> +    }
>>> +
>>>        /*
>>>         * multifunction bit is interpreted in two ways as follows.
>>>         *   - all functions must set the bit to 1.
>>> @@ -962,6 +986,7 @@ void pci_register_bar(PCIDevice *pci_dev, int
>>> region_num,
>>>        uint64_t wmask;
>>>        pcibus_t size = memory_region_size(memory);
>>>
>>> +    assert(!pci_is_vf(pci_dev)); /* VFs must use
>>> pcie_sriov_vf_register_bar */
>>>        assert(region_num >= 0);
>>>        assert(region_num < PCI_NUM_REGIONS);
>>>        if (size & (size-1)) {
>>> @@ -1060,11 +1085,44 @@ pcibus_t pci_get_bar_addr(PCIDevice
>>> *pci_dev, int region_num)
>>>        return pci_dev->io_regions[region_num].addr;
>>>    }
>>>
>>> -static pcibus_t pci_bar_address(PCIDevice *d,
>>> -				int reg, uint8_t type, pcibus_t
>>> size)
>>> +
>>> +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
>>> +                                        uint8_t type, pcibus_t
>>> size)
>>> +{
>>> +    pcibus_t new_addr;
>>> +    if (!pci_is_vf(d)) {
>>> +        int bar = pci_bar(d, reg);
>>> +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
>>> +            new_addr = pci_get_quad(d->config + bar);
>>> +        } else {
>>> +            new_addr = pci_get_long(d->config + bar);
>>> +        }
>>> +    } else {
>>> +        PCIDevice *pf = d->exp.sriov_vf.pf;
>>> +        uint16_t sriov_cap = pf->exp.sriov_cap;
>>> +        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
>>> +        uint16_t vf_offset = pci_get_word(pf->config + sriov_cap +
>>> PCI_SRIOV_VF_OFFSET);
>>> +        uint16_t vf_stride = pci_get_word(pf->config + sriov_cap +
>>> PCI_SRIOV_VF_STRIDE);
>>> +        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) /
>>> vf_stride;
>>> +
>>> +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
>>> +            new_addr = pci_get_quad(pf->config + bar);
>>> +        } else {
>>> +            new_addr = pci_get_long(pf->config + bar);
>>> +        }
>>> +        new_addr += vf_num * size;
>>> +    }
>>> +    if (reg != PCI_ROM_SLOT) {
>>> +        /* Preserve the rom enable bit */
>>> +        new_addr &= ~(size - 1);
>>> +    }
>>> +    return new_addr;
>>> +}
>>> +
>>> +pcibus_t pci_bar_address(PCIDevice *d,
>>> +                         int reg, uint8_t type, pcibus_t size)
>>>    {
>>>        pcibus_t new_addr, last_addr;
>>> -    int bar = pci_bar(d, reg);
>>>        uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
>>>        Object *machine = qdev_get_machine();
>>>        ObjectClass *oc = object_get_class(machine);
>>> @@ -1075,7 +1133,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>>>            if (!(cmd & PCI_COMMAND_IO)) {
>>>                return PCI_BAR_UNMAPPED;
>>>            }
>>> -        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
>>> +        new_addr = pci_config_get_bar_addr(d, reg, type, size);
>>>            last_addr = new_addr + size - 1;
>>>            /* Check if 32 bit BAR wraps around explicitly.
>>>             * TODO: make priorities correct and remove this work
>>> around.
>>> @@ -1090,11 +1148,7 @@ static pcibus_t pci_bar_address(PCIDevice
>>> *d,
>>>        if (!(cmd & PCI_COMMAND_MEMORY)) {
>>>            return PCI_BAR_UNMAPPED;
>>>        }
>>> -    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
>>> -        new_addr = pci_get_quad(d->config + bar);
>>> -    } else {
>>> -        new_addr = pci_get_long(d->config + bar);
>>> -    }
>>> +    new_addr = pci_config_get_bar_addr(d, reg, type, size);
>>>        /* the ROM slot has a specific enable bit */
>>>        if (reg == PCI_ROM_SLOT && !(new_addr &
>>> PCI_ROM_ADDRESS_ENABLE)) {
>>>            return PCI_BAR_UNMAPPED;
>>> @@ -1228,6 +1282,7 @@ void pci_default_write_config(PCIDevice *d,
>>> uint32_t addr, uint32_t val_in, int
>>>
>>>        msi_write_config(d, addr, val_in, l);
>>>        msix_write_config(d, addr, val_in, l);
>>> +    pcie_sriov_config_write(d, addr, val_in, l);
>>>    }
>>>
>>>    /***********************************************************/
>>> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
>>> index 6e28985..ba49c0f 100644
>>> --- a/hw/pci/pcie.c
>>> +++ b/hw/pci/pcie.c
>>> @@ -253,7 +253,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler
>>> *hotplug_dev, DeviceState *dev,
>>>         * Right now, only a device of function = 0 is allowed to be
>>>         * hot plugged/unplugged.
>>>         */
>>> -    assert(PCI_FUNC(pci_dev->devfn) == 0);
>>> +    assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_is_vf(pci_dev));
>>>
>>>        pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
>>>                                   PCI_EXP_SLTSTA_PDS);
>>> @@ -265,10 +265,11 @@ void
>>> pcie_cap_slot_hot_unplug_request_cb(HotplugHandler *hotplug_dev,
>>>                                             DeviceState *dev, Error
>>> **errp)
>>>    {
>>>        uint8_t *exp_cap;
>>> +    PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
>>>
>>> -    pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev,
>>> &exp_cap, errp);
>>> +    pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
>>>
>>> -    pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
>>> +    pcie_cap_slot_push_attention_button(pdev);
>>>    }
>>
>> This chunk is not directly liked to the patch, I would put it in a
>> different patch.
>
> ok
>
>>>   /* pci express slot for pci express root/downstream port
>>> @@ -408,7 +409,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
>>>        }
>>>
>>>        /*
>>> -     * If the slot is polulated, power indicator is off and power
>>> +     * If the slot is populated, power indicator is off and power
>>>         * controller is off, it is safe to detach the devices.
>>>         */
>>>        if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val &
>>> PCI_EXP_SLTCTL_PCC) &&
>>
>>
>> Same here. I am always happy to have this kind of types taken care
>> of,
>> but I think a separate patch would be cleaner.
>
> Would you like it in the patch set as a separate patch or should I
> schedule it for a trivial patches set instead, maybe together with the
>   PCI_DEVICE thing?

Hi,

I think in the same patch set  would be just fine.

Thanks,
Marcel

>
[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Knut Omang
  2015-09-17 11:49   ` Marcel Apfelbaum
@ 2015-10-07 15:06   ` Marcel Apfelbaum
  2015-10-11 13:56     ` Knut Omang
  1 sibling, 1 reply; 16+ messages in thread
From: Marcel Apfelbaum @ 2015-10-07 15:06 UTC (permalink / raw)
  To: Knut Omang, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On 09/12/2015 03:36 PM, Knut Omang wrote:
> This patch provides the building blocks for creating an SR/IOV
> PCIe Extended Capability header and register/unregister
> SR/IOV Virtual Functions.
>
> Signed-off-by: Knut Omang <knut.omang@oracle.com>
> ---
>   hw/pci/Makefile.objs        |   2 +-
>   hw/pci/pci.c                |  99 ++++++++++++----
>   hw/pci/pcie.c               |   9 +-
>   hw/pci/pcie_sriov.c         | 271 ++++++++++++++++++++++++++++++++++++++++++++
>   include/hw/pci/pci.h        |  11 +-
>   include/hw/pci/pcie.h       |   6 +
>   include/hw/pci/pcie_sriov.h |  55 +++++++++
>   include/qemu/typedefs.h     |   2 +
>   8 files changed, 426 insertions(+), 29 deletions(-)
>   create mode 100644 hw/pci/pcie_sriov.c
>   create mode 100644 include/hw/pci/pcie_sriov.h
>
> diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
> index 9f905e6..2226980 100644
> --- a/hw/pci/Makefile.objs
> +++ b/hw/pci/Makefile.objs
> @@ -3,7 +3,7 @@ common-obj-$(CONFIG_PCI) += msix.o msi.o
>   common-obj-$(CONFIG_PCI) += shpc.o
>   common-obj-$(CONFIG_PCI) += slotid_cap.o
>   common-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
> -common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
> +common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o pcie_sriov.o
>
>   common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
>   common-obj-$(CONFIG_ALL) += pci-stub.o
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index a5cc015..9c0eba1 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -153,6 +153,9 @@ int pci_bar(PCIDevice *d, int reg)
>   {
>       uint8_t type;
>
> +    /* PCIe virtual functions do not have their own BARs */
> +    assert(!pci_is_vf(d));
> +
>       if (reg != PCI_ROM_SLOT)
>           return PCI_BASE_ADDRESS_0 + reg * 4;
>
> @@ -211,22 +214,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
>       }
>   }
>
> -static void pci_do_device_reset(PCIDevice *dev)
> +static void pci_reset_regions(PCIDevice *dev)
>   {
>       int r;
> +    if (pci_is_vf(dev)) {
> +        return;
> +    }
>
> -    pci_device_deassert_intx(dev);
> -    assert(dev->irq_state == 0);
> -
> -    /* Clear all writable bits */
> -    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> -                                 pci_get_word(dev->wmask + PCI_COMMAND) |
> -                                 pci_get_word(dev->w1cmask + PCI_COMMAND));
> -    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> -                                 pci_get_word(dev->wmask + PCI_STATUS) |
> -                                 pci_get_word(dev->w1cmask + PCI_STATUS));
> -    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> -    dev->config[PCI_INTERRUPT_LINE] = 0x0;
>       for (r = 0; r < PCI_NUM_REGIONS; ++r) {
>           PCIIORegion *region = &dev->io_regions[r];
>           if (!region->size) {
> @@ -240,6 +234,27 @@ static void pci_do_device_reset(PCIDevice *dev)
>               pci_set_long(dev->config + pci_bar(dev, r), region->type);
>           }
>       }
> +}
> +
> +static void pci_do_device_reset(PCIDevice *dev)
> +{
> +    qdev_reset_all(&dev->qdev);
> +
> +    dev->irq_state = 0;
> +    pci_update_irq_status(dev);
> +    pci_device_deassert_intx(dev);
> +    assert(dev->irq_state == 0);
> +
> +    /* Clear all writable bits */
> +    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> +                                 pci_get_word(dev->wmask + PCI_COMMAND) |
> +                                 pci_get_word(dev->w1cmask + PCI_COMMAND));
> +    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> +                                 pci_get_word(dev->wmask + PCI_STATUS) |
> +                                 pci_get_word(dev->w1cmask + PCI_STATUS));
> +    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> +    dev->config[PCI_INTERRUPT_LINE] = 0x0;
> +    pci_reset_regions(dev);
>       pci_update_mappings(dev);
>
>       msi_reset(dev);
> @@ -771,6 +786,15 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp)
>           dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
>       }
>
> +    /* With SR/IOV and ARI, a device at function 0 need not be a multifunction
> +     * device, as it may just be a VF that ended up with function 0 in
> +     * the legacy PCI interpretation. Avoid failing in such cases:
> +     */
> +    if (pci_is_vf(dev) &&
> +        dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +        return;
> +    }
> +
>       /*
>        * multifunction bit is interpreted in two ways as follows.
>        *   - all functions must set the bit to 1.
> @@ -962,6 +986,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
>       uint64_t wmask;
>       pcibus_t size = memory_region_size(memory);
>
> +    assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */
>       assert(region_num >= 0);
>       assert(region_num < PCI_NUM_REGIONS);
>       if (size & (size-1)) {
> @@ -1060,11 +1085,44 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
>       return pci_dev->io_regions[region_num].addr;
>   }
>
> -static pcibus_t pci_bar_address(PCIDevice *d,
> -				int reg, uint8_t type, pcibus_t size)
> +
> +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> +                                        uint8_t type, pcibus_t size)
> +{
> +    pcibus_t new_addr;
> +    if (!pci_is_vf(d)) {
> +        int bar = pci_bar(d, reg);
> +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +            new_addr = pci_get_quad(d->config + bar);
> +        } else {
> +            new_addr = pci_get_long(d->config + bar);
> +        }
> +    } else {
> +        PCIDevice *pf = d->exp.sriov_vf.pf;
> +        uint16_t sriov_cap = pf->exp.sriov_cap;
> +        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
> +        uint16_t vf_offset = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
> +        uint16_t vf_stride = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
> +        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) / vf_stride;
> +
> +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +            new_addr = pci_get_quad(pf->config + bar);
> +        } else {
> +            new_addr = pci_get_long(pf->config + bar);
> +        }
> +        new_addr += vf_num * size;
> +    }
> +    if (reg != PCI_ROM_SLOT) {
> +        /* Preserve the rom enable bit */
> +        new_addr &= ~(size - 1);
> +    }
> +    return new_addr;
> +}
> +
> +pcibus_t pci_bar_address(PCIDevice *d,
> +                         int reg, uint8_t type, pcibus_t size)
>   {
>       pcibus_t new_addr, last_addr;
> -    int bar = pci_bar(d, reg);
>       uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
>       Object *machine = qdev_get_machine();
>       ObjectClass *oc = object_get_class(machine);
> @@ -1075,7 +1133,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>           if (!(cmd & PCI_COMMAND_IO)) {
>               return PCI_BAR_UNMAPPED;
>           }
> -        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> +        new_addr = pci_config_get_bar_addr(d, reg, type, size);
>           last_addr = new_addr + size - 1;
>           /* Check if 32 bit BAR wraps around explicitly.
>            * TODO: make priorities correct and remove this work around.
> @@ -1090,11 +1148,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>       if (!(cmd & PCI_COMMAND_MEMORY)) {
>           return PCI_BAR_UNMAPPED;
>       }
> -    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> -        new_addr = pci_get_quad(d->config + bar);
> -    } else {
> -        new_addr = pci_get_long(d->config + bar);
> -    }
> +    new_addr = pci_config_get_bar_addr(d, reg, type, size);
>       /* the ROM slot has a specific enable bit */
>       if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
>           return PCI_BAR_UNMAPPED;
> @@ -1228,6 +1282,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
>
>       msi_write_config(d, addr, val_in, l);
>       msix_write_config(d, addr, val_in, l);
> +    pcie_sriov_config_write(d, addr, val_in, l);
>   }
>
>   /***********************************************************/
> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> index 6e28985..ba49c0f 100644
> --- a/hw/pci/pcie.c
> +++ b/hw/pci/pcie.c
> @@ -253,7 +253,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
>        * Right now, only a device of function = 0 is allowed to be
>        * hot plugged/unplugged.
>        */
> -    assert(PCI_FUNC(pci_dev->devfn) == 0);
> +    assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_is_vf(pci_dev));
>
>       pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
>                                  PCI_EXP_SLTSTA_PDS);
> @@ -265,10 +265,11 @@ void pcie_cap_slot_hot_unplug_request_cb(HotplugHandler *hotplug_dev,
>                                            DeviceState *dev, Error **errp)
>   {
>       uint8_t *exp_cap;
> +    PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
>
> -    pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev, &exp_cap, errp);
> +    pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
>
> -    pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
> +    pcie_cap_slot_push_attention_button(pdev);
>   }
>
>   /* pci express slot for pci express root/downstream port
> @@ -408,7 +409,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
>       }
>
>       /*
> -     * If the slot is polulated, power indicator is off and power
> +     * If the slot is populated, power indicator is off and power
>        * controller is off, it is safe to detach the devices.
>        */
>       if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> new file mode 100644
> index 0000000..14f31cc
> --- /dev/null
> +++ b/hw/pci/pcie_sriov.c
> @@ -0,0 +1,271 @@
> +/*
> + * pcie_sriov.c:
> + *
> + * Implementation of SR/IOV emulation support.
> + *
> + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "hw/pci/pci.h"
> +#include "hw/pci/pcie.h"
> +#include "hw/pci/pci_bus.h"
> +#include "qemu/range.h"
> +
> +//#define DEBUG_SRIOV
> +#ifdef DEBUG_SRIOV
> +# define SRIOV_DPRINTF(fmt, ...)                                         \
> +    fprintf(stderr, "%s:%d " fmt, __func__, __LINE__, ## __VA_ARGS__)
> +#else
> +# define SRIOV_DPRINTF(fmt, ...) do {} while (0)
> +#endif
> +#define SRIOV_DEV_PRINTF(dev, fmt, ...)                                  \
> +    SRIOV_DPRINTF("%s:%x "fmt, (dev)->name, (dev)->devfn, ## __VA_ARGS__)
> +
> +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name);
> +static void unregister_vfs(PCIDevice *dev);
> +
> +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> +                        const char *vfname, uint16_t vf_dev_id,
> +                        uint16_t init_vfs, uint16_t total_vfs,
> +                        uint16_t vf_offset, uint16_t vf_stride)
> +{
> +    uint8_t *cfg = dev->config + offset;
> +    uint8_t *wmask;
> +
> +    pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
> +                        offset, PCI_EXT_CAP_SRIOV_SIZEOF);
> +    dev->exp.sriov_cap = offset;
> +    dev->exp.sriov_pf.num_vfs = 0;
> +    dev->exp.sriov_pf.vfname = g_strdup(vfname);
> +    dev->exp.sriov_pf.vf = NULL;
> +
> +    pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> +    pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> +
> +    /* Minimal page size support values required by the SR/IOV standard:
> +     * 0x553 << 12 = 0x553000 = 4K + 8K + 64K + 256K + 1M + 4M
> +     * Hard coded for simplicity in this version
> +     */
> +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, 0x553);
> +
> +    /* Default is to use 4K pages, software can modify it
> +     * to any of the supported bits
> +     */
> +    pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
> +
> +    /* Set up device ID and initial/total number of VFs available */
> +    pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
> +    pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
> +    pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
> +    pci_set_word(cfg + PCI_SRIOV_NUM_VF, 0);
> +
> +    /* Write enable control bits */
> +    wmask = dev->wmask + offset;
> +    pci_set_word(wmask + PCI_SRIOV_CTRL,
> +                 PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
> +    pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
> +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
> +
> +    qdev_prop_set_bit(&dev->qdev, "multifunction", true);
> +}
> +
> +void pcie_sriov_pf_exit(PCIDevice *dev)
> +{
> +    SRIOV_DPRINTF("\n");
> +    unregister_vfs(dev);
> +    g_free((char *)dev->exp.sriov_pf.vfname);
> +    dev->exp.sriov_pf.vfname = NULL;
> +}
> +
> +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> +                               uint8_t type, dma_addr_t size)
> +{
> +    uint32_t addr;
> +    uint64_t wmask;
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +
> +    assert(sriov_cap > 0);
> +    assert(region_num >= 0);
> +    assert(region_num < PCI_NUM_REGIONS);
> +    assert(region_num != PCI_ROM_SLOT);
> +
> +    wmask = ~(size - 1);
> +    addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
> +
> +    pci_set_long(dev->config + addr, type);
> +    if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
> +        type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +        pci_set_quad(dev->wmask + addr, wmask);
> +        pci_set_quad(dev->cmask + addr, ~0ULL);
> +    } else {
> +        pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
> +        pci_set_long(dev->cmask + addr, 0xffffffff);
> +    }
> +    dev->exp.sriov_pf.vf_bar_type[region_num] = type;
> +}
> +
> +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> +                                MemoryRegion *memory)
> +{
> +    PCIIORegion *r;
> +    uint8_t type;
> +    pcibus_t size = memory_region_size(memory);
> +
> +    assert(pci_is_vf(dev)); /* PFs must use pci_register_bar */
> +    assert(region_num >= 0);
> +    assert(region_num < PCI_NUM_REGIONS);
> +    type = dev->exp.sriov_vf.pf->exp.sriov_pf.vf_bar_type[region_num];
> +
> +    if (size & (size-1)) {
> +        fprintf(stderr, "ERROR: PCI region size must be a power"
> +                " of two - type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size);
> +        exit(1);
> +    }
> +
> +    r = &dev->io_regions[region_num];
> +    r->memory = memory;
> +    r->address_space =
> +        type & PCI_BASE_ADDRESS_SPACE_IO
> +        ? dev->bus->address_space_io
> +        : dev->bus->address_space_mem;
> +    r->size = size;
> +    r->type = type;
> +
> +    r->addr = pci_bar_address(dev, region_num, r->type, r->size);
> +    if (r->addr != PCI_BAR_UNMAPPED) {
> +        memory_region_add_subregion_overlap(r->address_space,
> +                                            r->addr, r->memory, 1);
> +    }
> +}
> +
> +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name)
> +{
> +    PCIDevice *dev = pci_create(pf->bus, devfn, name);
> +    dev->exp.sriov_vf.pf = pf;
> +    Error *local_err = NULL;
> +
> +    object_property_set_bool(OBJECT(&dev->qdev), true, "realized", &local_err);
> +    if (local_err) {
> +        fprintf(stderr, "Failed to enable: %s\n",
> +                error_get_pretty(local_err));
> +        error_free(local_err);
> +        return NULL;
> +    }
> +
> +    /* set vid/did according to sr/iov spec - they are not used */
> +    pci_config_set_vendor_id(dev->config, 0xffff);
> +    pci_config_set_device_id(dev->config, 0xffff);
> +    return dev;
> +}
> +
> +static void register_vfs(PCIDevice *dev)
> +{
> +    uint16_t num_vfs;
> +    uint16_t i;
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +    uint16_t vf_offset = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
> +    uint16_t vf_stride = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
> +    int32_t devfn = dev->devfn + vf_offset;
> +
> +    assert(sriov_cap > 0);
> +    num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> +
> +    dev->exp.sriov_pf.vf = g_malloc(sizeof(PCIDevice *) * num_vfs);
> +    assert(dev->exp.sriov_pf.vf);
> +
> +    SRIOV_DEV_PRINTF(dev, "creating %d vf devs\n", num_vfs);
> +    for (i = 0; i < num_vfs; i++) {
> +        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev->exp.sriov_pf.vfname);
> +        if (!dev->exp.sriov_pf.vf[i]) {
> +            SRIOV_DEV_PRINTF(dev, "Failed to create VF %d at devfn %x.%x\n", i,
> +                             PCI_SLOT(devfn), PCI_FUNC(devfn));
> +            num_vfs = i;
> +            break;
> +        }
> +        devfn += vf_stride;
> +    }
> +    dev->exp.sriov_pf.num_vfs = num_vfs;
> +}
> +
> +static void unregister_vfs(PCIDevice *dev)
> +{
> +    Error *local_err = NULL;
> +    uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
> +    uint16_t i;
> +    SRIOV_DEV_PRINTF(dev, "Unregistering %d vf devs\n", num_vfs);
> +    for (i = 0; i < num_vfs; i++) {
> +        object_property_set_bool(OBJECT(&dev->exp.sriov_pf.vf[i]->qdev), false, "realized", &local_err);
> +        if (local_err) {
> +            fprintf(stderr, "Failed to unplug: %s\n",
> +                    error_get_pretty(local_err));
> +            error_free(local_err);
> +        }
> +    }
> +    g_free(dev->exp.sriov_pf.vf);
> +    dev->exp.sriov_pf.vf = NULL;
> +    dev->exp.sriov_pf.num_vfs = 0;
> +}
> +
> +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address, uint32_t val, int len)
> +{
> +    uint32_t off;
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +
> +    if (!sriov_cap || address < sriov_cap) {
> +        return;
> +    }
> +    off = address - sriov_cap;
> +    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> +        return;
> +    }
> +
> +    SRIOV_DEV_PRINTF(dev, "cap at 0x%x sriov offset 0x%x val 0x%x len %d\n",
> +                 sriov_cap, off, val, len);
> +
> +    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> +        if (dev->exp.sriov_pf.num_vfs) {
> +            if (!(val & PCI_SRIOV_CTRL_VFE)) {
> +                unregister_vfs(dev);
> +            }
> +        } else {
> +            if (val & PCI_SRIOV_CTRL_VFE) {
> +                register_vfs(dev);
> +            }
> +        }
> +    }
> +}
> +
> +
> +/* Reset SR/IOV VF Enable bit to trigger an unregister of all VFs */
> +void pcie_sriov_pf_disable_vfs(PCIDevice *dev)
> +{
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +    if (sriov_cap) {
> +        uint32_t val = pci_get_byte(dev->config + sriov_cap + PCI_SRIOV_CTRL);
> +        if (val & PCI_SRIOV_CTRL_VFE) {
> +            val &= ~PCI_SRIOV_CTRL_VFE;
> +            pcie_sriov_config_write(dev, sriov_cap + PCI_SRIOV_CTRL, val, 1);
> +        }
> +    }
> +}
> +
> +/* Add optional supported page sizes to the mask of supported page sizes */
> +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize)
> +{
> +    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
> +    uint8_t *wmask = dev->wmask + dev->exp.sriov_cap;
> +
> +    uint16_t sup_pgsize = pci_get_word(cfg + PCI_SRIOV_SUP_PGSIZE);
> +
> +    sup_pgsize |= opt_sup_pgsize;
> +
> +    /* Make sure the new bits are set, and that system page size
> +     * also can be set to any of the new values according to spec:
> +     */
> +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, sup_pgsize);
> +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, sup_pgsize);
> +}
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 551cb3d..2e9d8ba 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -11,8 +11,6 @@
>   /* PCI includes legacy ISA access.  */
>   #include "hw/isa/isa.h"
>
> -#include "hw/pci/pcie.h"
> -
>   /* PCI bus */
>
>   #define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) | ((func) & 0x07))
> @@ -132,6 +130,7 @@ enum {
>   #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
>
>   #include "hw/pci/pci_regs.h"
> +#include "hw/pci/pcie.h"
>
>   /* PCI HEADER_TYPE */
>   #define  PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
> @@ -421,6 +420,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>
> +pcibus_t pci_bar_address(PCIDevice *d,
> +                         int reg, uint8_t type, pcibus_t size);
> +
>   static inline void
>   pci_set_byte(uint8_t *config, uint8_t val)
>   {
> @@ -672,6 +674,11 @@ static inline int pci_is_express(const PCIDevice *d)
>       return d->cap_present & QEMU_PCI_CAP_EXPRESS;
>   }
>
> +static inline int pci_is_vf(const PCIDevice *d)
> +{
> +    return d->exp.sriov_vf.pf != NULL;
> +}
> +
>   static inline uint32_t pci_config_size(const PCIDevice *d)
>   {
>       return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE : PCI_CONFIG_SPACE_SIZE;
> diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> index b48a7a2..b09f79a 100644
> --- a/include/hw/pci/pcie.h
> +++ b/include/hw/pci/pcie.h
> @@ -25,6 +25,7 @@
>   #include "hw/pci/pci_regs.h"
>   #include "hw/pci/pcie_regs.h"
>   #include "hw/pci/pcie_aer.h"
> +#include "hw/pci/pcie_sriov.h"
>   #include "hw/hotplug.h"
>
>   typedef enum {
> @@ -74,6 +75,11 @@ struct PCIExpressDevice {
>       /* AER */
>       uint16_t aer_cap;
>       PCIEAERLog aer_log;
> +
> +    /* SR/IOV */
> +    uint16_t sriov_cap;
> +    PCIESriovPF sriov_pf;
> +    PCIESriovVF sriov_vf;
>   };
>
>   #define COMPAT_PROP_PCP "power_controller_present"
> diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> new file mode 100644
> index 0000000..061f5cf
> --- /dev/null
> +++ b/include/hw/pci/pcie_sriov.h
> @@ -0,0 +1,55 @@
> +/*
> + * pcie_sriov.h:
> + *
> + * Implementation of SR/IOV emulation support.
> + *
> + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef QEMU_PCIE_SRIOV_H
> +#define QEMU_PCIE_SRIOV_H
> +
> +
> +
> +struct PCIESriovPF {
> +    uint16_t num_vfs;           /* Number of virtual functions created */
> +    uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each VF bar */
> +    const char *vfname;         /* Reference to the device type used for the VFs */
> +    PCIDevice **vf;             /* Pointer to an array of num_vfs VF devices */
> +};
> +
> +struct PCIESriovVF {
> +    PCIDevice *pf;              /* Pointer back to owner physical function */
> +};
> +
> +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> +                        const char *vfname, uint16_t vf_dev_id,
> +                        uint16_t init_vfs, uint16_t total_vfs,
> +                        uint16_t vf_offset, uint16_t vf_stride);
> +void pcie_sriov_pf_exit(PCIDevice *dev);
> +
> +/* Set up a VF bar in the SR/IOV bar area */
> +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> +                               uint8_t type, dma_addr_t size);
> +
> +/* Instantiate a bar for a VF */
> +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> +                                MemoryRegion *memory);
> +
> +/* Add optional supported page sizes to the mask of supported page sizes
> + * Page size values are interpreted as opt_sup_pgsize << 12.
> + */
> +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize);
> +
> +/* SR/IOV capability config write handler */
> +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> +                             uint32_t val, int len);
> +
> +/* Reset SR/IOV VF Enable bit to unregister all VFs */
> +void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
> +
> +#endif /* QEMU_PCIE_SRIOV_H */
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index f8a9dd6..fa36b8c 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -53,6 +53,8 @@ typedef struct PCIDevice PCIDevice;
>   typedef struct PCIEAERErr PCIEAERErr;
>   typedef struct PCIEAERLog PCIEAERLog;o
>   typedef struct PCIEAERMsg PCIEAERMsg;
> +typedef struct PCIESriovPF PCIESriovPF;
> +typedef struct PCIESriovVF PCIESriovVF;
>   typedef struct PCIEPort PCIEPort;
>   typedef struct PCIESlot PCIESlot;
>   typedef struct PCIExpressDevice PCIExpressDevice;
>

I reviewed the patch and it looks OK to me (besides the other comments in the thread).
Since we don't have code using it yet and the tests are only an RFC, I suggest adding
to the series a document file explaining how to use it (template code from the RFC)
and the current limitations/issues.

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/3] pci: Update pci_regs header
  2015-10-07 13:32   ` Marcel Apfelbaum
@ 2015-10-10 10:46     ` Knut Omang
  0 siblings, 0 replies; 16+ messages in thread
From: Knut Omang @ 2015-10-10 10:46 UTC (permalink / raw)
  To: marcel, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On Wed, 2015-10-07 at 16:32 +0300, Marcel Apfelbaum wrote:
> On 09/12/2015 03:36 PM, Knut Omang wrote:
> > Merge in new definitions from kernel v4.2
> > Adds definition necessary to support emulated SR/IOV.
> 
> In the meantime Paolo updated this header to 4.3-RC1, is this patch
> still needed?
> 
> Thanks,
> Marcel

Quite right - it is gone after rebase.

Thanks,
Knut

> > 
> > Signed-off-by: Knut Omang <knut.omang@oracle.com>
> > ---
> >   include/standard-headers/linux/pci_regs.h | 142
> > +++++++++++++++++++++++++-----
> >   1 file changed, 118 insertions(+), 24 deletions(-)
> > 
> > diff --git a/include/standard-headers/linux/pci_regs.h
> > b/include/standard-headers/linux/pci_regs.h
> > index 57e8c80..3d56c1e 100644
> > --- a/include/standard-headers/linux/pci_regs.h
> > +++ b/include/standard-headers/linux/pci_regs.h
> > @@ -304,13 +304,14 @@
> >   #define PCI_MSI_PENDING_64	20	/* Pending bits
> > register for 32-bit devices */
> > 
> >   /* MSI-X registers */
> > -#define PCI_MSIX_FLAGS		2
> > -#define  PCI_MSIX_FLAGS_QSIZE	0x7FF
> > -#define  PCI_MSIX_FLAGS_ENABLE	(1 << 15)
> > -#define  PCI_MSIX_FLAGS_MASKALL	(1 << 14)
> > -#define PCI_MSIX_TABLE		4
> > -#define PCI_MSIX_PBA		8
> > +#define PCI_MSIX_FLAGS		2	/* Message Control
> > */
> > +#define  PCI_MSIX_FLAGS_QSIZE	0x7FF	/* Table size */
> > +#define  PCI_MSIX_FLAGS_ENABLE	(1 << 15) /* MSI-X enable */
> > +#define  PCI_MSIX_FLAGS_MASKALL	(1 << 14) /* Mask all
> > vectors for this function */
> > +#define PCI_MSIX_TABLE		4	/* Table offset */
> > +#define PCI_MSIX_PBA		8	/* Pending Bit Array
> > offset */
> >   #define  PCI_MSIX_FLAGS_BIRMASK	(7 << 0)
> > +#define PCI_CAP_MSIX_SIZEOF	12	/* size of MSIX
> > capability structure */
> > 
> >   /* MSI-X entry's format */
> >   #define PCI_MSIX_ENTRY_SIZE		16
> > @@ -517,31 +518,63 @@
> >   #define  PCI_EXP_OBFF_MSG	0x40000 /* New message signaling
> > */
> >   #define  PCI_EXP_OBFF_WAKE	0x80000 /* Re-use WAKE# for
> > OBFF */
> >   #define PCI_EXP_DEVCTL2		40	/* Device
> > Control 2 */
> > -#define  PCI_EXP_DEVCTL2_ARI	0x20	/* Alternative
> > Routing-ID */
> > -#define  PCI_EXP_IDO_REQ_EN	0x100	/* ID-based
> > ordering request enable */
> > -#define  PCI_EXP_IDO_CMP_EN	0x200	/* ID-based
> > ordering completion enable */
> > -#define  PCI_EXP_LTR_EN		0x400	/* Latency
> > tolerance reporting */
> > -#define  PCI_EXP_OBFF_MSGA_EN	0x2000	/* OBFF enable
> > with Message type A */
> > -#define  PCI_EXP_OBFF_MSGB_EN	0x4000	/* OBFF enable
> > with Message type B */
> > -#define  PCI_EXP_OBFF_WAKE_EN	0x6000	/* OBFF using
> > WAKE# signaling */
> > +#define  PCI_EXP_DEVCTL2_COMP_TIMEOUT	0x000f	/*
> > Completion Timeout Value */
> > +#define  PCI_EXP_DEVCTL2_ARI		0x0020	/*
> > Alternative Routing-ID */
> > +#define  PCI_EXP_DEVCTL2_IDO_REQ_EN	0x0100	/* Allow
> > IDO for requests */
> > +#define  PCI_EXP_DEVCTL2_IDO_CMP_EN	0x0200	/* Allow
> > IDO for completions */
> > +#define  PCI_EXP_DEVCTL2_LTR_EN		0x0400	/*
> > Enable LTR mechanism */
> > +#define  PCI_EXP_DEVCTL2_OBFF_MSGA_EN	0x2000	/*
> > Enable OBFF Message type A */
> > +#define  PCI_EXP_DEVCTL2_OBFF_MSGB_EN	0x4000	/*
> > Enable OBFF Message type B */
> > +#define  PCI_EXP_DEVCTL2_OBFF_WAKE_EN	0x6000	/* OBFF
> > using WAKE# signaling */
> > +#define PCI_EXP_DEVSTA2		42	/* Device Status
> > 2 */
> > +#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2	44	/* v2
> > endpoints end here */
> > +#define PCI_EXP_LNKCAP2		44	/* Link
> > Capabilities 2 */
> > +#define  PCI_EXP_LNKCAP2_SLS_2_5GB	0x00000002 /* Supported
> > Speed 2.5GT/s */
> > +#define  PCI_EXP_LNKCAP2_SLS_5_0GB	0x00000004 /* Supported
> > Speed 5.0GT/s */
> > +#define  PCI_EXP_LNKCAP2_SLS_8_0GB	0x00000008 /* Supported
> > Speed 8.0GT/s */
> > +#define  PCI_EXP_LNKCAP2_CROSSLINK	0x00000100 /* Crosslink
> > supported */
> >   #define PCI_EXP_LNKCTL2		48	/* Link Control
> > 2 */
> > -#define PCI_EXP_SLTCTL2		56	/* Slot Control 2
> > */
> > +#define PCI_EXP_LNKSTA2		50	/* Link Status 2
> > */
> > +#define PCI_EXP_SLTCAP2		52	/* Slot
> > Capabilities 2 */
> > +#define PCI_EXP_SLTCTL2 	56	/* Slot Control 2 */
> > +#define PCI_EXP_SLTSTA2		58	/* Slot Status 2
> > */
> > 
> >   /* Extended Capabilities (PCI-X 2.0 and Express) */
> >   #define PCI_EXT_CAP_ID(header)		(header &
> > 0x0000ffff)
> >   #define PCI_EXT_CAP_VER(header)		((header >> 16) &
> > 0xf)
> >   #define PCI_EXT_CAP_NEXT(header)	((header >> 20) & 0xffc)
> > 
> > -#define PCI_EXT_CAP_ID_ERR	1
> > -#define PCI_EXT_CAP_ID_VC	2
> > -#define PCI_EXT_CAP_ID_DSN	3
> > -#define PCI_EXT_CAP_ID_PWR	4
> > -#define PCI_EXT_CAP_ID_VNDR	11
> > -#define PCI_EXT_CAP_ID_ACS	13
> > -#define PCI_EXT_CAP_ID_ARI	14
> > -#define PCI_EXT_CAP_ID_ATS	15
> > -#define PCI_EXT_CAP_ID_SRIOV	16
> > -#define PCI_EXT_CAP_ID_LTR	24
> > +#define PCI_EXT_CAP_ID_ERR	0x01	/* Advanced Error
> > Reporting */
> > +#define PCI_EXT_CAP_ID_VC	0x02	/* Virtual Channel
> > Capability */
> > +#define PCI_EXT_CAP_ID_DSN	0x03	/* Device Serial
> > Number */
> > +#define PCI_EXT_CAP_ID_PWR	0x04	/* Power Budgeting
> > */
> > +#define PCI_EXT_CAP_ID_RCLD	0x05	/* Root Complex
> > Link Declaration */
> > +#define PCI_EXT_CAP_ID_RCILC	0x06	/* Root Complex
> > Internal Link Control */
> > +#define PCI_EXT_CAP_ID_RCEC	0x07	/* Root Complex
> > Event Collector */
> > +#define PCI_EXT_CAP_ID_MFVC	0x08	/* Multi-Function
> > VC Capability */
> > +#define PCI_EXT_CAP_ID_VC9	0x09	/* same as _VC */
> > +#define PCI_EXT_CAP_ID_RCRB	0x0A	/* Root Complex RB?
> > */
> > +#define PCI_EXT_CAP_ID_VNDR	0x0B	/* Vendor-Specific
> > */
> > +#define PCI_EXT_CAP_ID_CAC	0x0C	/* Config Access -
> > obsolete */
> > +#define PCI_EXT_CAP_ID_ACS	0x0D	/* Access Control
> > Services */
> > +#define PCI_EXT_CAP_ID_ARI	0x0E	/* Alternate Routing
> > ID */
> > +#define PCI_EXT_CAP_ID_ATS	0x0F	/* Address
> > Translation Services */
> > +#define PCI_EXT_CAP_ID_SRIOV	0x10	/* Single Root I/O
> > Virtualization */
> > +#define PCI_EXT_CAP_ID_MRIOV	0x11	/* Multi Root I/O
> > Virtualization */
> > +#define PCI_EXT_CAP_ID_MCAST	0x12	/* Multicast */
> > +#define PCI_EXT_CAP_ID_PRI	0x13	/* Page Request
> > Interface */
> > +#define PCI_EXT_CAP_ID_AMD_XXX	0x14	/* Reserved for
> > AMD */
> > +#define PCI_EXT_CAP_ID_REBAR	0x15	/* Resizable BAR
> > */
> > +#define PCI_EXT_CAP_ID_DPA	0x16	/* Dynamic Power
> > Allocation */
> > +#define PCI_EXT_CAP_ID_TPH	0x17	/* TPH Requester */
> > +#define PCI_EXT_CAP_ID_LTR	0x18	/* Latency Tolerance
> > Reporting */
> > +#define PCI_EXT_CAP_ID_SECPCI	0x19	/* Secondary PCIe
> > Capability */
> > +#define PCI_EXT_CAP_ID_PMUX	0x1A	/* Protocol
> > Multiplexing */
> > +#define PCI_EXT_CAP_ID_PASID	0x1B	/* Process Address
> > Space ID */
> > +#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PASID
> > +
> > +#define PCI_EXT_CAP_DSN_SIZEOF	12
> > +#define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
> > 
> >   /* Advanced Error Reporting */
> >   #define PCI_ERR_UNCOR_STATUS	4	/* Uncorrectable
> > Error Status */
> > @@ -556,6 +589,11 @@
> >   #define  PCI_ERR_UNC_MALF_TLP	0x00040000	/*
> > Malformed TLP */
> >   #define  PCI_ERR_UNC_ECRC	0x00080000	/* ECRC Error
> > Status */
> >   #define  PCI_ERR_UNC_UNSUP	0x00100000	/*
> > Unsupported Request */
> > +#define  PCI_ERR_UNC_ACSV	0x00200000	/* ACS
> > Violation */
> > +#define  PCI_ERR_UNC_INTN	0x00400000	/* internal
> > error */
> > +#define  PCI_ERR_UNC_MCBTLP	0x00800000	/* MC blocked
> > TLP */
> > +#define  PCI_ERR_UNC_ATOMEG	0x01000000	/* Atomic
> > egress blocked */
> > +#define  PCI_ERR_UNC_TLPPRE	0x02000000	/* TLP prefix
> > blocked */
> >   #define PCI_ERR_UNCOR_MASK	8	/* Uncorrectable Error
> > Mask */
> >   	/* Same bits as above */
> >   #define PCI_ERR_UNCOR_SEVER	12	/* Uncorrectable
> > Error Severity */
> > @@ -657,6 +695,7 @@
> >   #define  PCI_ARI_CTRL_MFVC	0x0001	/* MFVC Function
> > Groups Enable */
> >   #define  PCI_ARI_CTRL_ACS	0x0002	/* ACS Function
> > Groups Enable */
> >   #define  PCI_ARI_CTRL_FG(x)	(((x) >> 4) & 7) /* Function
> > Group */
> > +#define PCI_EXT_CAP_ARI_SIZEOF	8
> > 
> >   /* Address Translation Service */
> >   #define PCI_ATS_CAP		0x04	/* ATS Capability
> > Register */
> > @@ -666,6 +705,29 @@
> >   #define  PCI_ATS_CTRL_ENABLE	0x8000	/* ATS Enable
> > */
> >   #define  PCI_ATS_CTRL_STU(x)	((x) & 0x1f)	/*
> > Smallest Translation Unit */
> >   #define  PCI_ATS_MIN_STU	12	/* shift of minimum STU
> > block */
> > +#define PCI_EXT_CAP_ATS_SIZEOF	8
> > +
> > +/* Page Request Interface */
> > +#define PCI_PRI_CTRL		0x04	/* PRI control
> > register */
> > +#define  PCI_PRI_CTRL_ENABLE	0x01	/* Enable */
> > +#define  PCI_PRI_CTRL_RESET	0x02	/* Reset */
> > +#define PCI_PRI_STATUS		0x06	/* PRI status
> > register */
> > +#define  PCI_PRI_STATUS_RF	0x001	/* Response Failure
> > */
> > +#define  PCI_PRI_STATUS_UPRGI	0x002	/* Unexpected
> > PRG index */
> > +#define  PCI_PRI_STATUS_STOPPED	0x100	/* PRI Stopped
> > */
> > +#define PCI_PRI_MAX_REQ		0x08	/* PRI max reqs
> > supported */
> > +#define PCI_PRI_ALLOC_REQ	0x0c	/* PRI max reqs
> > allowed */
> > +#define PCI_EXT_CAP_PRI_SIZEOF	16
> > +
> > +/* Process Address Space ID */
> > +#define PCI_PASID_CAP		0x04    /* PASID feature
> > register */
> > +#define  PCI_PASID_CAP_EXEC	0x02	/* Exec permissions
> > Supported */
> > +#define  PCI_PASID_CAP_PRIV	0x04	/* Privilege Mode
> > Supported */
> > +#define PCI_PASID_CTRL		0x06    /* PASID control
> > register */
> > +#define  PCI_PASID_CTRL_ENABLE	0x01	/* Enable bit */
> > +#define  PCI_PASID_CTRL_EXEC	0x02	/* Exec
> > permissions Enable */
> > +#define  PCI_PASID_CTRL_PRIV	0x04	/* Privilege Mode
> > Enable */
> > +#define PCI_EXT_CAP_PASID_SIZEOF	8
> > 
> >   /* Single Root I/O Virtualization */
> >   #define PCI_SRIOV_CAP		0x04	/* SR-IOV
> > Capabilities */
> > @@ -697,6 +759,7 @@
> >   #define  PCI_SRIOV_VFM_MI	0x1	/* Dormant.MigrateIn
> > */
> >   #define  PCI_SRIOV_VFM_MO	0x2	/* Active.MigrateOut
> > */
> >   #define  PCI_SRIOV_VFM_AV	0x3	/* Active.Available
> > */
> > +#define PCI_EXT_CAP_SRIOV_SIZEOF 64
> > 
> >   #define PCI_LTR_MAX_SNOOP_LAT	0x4
> >   #define PCI_LTR_MAX_NOSNOOP_LAT	0x6
> > @@ -713,7 +776,38 @@
> >   #define  PCI_ACS_UF		0x10	/* Upstream
> > Forwarding */
> >   #define  PCI_ACS_EC		0x20	/* P2P Egress
> > Control */
> >   #define  PCI_ACS_DT		0x40	/* Direct
> > Translated P2P */
> > +#define PCI_ACS_EGRESS_BITS	0x05	/* ACS Egress
> > Control Vector Size */
> >   #define PCI_ACS_CTRL		0x06	/* ACS Control
> > Register */
> >   #define PCI_ACS_EGRESS_CTL_V	0x08	/* ACS Egress
> > Control Vector */
> > 
> > +#define PCI_VSEC_HDR		4	/* extended cap -
> > vendor-specific */
> > +#define  PCI_VSEC_HDR_LEN_SHIFT	20	/* shift for
> > length field */
> > +
> > +/* SATA capability */
> > +#define PCI_SATA_REGS		4	/* SATA REGs
> > specifier */
> > +#define  PCI_SATA_REGS_MASK	0xF	/* location -
> > BAR#/inline */
> > +#define  PCI_SATA_REGS_INLINE	0xF	/* REGS in config
> > space */
> > +#define PCI_SATA_SIZEOF_SHORT	8
> > +#define PCI_SATA_SIZEOF_LONG	16
> > +
> > +/* Resizable BARs */
> > +#define PCI_REBAR_CTRL		8	/* control register
> > */
> > +#define  PCI_REBAR_CTRL_NBAR_MASK	(7 << 5)	/* mask
> > for # bars */
> > +#define  PCI_REBAR_CTRL_NBAR_SHIFT	5	/* shift for #
> > bars */
> > +
> > +/* Dynamic Power Allocation */
> > +#define PCI_DPA_CAP		4	/* capability register
> > */
> > +#define  PCI_DPA_CAP_SUBSTATE_MASK	0x1F	/* #
> > substates - 1 */
> > +#define PCI_DPA_BASE_SIZEOF	16	/* size with 0
> > substates */
> > +
> > +/* TPH Requester */
> > +#define PCI_TPH_CAP		4	/* capability register
> > */
> > +#define  PCI_TPH_CAP_LOC_MASK	0x600	/* location mask
> > */
> > +#define   PCI_TPH_LOC_NONE	0x000	/* no location */
> > +#define   PCI_TPH_LOC_CAP	0x200	/* in capability */
> > +#define   PCI_TPH_LOC_MSIX	0x400	/* in MSI-X */
> > +#define PCI_TPH_CAP_ST_MASK	0x07FF0000	/* st table
> > mask */
> > +#define PCI_TPH_CAP_ST_SHIFT	16	/* st table shift */
> > +#define PCI_TPH_BASE_SIZEOF	12	/* size with no st
> > table */
> > +
> >   #endif /* LINUX_PCI_REGS_H */
> > 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2015-10-07 15:06   ` Marcel Apfelbaum
@ 2015-10-11 13:56     ` Knut Omang
  0 siblings, 0 replies; 16+ messages in thread
From: Knut Omang @ 2015-10-11 13:56 UTC (permalink / raw)
  To: marcel, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On Wed, 2015-10-07 at 18:06 +0300, Marcel Apfelbaum wrote:
> On 09/12/2015 03:36 PM, Knut Omang wrote:
> > This patch provides the building blocks for creating an SR/IOV
> > PCIe Extended Capability header and register/unregister
> > SR/IOV Virtual Functions.
> > 
> > Signed-off-by: Knut Omang <knut.omang@oracle.com>
> > ---
> >   hw/pci/Makefile.objs        |   2 +-
> >   hw/pci/pci.c                |  99 ++++++++++++----
> >   hw/pci/pcie.c               |   9 +-
> >   hw/pci/pcie_sriov.c         | 271
> > ++++++++++++++++++++++++++++++++++++++++++++
> >   include/hw/pci/pci.h        |  11 +-
> >   include/hw/pci/pcie.h       |   6 +
> >   include/hw/pci/pcie_sriov.h |  55 +++++++++
> >   include/qemu/typedefs.h     |   2 +
> >   8 files changed, 426 insertions(+), 29 deletions(-)
> >   create mode 100644 hw/pci/pcie_sriov.c
> >   create mode 100644 include/hw/pci/pcie_sriov.h
> > 
> > diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
> > index 9f905e6..2226980 100644
> > --- a/hw/pci/Makefile.objs
> > +++ b/hw/pci/Makefile.objs
> > @@ -3,7 +3,7 @@ common-obj-$(CONFIG_PCI) += msix.o msi.o
> >   common-obj-$(CONFIG_PCI) += shpc.o
> >   common-obj-$(CONFIG_PCI) += slotid_cap.o
> >   common-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
> > -common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
> > +common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
> > pcie_sriov.o
> > 
> >   common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
> >   common-obj-$(CONFIG_ALL) += pci-stub.o
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index a5cc015..9c0eba1 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -153,6 +153,9 @@ int pci_bar(PCIDevice *d, int reg)
> >   {
> >       uint8_t type;
> > 
> > +    /* PCIe virtual functions do not have their own BARs */
> > +    assert(!pci_is_vf(d));
> > +
> >       if (reg != PCI_ROM_SLOT)
> >           return PCI_BASE_ADDRESS_0 + reg * 4;
> > 
> > @@ -211,22 +214,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
> >       }
> >   }
> > 
> > -static void pci_do_device_reset(PCIDevice *dev)
> > +static void pci_reset_regions(PCIDevice *dev)
> >   {
> >       int r;
> > +    if (pci_is_vf(dev)) {
> > +        return;
> > +    }
> > 
> > -    pci_device_deassert_intx(dev);
> > -    assert(dev->irq_state == 0);
> > -
> > -    /* Clear all writable bits */
> > -    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> > -                                 pci_get_word(dev->wmask +
> > PCI_COMMAND) |
> > -                                 pci_get_word(dev->w1cmask +
> > PCI_COMMAND));
> > -    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> > -                                 pci_get_word(dev->wmask +
> > PCI_STATUS) |
> > -                                 pci_get_word(dev->w1cmask +
> > PCI_STATUS));
> > -    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> > -    dev->config[PCI_INTERRUPT_LINE] = 0x0;
> >       for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> >           PCIIORegion *region = &dev->io_regions[r];
> >           if (!region->size) {
> > @@ -240,6 +234,27 @@ static void pci_do_device_reset(PCIDevice
> > *dev)
> >               pci_set_long(dev->config + pci_bar(dev, r), region
> > ->type);
> >           }
> >       }
> > +}
> > +
> > +static void pci_do_device_reset(PCIDevice *dev)
> > +{
> > +    qdev_reset_all(&dev->qdev);
> > +
> > +    dev->irq_state = 0;
> > +    pci_update_irq_status(dev);
> > +    pci_device_deassert_intx(dev);
> > +    assert(dev->irq_state == 0);
> > +
> > +    /* Clear all writable bits */
> > +    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> > +                                 pci_get_word(dev->wmask +
> > PCI_COMMAND) |
> > +                                 pci_get_word(dev->w1cmask +
> > PCI_COMMAND));
> > +    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> > +                                 pci_get_word(dev->wmask +
> > PCI_STATUS) |
> > +                                 pci_get_word(dev->w1cmask +
> > PCI_STATUS));
> > +    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> > +    dev->config[PCI_INTERRUPT_LINE] = 0x0;
> > +    pci_reset_regions(dev);
> >       pci_update_mappings(dev);
> > 
> >       msi_reset(dev);
> > @@ -771,6 +786,15 @@ static void pci_init_multifunction(PCIBus
> > *bus, PCIDevice *dev, Error **errp)
> >           dev->config[PCI_HEADER_TYPE] |=
> > PCI_HEADER_TYPE_MULTI_FUNCTION;
> >       }
> > 
> > +    /* With SR/IOV and ARI, a device at function 0 need not be a
> > multifunction
> > +     * device, as it may just be a VF that ended up with function
> > 0 in
> > +     * the legacy PCI interpretation. Avoid failing in such cases:
> > +     */
> > +    if (pci_is_vf(dev) &&
> > +        dev->exp.sriov_vf.pf->cap_present &
> > QEMU_PCI_CAP_MULTIFUNCTION) {
> > +        return;
> > +    }
> > +
> >       /*
> >        * multifunction bit is interpreted in two ways as follows.
> >        *   - all functions must set the bit to 1.
> > @@ -962,6 +986,7 @@ void pci_register_bar(PCIDevice *pci_dev, int
> > region_num,
> >       uint64_t wmask;
> >       pcibus_t size = memory_region_size(memory);
> > 
> > +    assert(!pci_is_vf(pci_dev)); /* VFs must use
> > pcie_sriov_vf_register_bar */
> >       assert(region_num >= 0);
> >       assert(region_num < PCI_NUM_REGIONS);
> >       if (size & (size-1)) {
> > @@ -1060,11 +1085,44 @@ pcibus_t pci_get_bar_addr(PCIDevice
> > *pci_dev, int region_num)
> >       return pci_dev->io_regions[region_num].addr;
> >   }
> > 
> > -static pcibus_t pci_bar_address(PCIDevice *d,
> > -				int reg, uint8_t type, pcibus_t
> > size)
> > +
> > +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> > +                                        uint8_t type, pcibus_t
> > size)
> > +{
> > +    pcibus_t new_addr;
> > +    if (!pci_is_vf(d)) {
> > +        int bar = pci_bar(d, reg);
> > +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > +            new_addr = pci_get_quad(d->config + bar);
> > +        } else {
> > +            new_addr = pci_get_long(d->config + bar);
> > +        }
> > +    } else {
> > +        PCIDevice *pf = d->exp.sriov_vf.pf;
> > +        uint16_t sriov_cap = pf->exp.sriov_cap;
> > +        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
> > +        uint16_t vf_offset = pci_get_word(pf->config + sriov_cap +
> > PCI_SRIOV_VF_OFFSET);
> > +        uint16_t vf_stride = pci_get_word(pf->config + sriov_cap +
> > PCI_SRIOV_VF_STRIDE);
> > +        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) /
> > vf_stride;
> > +
> > +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > +            new_addr = pci_get_quad(pf->config + bar);
> > +        } else {
> > +            new_addr = pci_get_long(pf->config + bar);
> > +        }
> > +        new_addr += vf_num * size;
> > +    }
> > +    if (reg != PCI_ROM_SLOT) {
> > +        /* Preserve the rom enable bit */
> > +        new_addr &= ~(size - 1);
> > +    }
> > +    return new_addr;
> > +}
> > +
> > +pcibus_t pci_bar_address(PCIDevice *d,
> > +                         int reg, uint8_t type, pcibus_t size)
> >   {
> >       pcibus_t new_addr, last_addr;
> > -    int bar = pci_bar(d, reg);
> >       uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
> >       Object *machine = qdev_get_machine();
> >       ObjectClass *oc = object_get_class(machine);
> > @@ -1075,7 +1133,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> >           if (!(cmd & PCI_COMMAND_IO)) {
> >               return PCI_BAR_UNMAPPED;
> >           }
> > -        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> > +        new_addr = pci_config_get_bar_addr(d, reg, type, size);
> >           last_addr = new_addr + size - 1;
> >           /* Check if 32 bit BAR wraps around explicitly.
> >            * TODO: make priorities correct and remove this work
> > around.
> > @@ -1090,11 +1148,7 @@ static pcibus_t pci_bar_address(PCIDevice
> > *d,
> >       if (!(cmd & PCI_COMMAND_MEMORY)) {
> >           return PCI_BAR_UNMAPPED;
> >       }
> > -    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > -        new_addr = pci_get_quad(d->config + bar);
> > -    } else {
> > -        new_addr = pci_get_long(d->config + bar);
> > -    }
> > +    new_addr = pci_config_get_bar_addr(d, reg, type, size);
> >       /* the ROM slot has a specific enable bit */
> >       if (reg == PCI_ROM_SLOT && !(new_addr &
> > PCI_ROM_ADDRESS_ENABLE)) {
> >           return PCI_BAR_UNMAPPED;
> > @@ -1228,6 +1282,7 @@ void pci_default_write_config(PCIDevice *d,
> > uint32_t addr, uint32_t val_in, int
> > 
> >       msi_write_config(d, addr, val_in, l);
> >       msix_write_config(d, addr, val_in, l);
> > +    pcie_sriov_config_write(d, addr, val_in, l);
> >   }
> > 
> >   /***********************************************************/
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 6e28985..ba49c0f 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -253,7 +253,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler
> > *hotplug_dev, DeviceState *dev,
> >        * Right now, only a device of function = 0 is allowed to be
> >        * hot plugged/unplugged.
> >        */
> > -    assert(PCI_FUNC(pci_dev->devfn) == 0);
> > +    assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_is_vf(pci_dev));
> > 
> >       pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
> >                                  PCI_EXP_SLTSTA_PDS);
> > @@ -265,10 +265,11 @@ void
> > pcie_cap_slot_hot_unplug_request_cb(HotplugHandler *hotplug_dev,
> >                                            DeviceState *dev, Error
> > **errp)
> >   {
> >       uint8_t *exp_cap;
> > +    PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
> > 
> > -    pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev,
> > &exp_cap, errp);
> > +    pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
> > 
> > -    pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
> > +    pcie_cap_slot_push_attention_button(pdev);
> >   }
> > 
> >   /* pci express slot for pci express root/downstream port
> > @@ -408,7 +409,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
> >       }
> > 
> >       /*
> > -     * If the slot is polulated, power indicator is off and power
> > +     * If the slot is populated, power indicator is off and power
> >        * controller is off, it is safe to detach the devices.
> >        */
> >       if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val &
> > PCI_EXP_SLTCTL_PCC) &&
> > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > new file mode 100644
> > index 0000000..14f31cc
> > --- /dev/null
> > +++ b/hw/pci/pcie_sriov.c
> > @@ -0,0 +1,271 @@
> > +/*
> > + * pcie_sriov.c:
> > + *
> > + * Implementation of SR/IOV emulation support.
> > + *
> > + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2
> > or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include "hw/pci/pci.h"
> > +#include "hw/pci/pcie.h"
> > +#include "hw/pci/pci_bus.h"
> > +#include "qemu/range.h"
> > +
> > +//#define DEBUG_SRIOV
> > +#ifdef DEBUG_SRIOV
> > +# define SRIOV_DPRINTF(fmt, ...)                                  
> >        \
> > +    fprintf(stderr, "%s:%d " fmt, __func__, __LINE__, ##
> > __VA_ARGS__)
> > +#else
> > +# define SRIOV_DPRINTF(fmt, ...) do {} while (0)
> > +#endif
> > +#define SRIOV_DEV_PRINTF(dev, fmt, ...)                           
> >        \
> > +    SRIOV_DPRINTF("%s:%x "fmt, (dev)->name, (dev)->devfn, ##
> > __VA_ARGS__)
> > +
> > +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char
> > *name);
> > +static void unregister_vfs(PCIDevice *dev);
> > +
> > +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > +                        const char *vfname, uint16_t vf_dev_id,
> > +                        uint16_t init_vfs, uint16_t total_vfs,
> > +                        uint16_t vf_offset, uint16_t vf_stride)
> > +{
> > +    uint8_t *cfg = dev->config + offset;
> > +    uint8_t *wmask;
> > +
> > +    pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
> > +                        offset, PCI_EXT_CAP_SRIOV_SIZEOF);
> > +    dev->exp.sriov_cap = offset;
> > +    dev->exp.sriov_pf.num_vfs = 0;
> > +    dev->exp.sriov_pf.vfname = g_strdup(vfname);
> > +    dev->exp.sriov_pf.vf = NULL;
> > +
> > +    pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> > +    pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> > +
> > +    /* Minimal page size support values required by the SR/IOV
> > standard:
> > +     * 0x553 << 12 = 0x553000 = 4K + 8K + 64K + 256K + 1M + 4M
> > +     * Hard coded for simplicity in this version
> > +     */
> > +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, 0x553);
> > +
> > +    /* Default is to use 4K pages, software can modify it
> > +     * to any of the supported bits
> > +     */
> > +    pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
> > +
> > +    /* Set up device ID and initial/total number of VFs available
> > */
> > +    pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
> > +    pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
> > +    pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
> > +    pci_set_word(cfg + PCI_SRIOV_NUM_VF, 0);
> > +
> > +    /* Write enable control bits */
> > +    wmask = dev->wmask + offset;
> > +    pci_set_word(wmask + PCI_SRIOV_CTRL,
> > +                 PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE |
> > PCI_SRIOV_CTRL_ARI);
> > +    pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
> > +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
> > +
> > +    qdev_prop_set_bit(&dev->qdev, "multifunction", true);
> > +}
> > +
> > +void pcie_sriov_pf_exit(PCIDevice *dev)
> > +{
> > +    SRIOV_DPRINTF("\n");
> > +    unregister_vfs(dev);
> > +    g_free((char *)dev->exp.sriov_pf.vfname);
> > +    dev->exp.sriov_pf.vfname = NULL;
> > +}
> > +
> > +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> > +                               uint8_t type, dma_addr_t size)
> > +{
> > +    uint32_t addr;
> > +    uint64_t wmask;
> > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > +
> > +    assert(sriov_cap > 0);
> > +    assert(region_num >= 0);
> > +    assert(region_num < PCI_NUM_REGIONS);
> > +    assert(region_num != PCI_ROM_SLOT);
> > +
> > +    wmask = ~(size - 1);
> > +    addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
> > +
> > +    pci_set_long(dev->config + addr, type);
> > +    if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
> > +        type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > +        pci_set_quad(dev->wmask + addr, wmask);
> > +        pci_set_quad(dev->cmask + addr, ~0ULL);
> > +    } else {
> > +        pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
> > +        pci_set_long(dev->cmask + addr, 0xffffffff);
> > +    }
> > +    dev->exp.sriov_pf.vf_bar_type[region_num] = type;
> > +}
> > +
> > +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> > +                                MemoryRegion *memory)
> > +{
> > +    PCIIORegion *r;
> > +    uint8_t type;
> > +    pcibus_t size = memory_region_size(memory);
> > +
> > +    assert(pci_is_vf(dev)); /* PFs must use pci_register_bar */
> > +    assert(region_num >= 0);
> > +    assert(region_num < PCI_NUM_REGIONS);
> > +    type = dev->exp.sriov_vf.pf
> > ->exp.sriov_pf.vf_bar_type[region_num];
> > +
> > +    if (size & (size-1)) {
> > +        fprintf(stderr, "ERROR: PCI region size must be a power"
> > +                " of two - type=0x%x, size=0x%"FMT_PCIBUS"\n",
> > type, size);
> > +        exit(1);
> > +    }
> > +
> > +    r = &dev->io_regions[region_num];
> > +    r->memory = memory;
> > +    r->address_space =
> > +        type & PCI_BASE_ADDRESS_SPACE_IO
> > +        ? dev->bus->address_space_io
> > +        : dev->bus->address_space_mem;
> > +    r->size = size;
> > +    r->type = type;
> > +
> > +    r->addr = pci_bar_address(dev, region_num, r->type, r->size);
> > +    if (r->addr != PCI_BAR_UNMAPPED) {
> > +        memory_region_add_subregion_overlap(r->address_space,
> > +                                            r->addr, r->memory,
> > 1);
> > +    }
> > +}
> > +
> > +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char
> > *name)
> > +{
> > +    PCIDevice *dev = pci_create(pf->bus, devfn, name);
> > +    dev->exp.sriov_vf.pf = pf;
> > +    Error *local_err = NULL;
> > +
> > +    object_property_set_bool(OBJECT(&dev->qdev), true, "realized",
> > &local_err);
> > +    if (local_err) {
> > +        fprintf(stderr, "Failed to enable: %s\n",
> > +                error_get_pretty(local_err));
> > +        error_free(local_err);
> > +        return NULL;
> > +    }
> > +
> > +    /* set vid/did according to sr/iov spec - they are not used */
> > +    pci_config_set_vendor_id(dev->config, 0xffff);
> > +    pci_config_set_device_id(dev->config, 0xffff);
> > +    return dev;
> > +}
> > +
> > +static void register_vfs(PCIDevice *dev)
> > +{
> > +    uint16_t num_vfs;
> > +    uint16_t i;
> > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > +    uint16_t vf_offset = pci_get_word(dev->config + sriov_cap +
> > PCI_SRIOV_VF_OFFSET);
> > +    uint16_t vf_stride = pci_get_word(dev->config + sriov_cap +
> > PCI_SRIOV_VF_STRIDE);
> > +    int32_t devfn = dev->devfn + vf_offset;
> > +
> > +    assert(sriov_cap > 0);
> > +    num_vfs = pci_get_word(dev->config + sriov_cap +
> > PCI_SRIOV_NUM_VF);
> > +
> > +    dev->exp.sriov_pf.vf = g_malloc(sizeof(PCIDevice *) *
> > num_vfs);
> > +    assert(dev->exp.sriov_pf.vf);
> > +
> > +    SRIOV_DEV_PRINTF(dev, "creating %d vf devs\n", num_vfs);
> > +    for (i = 0; i < num_vfs; i++) {
> > +        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev
> > ->exp.sriov_pf.vfname);
> > +        if (!dev->exp.sriov_pf.vf[i]) {
> > +            SRIOV_DEV_PRINTF(dev, "Failed to create VF %d at devfn
> > %x.%x\n", i,
> > +                             PCI_SLOT(devfn), PCI_FUNC(devfn));
> > +            num_vfs = i;
> > +            break;
> > +        }
> > +        devfn += vf_stride;
> > +    }
> > +    dev->exp.sriov_pf.num_vfs = num_vfs;
> > +}
> > +
> > +static void unregister_vfs(PCIDevice *dev)
> > +{
> > +    Error *local_err = NULL;
> > +    uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
> > +    uint16_t i;
> > +    SRIOV_DEV_PRINTF(dev, "Unregistering %d vf devs\n", num_vfs);
> > +    for (i = 0; i < num_vfs; i++) {
> > +        object_property_set_bool(OBJECT(&dev->exp.sriov_pf.vf[i]
> > ->qdev), false, "realized", &local_err);
> > +        if (local_err) {
> > +            fprintf(stderr, "Failed to unplug: %s\n",
> > +                    error_get_pretty(local_err));
> > +            error_free(local_err);
> > +        }
> > +    }
> > +    g_free(dev->exp.sriov_pf.vf);
> > +    dev->exp.sriov_pf.vf = NULL;
> > +    dev->exp.sriov_pf.num_vfs = 0;
> > +}
> > +
> > +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> > uint32_t val, int len)
> > +{
> > +    uint32_t off;
> > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > +
> > +    if (!sriov_cap || address < sriov_cap) {
> > +        return;
> > +    }
> > +    off = address - sriov_cap;
> > +    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> > +        return;
> > +    }
> > +
> > +    SRIOV_DEV_PRINTF(dev, "cap at 0x%x sriov offset 0x%x val 0x%x
> > len %d\n",
> > +                 sriov_cap, off, val, len);
> > +
> > +    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> > +        if (dev->exp.sriov_pf.num_vfs) {
> > +            if (!(val & PCI_SRIOV_CTRL_VFE)) {
> > +                unregister_vfs(dev);
> > +            }
> > +        } else {
> > +            if (val & PCI_SRIOV_CTRL_VFE) {
> > +                register_vfs(dev);
> > +            }
> > +        }
> > +    }
> > +}
> > +
> > +
> > +/* Reset SR/IOV VF Enable bit to trigger an unregister of all VFs
> > */
> > +void pcie_sriov_pf_disable_vfs(PCIDevice *dev)
> > +{
> > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > +    if (sriov_cap) {
> > +        uint32_t val = pci_get_byte(dev->config + sriov_cap +
> > PCI_SRIOV_CTRL);
> > +        if (val & PCI_SRIOV_CTRL_VFE) {
> > +            val &= ~PCI_SRIOV_CTRL_VFE;
> > +            pcie_sriov_config_write(dev, sriov_cap +
> > PCI_SRIOV_CTRL, val, 1);
> > +        }
> > +    }
> > +}
> > +
> > +/* Add optional supported page sizes to the mask of supported page
> > sizes */
> > +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t
> > opt_sup_pgsize)
> > +{
> > +    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
> > +    uint8_t *wmask = dev->wmask + dev->exp.sriov_cap;
> > +
> > +    uint16_t sup_pgsize = pci_get_word(cfg +
> > PCI_SRIOV_SUP_PGSIZE);
> > +
> > +    sup_pgsize |= opt_sup_pgsize;
> > +
> > +    /* Make sure the new bits are set, and that system page size
> > +     * also can be set to any of the new values according to spec:
> > +     */
> > +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, sup_pgsize);
> > +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, sup_pgsize);
> > +}
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index 551cb3d..2e9d8ba 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -11,8 +11,6 @@
> >   /* PCI includes legacy ISA access.  */
> >   #include "hw/isa/isa.h"
> > 
> > -#include "hw/pci/pcie.h"
> > -
> >   /* PCI bus */
> > 
> >   #define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) | ((func)
> > & 0x07))
> > @@ -132,6 +130,7 @@ enum {
> >   #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
> > 
> >   #include "hw/pci/pci_regs.h"
> > +#include "hw/pci/pcie.h"
> > 
> >   /* PCI HEADER_TYPE */
> >   #define  PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
> > @@ -421,6 +420,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *,
> > void *, int);
> >   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> >   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
> > 
> > +pcibus_t pci_bar_address(PCIDevice *d,
> > +                         int reg, uint8_t type, pcibus_t size);
> > +
> >   static inline void
> >   pci_set_byte(uint8_t *config, uint8_t val)
> >   {
> > @@ -672,6 +674,11 @@ static inline int pci_is_express(const
> > PCIDevice *d)
> >       return d->cap_present & QEMU_PCI_CAP_EXPRESS;
> >   }
> > 
> > +static inline int pci_is_vf(const PCIDevice *d)
> > +{
> > +    return d->exp.sriov_vf.pf != NULL;
> > +}
> > +
> >   static inline uint32_t pci_config_size(const PCIDevice *d)
> >   {
> >       return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE :
> > PCI_CONFIG_SPACE_SIZE;
> > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > index b48a7a2..b09f79a 100644
> > --- a/include/hw/pci/pcie.h
> > +++ b/include/hw/pci/pcie.h
> > @@ -25,6 +25,7 @@
> >   #include "hw/pci/pci_regs.h"
> >   #include "hw/pci/pcie_regs.h"
> >   #include "hw/pci/pcie_aer.h"
> > +#include "hw/pci/pcie_sriov.h"
> >   #include "hw/hotplug.h"
> > 
> >   typedef enum {
> > @@ -74,6 +75,11 @@ struct PCIExpressDevice {
> >       /* AER */
> >       uint16_t aer_cap;
> >       PCIEAERLog aer_log;
> > +
> > +    /* SR/IOV */
> > +    uint16_t sriov_cap;
> > +    PCIESriovPF sriov_pf;
> > +    PCIESriovVF sriov_vf;
> >   };
> > 
> >   #define COMPAT_PROP_PCP "power_controller_present"
> > diff --git a/include/hw/pci/pcie_sriov.h
> > b/include/hw/pci/pcie_sriov.h
> > new file mode 100644
> > index 0000000..061f5cf
> > --- /dev/null
> > +++ b/include/hw/pci/pcie_sriov.h
> > @@ -0,0 +1,55 @@
> > +/*
> > + * pcie_sriov.h:
> > + *
> > + * Implementation of SR/IOV emulation support.
> > + *
> > + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2
> > or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#ifndef QEMU_PCIE_SRIOV_H
> > +#define QEMU_PCIE_SRIOV_H
> > +
> > +
> > +
> > +struct PCIESriovPF {
> > +    uint16_t num_vfs;           /* Number of virtual functions
> > created */
> > +    uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for each
> > VF bar */
> > +    const char *vfname;         /* Reference to the device type
> > used for the VFs */
> > +    PCIDevice **vf;             /* Pointer to an array of num_vfs
> > VF devices */
> > +};
> > +
> > +struct PCIESriovVF {
> > +    PCIDevice *pf;              /* Pointer back to owner physical
> > function */
> > +};
> > +
> > +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > +                        const char *vfname, uint16_t vf_dev_id,
> > +                        uint16_t init_vfs, uint16_t total_vfs,
> > +                        uint16_t vf_offset, uint16_t vf_stride);
> > +void pcie_sriov_pf_exit(PCIDevice *dev);
> > +
> > +/* Set up a VF bar in the SR/IOV bar area */
> > +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> > +                               uint8_t type, dma_addr_t size);
> > +
> > +/* Instantiate a bar for a VF */
> > +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> > +                                MemoryRegion *memory);
> > +
> > +/* Add optional supported page sizes to the mask of supported page
> > sizes
> > + * Page size values are interpreted as opt_sup_pgsize << 12.
> > + */
> > +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t
> > opt_sup_pgsize);
> > +
> > +/* SR/IOV capability config write handler */
> > +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> > +                             uint32_t val, int len);
> > +
> > +/* Reset SR/IOV VF Enable bit to unregister all VFs */
> > +void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
> > +
> > +#endif /* QEMU_PCIE_SRIOV_H */
> > diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> > index f8a9dd6..fa36b8c 100644
> > --- a/include/qemu/typedefs.h
> > +++ b/include/qemu/typedefs.h
> > @@ -53,6 +53,8 @@ typedef struct PCIDevice PCIDevice;
> >   typedef struct PCIEAERErr PCIEAERErr;
> >   typedef struct PCIEAERLog PCIEAERLog;o
> >   typedef struct PCIEAERMsg PCIEAERMsg;
> > +typedef struct PCIESriovPF PCIESriovPF;
> > +typedef struct PCIESriovVF PCIESriovVF;
> >   typedef struct PCIEPort PCIEPort;
> >   typedef struct PCIESlot PCIESlot;
> >   typedef struct PCIExpressDevice PCIExpressDevice;
> > 
> 
> I reviewed the patch and it looks OK to me (besides the other
> comments in the thread).
> Since we don't have code using it yet and the tests are only an RFC,
> I suggest adding
> to the series a document file explaining how to use it (template code
> from the RFC)
> and the current limitations/issues.

Good idea, the 

docs/pci_expander_bridge.txt

file seems like a good template,

Thanks

Knut

> Thanks,
> Marcel
> 
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices
  2015-09-17 12:41       ` Marcel Apfelbaum
@ 2015-10-14 10:27         ` Knut Omang
  0 siblings, 0 replies; 16+ messages in thread
From: Knut Omang @ 2015-10-14 10:27 UTC (permalink / raw)
  To: Marcel Apfelbaum, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On Thu, 2015-09-17 at 15:41 +0300, Marcel Apfelbaum wrote:
> On 09/17/2015 03:12 PM, Knut Omang wrote:
> > On Thu, 2015-09-17 at 14:11 +0300, Marcel Apfelbaum wrote:
> > > On 09/12/2015 03:36 PM, Knut Omang wrote:
> > > > Without this, the devfn argument to pci_create_*()
> > > > does not affect the assigned devfn.
> > > > 
> > > > Needed to support (VF_STRIDE,VF_OFFSET) values other than (1,1)
> > > > for SR/IOV.
> > > > 
> > > > Signed-off-by: Knut Omang <knut.omang@oracle.com>
> > > > ---
> > > >    hw/pci/pci.c | 2 +-
> > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > > index ccea628..a5cc015 100644
> > > > --- a/hw/pci/pci.c
> > > > +++ b/hw/pci/pci.c
> > > > @@ -1840,7 +1840,7 @@ static void pci_qdev_realize(DeviceState
> > > > *qdev, Error **errp)
> > > >        bus = PCI_BUS(qdev_get_parent_bus(qdev));
> > > >        pci_dev = do_pci_register_device(pci_dev, bus,
> > > > 
> > > >   object_get_typename(OBJECT(qdev)),
> > > > -                                     pci_dev->devfn, errp);
> > > > +
> > > >   object_property_get_int(OBJECT(qdev), "addr", NULL), errp);
> > > Hi,
> > > 
> > > I don't get this, using object_property_get_int on "addr" should
> > > return the value of pci_dev->devfn,
> > > can you please tell what exactly is not working?
> > 
> > The problem is that at that point pci_dev->devfn has not been set
> > yet -
> > have commented on this before somewhere..
> > 
> 
> But "addr" property has the right value? Is indeed strange because it
> should
> get the value from pci_dev->devfn.

In the current version, in pci_qdev_realize() 
the devfn argument to do_pci_register_device() is set to 
pci_dev->devfn. Then inside do_pci_register_device that devfn argument
is again used to set pci_dev->devfn.

When the SR/IOV code tries to add a second device (the first VF) at 
(devfn + offset + stride) to the bus, it fails because devfn is 0 and
that hits the PF in the devices[] array.

> Don't get me wrong, this patch is OK.
> I just want to understand if we have a hidden bug somewhere.

Yes, havent dived into the details for a while but it probably works
because the pci_dev->devfn argument is often -1 (don't care) and then
the code will just search for an available function number.

> Anyway,
> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>

Thanks,

Knut

> > Knut
> > 
> > > Thanks,
> > > Marcel
> > > 
> > > 
> > > 
> > > >        if (pci_dev == NULL)
> > > >            return;
> > > > 
> > > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2015-09-17 14:10     ` Knut Omang
  2015-10-07 13:45       ` Marcel Apfelbaum
@ 2015-10-14 12:25       ` Knut Omang
  1 sibling, 0 replies; 16+ messages in thread
From: Knut Omang @ 2015-10-14 12:25 UTC (permalink / raw)
  To: marcel, qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard W.M. Jones,
	Alex Williamson, Gonglei (Arei),
	Jan Kiszka, Paolo Bonzini, Dotan Barak, Richard Henderson

On Thu, 2015-09-17 at 16:10 +0200, Knut Omang wrote:
> On Thu, 2015-09-17 at 14:49 +0300, Marcel Apfelbaum wrote:
> > On 09/12/2015 03:36 PM, Knut Omang wrote:
> > > This patch provides the building blocks for creating an SR/IOV
> > > PCIe Extended Capability header and register/unregister
> > > SR/IOV Virtual Functions.
> > > 
> > > Signed-off-by: Knut Omang <knut.omang@oracle.com>
> > > ---
> > >   hw/pci/Makefile.objs        |   2 +-
> > >   hw/pci/pci.c                |  99 ++++++++++++----
> > >   hw/pci/pcie.c               |   9 +-
> > >   hw/pci/pcie_sriov.c         | 271
> > > ++++++++++++++++++++++++++++++++++++++++++++
> > >   include/hw/pci/pci.h        |  11 +-
> > >   include/hw/pci/pcie.h       |   6 +
> > >   include/hw/pci/pcie_sriov.h |  55 +++++++++
> > >   include/qemu/typedefs.h     |   2 +
> > >   8 files changed, 426 insertions(+), 29 deletions(-)
> > >   create mode 100644 hw/pci/pcie_sriov.c
> > >   create mode 100644 include/hw/pci/pcie_sriov.h
> > > 
> > > diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
> > > index 9f905e6..2226980 100644
> > > --- a/hw/pci/Makefile.objs
> > > +++ b/hw/pci/Makefile.objs
> > > @@ -3,7 +3,7 @@ common-obj-$(CONFIG_PCI) += msix.o msi.o
> > >   common-obj-$(CONFIG_PCI) += shpc.o
> > >   common-obj-$(CONFIG_PCI) += slotid_cap.o
> > >   common-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
> > > -common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
> > > +common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
> > > pcie_sriov.o
> > > 
> > >   common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
> > >   common-obj-$(CONFIG_ALL) += pci-stub.o
> > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > index a5cc015..9c0eba1 100644
> > > --- a/hw/pci/pci.c
> > > +++ b/hw/pci/pci.c
> > > @@ -153,6 +153,9 @@ int pci_bar(PCIDevice *d, int reg)
> > >   {
> > >       uint8_t type;
> > > 
> > > +    /* PCIe virtual functions do not have their own BARs */
> > > +    assert(!pci_is_vf(d));
> > > +
> > >       if (reg != PCI_ROM_SLOT)
> > >           return PCI_BASE_ADDRESS_0 + reg * 4;
> > > 
> > > @@ -211,22 +214,13 @@ void pci_device_deassert_intx(PCIDevice
> > > *dev)
> > >       }
> > >   }
> > > 
> > > -static void pci_do_device_reset(PCIDevice *dev)
> > > +static void pci_reset_regions(PCIDevice *dev)
> > >   {
> > >       int r;
> > > +    if (pci_is_vf(dev)) {
> > > +        return;
> > > +    }
> > > 
> > > -    pci_device_deassert_intx(dev);
> > > -    assert(dev->irq_state == 0);
> > > -
> > > -    /* Clear all writable bits */
> > > -    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> > > -                                 pci_get_word(dev->wmask +
> > > PCI_COMMAND) |
> > > -                                 pci_get_word(dev->w1cmask +
> > > PCI_COMMAND));
> > > -    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> > > -                                 pci_get_word(dev->wmask +
> > > PCI_STATUS) |
> > > -                                 pci_get_word(dev->w1cmask +
> > > PCI_STATUS));
> > > -    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> > > -    dev->config[PCI_INTERRUPT_LINE] = 0x0;
> > >       for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> > >           PCIIORegion *region = &dev->io_regions[r];
> > >           if (!region->size) {
> > > @@ -240,6 +234,27 @@ static void pci_do_device_reset(PCIDevice
> > > *dev)
> > >               pci_set_long(dev->config + pci_bar(dev, r), region
> > > ->type);
> > >           }
> > >       }
> > > +}
> > > +
> > > +static void pci_do_device_reset(PCIDevice *dev)
> > > +{
> > > +    qdev_reset_all(&dev->qdev);
> > 
> > Hi,
> > Thank you for resubmitting this series!
> > 
> > This is only a quick look, I hope I'll have more time next week to
> > go
> > over it again.
> > 
> > 
> > > +
> > > +    dev->irq_state = 0;
> > 
> > Are you sure we need the assignment above? It seems that the
> > irq_state
> > should be modified only by the intx wrappers as 
> >  pci_device_deassert_intx and so.
> > 
> > 
> > > +    pci_update_irq_status(dev);
> > 
> > Why do we have to update the irq status on reset?
> 
> I struggled a lot with interrupts and PF disapperance in earlier
> versions, this may be unintended leftovers from rebase.
> 
> The intention was to avoid the qdev removal which caused problems
> with
> unintended "hot unplug of the PF" at VF removal earlier, but after
> some
> of the more recent QOM'ification all those problems disappeared.
> 
> I tried without these lines and the do_device_reset refactor and it
> seems to work just fine, will fix that in v5, thanks!
> 
> > > +    pci_device_deassert_intx(dev);
> > > +    assert(dev->irq_state == 0);
> > > +
> > > +    /* Clear all writable bits */
> > > +    pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
> > > +                                 pci_get_word(dev->wmask +
> > > PCI_COMMAND) |
> > > +                                 pci_get_word(dev->w1cmask +
> > > PCI_COMMAND));
> > > +    pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
> > > +                                 pci_get_word(dev->wmask +
> > > PCI_STATUS) |
> > > +                                 pci_get_word(dev->w1cmask +
> > > PCI_STATUS));
> > > +    dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> > > +    dev->config[PCI_INTERRUPT_LINE] = 0x0;
> > > +    pci_reset_regions(dev);
> > >       pci_update_mappings(dev);
> > > 
> > >       msi_reset(dev);
> > > @@ -771,6 +786,15 @@ static void pci_init_multifunction(PCIBus
> > > *bus, PCIDevice *dev, Error **errp)
> > >           dev->config[PCI_HEADER_TYPE] |=
> > > PCI_HEADER_TYPE_MULTI_FUNCTION;
> > >       }
> > > 
> > > +    /* With SR/IOV and ARI, a device at function 0 need not be a
> > > multifunction
> > > +     * device, as it may just be a VF that ended up with
> > > function
> > > 0 in
> > > +     * the legacy PCI interpretation. Avoid failing in such
> > > cases:
> > > +     */
> > > +    if (pci_is_vf(dev) &&
> > > +        dev->exp.sriov_vf.pf->cap_present &
> > > QEMU_PCI_CAP_MULTIFUNCTION) {
> > > +        return;
> > > +    }
> > > +
> > >       /*
> > >        * multifunction bit is interpreted in two ways as follows.
> > >        *   - all functions must set the bit to 1.
> > > @@ -962,6 +986,7 @@ void pci_register_bar(PCIDevice *pci_dev, int
> > > region_num,
> > >       uint64_t wmask;
> > >       pcibus_t size = memory_region_size(memory);
> > > 
> > > +    assert(!pci_is_vf(pci_dev)); /* VFs must use
> > > pcie_sriov_vf_register_bar */
> > >       assert(region_num >= 0);
> > >       assert(region_num < PCI_NUM_REGIONS);
> > >       if (size & (size-1)) {
> > > @@ -1060,11 +1085,44 @@ pcibus_t pci_get_bar_addr(PCIDevice
> > > *pci_dev, int region_num)
> > >       return pci_dev->io_regions[region_num].addr;
> > >   }
> > > 
> > > -static pcibus_t pci_bar_address(PCIDevice *d,
> > > -				int reg, uint8_t type, pcibus_t
> > > size)
> > > +
> > > +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> > > +                                        uint8_t type, pcibus_t
> > > size)
> > > +{
> > > +    pcibus_t new_addr;
> > > +    if (!pci_is_vf(d)) {
> > > +        int bar = pci_bar(d, reg);
> > > +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > > +            new_addr = pci_get_quad(d->config + bar);
> > > +        } else {
> > > +            new_addr = pci_get_long(d->config + bar);
> > > +        }
> > > +    } else {
> > > +        PCIDevice *pf = d->exp.sriov_vf.pf;
> > > +        uint16_t sriov_cap = pf->exp.sriov_cap;
> > > +        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
> > > +        uint16_t vf_offset = pci_get_word(pf->config + sriov_cap
> > > +
> > > PCI_SRIOV_VF_OFFSET);
> > > +        uint16_t vf_stride = pci_get_word(pf->config + sriov_cap
> > > +
> > > PCI_SRIOV_VF_STRIDE);
> > > +        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) /
> > > vf_stride;
> > > +
> > > +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > > +            new_addr = pci_get_quad(pf->config + bar);
> > > +        } else {
> > > +            new_addr = pci_get_long(pf->config + bar);
> > > +        }
> > > +        new_addr += vf_num * size;
> > > +    }
> > > +    if (reg != PCI_ROM_SLOT) {
> > > +        /* Preserve the rom enable bit */
> > > +        new_addr &= ~(size - 1);
> > > +    }
> > > +    return new_addr;
> > > +}
> > > +
> > > +pcibus_t pci_bar_address(PCIDevice *d,
> > > +                         int reg, uint8_t type, pcibus_t size)
> > >   {
> > >       pcibus_t new_addr, last_addr;
> > > -    int bar = pci_bar(d, reg);
> > >       uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
> > >       Object *machine = qdev_get_machine();
> > >       ObjectClass *oc = object_get_class(machine);
> > > @@ -1075,7 +1133,7 @@ static pcibus_t pci_bar_address(PCIDevice
> > > *d,
> > >           if (!(cmd & PCI_COMMAND_IO)) {
> > >               return PCI_BAR_UNMAPPED;
> > >           }
> > > -        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> > > +        new_addr = pci_config_get_bar_addr(d, reg, type, size);
> > >           last_addr = new_addr + size - 1;
> > >           /* Check if 32 bit BAR wraps around explicitly.
> > >            * TODO: make priorities correct and remove this work
> > > around.
> > > @@ -1090,11 +1148,7 @@ static pcibus_t pci_bar_address(PCIDevice
> > > *d,
> > >       if (!(cmd & PCI_COMMAND_MEMORY)) {
> > >           return PCI_BAR_UNMAPPED;
> > >       }
> > > -    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > > -        new_addr = pci_get_quad(d->config + bar);
> > > -    } else {
> > > -        new_addr = pci_get_long(d->config + bar);
> > > -    }
> > > +    new_addr = pci_config_get_bar_addr(d, reg, type, size);
> > >       /* the ROM slot has a specific enable bit */
> > >       if (reg == PCI_ROM_SLOT && !(new_addr &
> > > PCI_ROM_ADDRESS_ENABLE)) {
> > >           return PCI_BAR_UNMAPPED;
> > > @@ -1228,6 +1282,7 @@ void pci_default_write_config(PCIDevice *d,
> > > uint32_t addr, uint32_t val_in, int
> > > 
> > >       msi_write_config(d, addr, val_in, l);
> > >       msix_write_config(d, addr, val_in, l);
> > > +    pcie_sriov_config_write(d, addr, val_in, l);
> > >   }
> > > 
> > >   /***********************************************************/
> > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > index 6e28985..ba49c0f 100644
> > > --- a/hw/pci/pcie.c
> > > +++ b/hw/pci/pcie.c
> > > @@ -253,7 +253,7 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler
> > > *hotplug_dev, DeviceState *dev,
> > >        * Right now, only a device of function = 0 is allowed to
> > > be
> > >        * hot plugged/unplugged.
> > >        */
> > > -    assert(PCI_FUNC(pci_dev->devfn) == 0);
> > > +    assert(PCI_FUNC(pci_dev->devfn) == 0 || pci_is_vf(pci_dev));
> > > 
> > >       pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA,
> > >                                  PCI_EXP_SLTSTA_PDS);
> > > @@ -265,10 +265,11 @@ void
> > > pcie_cap_slot_hot_unplug_request_cb(HotplugHandler *hotplug_dev,
> > >                                            DeviceState *dev,
> > > Error
> > > **errp)
> > >   {
> > >       uint8_t *exp_cap;
> > > +    PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
> > > 
> > > -    pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev,
> > > &exp_cap, errp);
> > > +    pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
> > > 
> > > -   
> > >  pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
> > > +    pcie_cap_slot_push_attention_button(pdev);
> > >   }
> > 
> > This chunk is not directly liked to the patch, I would put it in a
> > different patch.
> 
> ok
> 
> > >  /* pci express slot for pci express root/downstream port
> > > @@ -408,7 +409,7 @@ void pcie_cap_slot_write_config(PCIDevice
> > > *dev,
> > >       }
> > > 
> > >       /*
> > > -     * If the slot is polulated, power indicator is off and
> > > power
> > > +     * If the slot is populated, power indicator is off and
> > > power
> > >        * controller is off, it is safe to detach the devices.
> > >        */
> > >       if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val &
> > > PCI_EXP_SLTCTL_PCC) &&
> > 
> > 
> > Same here. I am always happy to have this kind of types taken care
> > of,
> > but I think a separate patch would be cleaner.
> 
> Would you like it in the patch set as a separate patch or should I
> schedule it for a trivial patches set instead, maybe together with
> the
>  PCI_DEVICE thing?
> 
> > > diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> > > new file mode 100644
> > > index 0000000..14f31cc
> > > --- /dev/null
> > > +++ b/hw/pci/pcie_sriov.c
> > > @@ -0,0 +1,271 @@
> > > +/*
> > > + * pcie_sriov.c:
> > > + *
> > > + * Implementation of SR/IOV emulation support.
> > > + *
> > > + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
> > 
> > 2015 :)
> 
> thanks,
> 
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version
> > > 2
> > > or later.
> > > + * See the COPYING file in the top-level directory.
> > > + *
> > > + */
> > > +
> > > +#include "hw/pci/pci.h"
> > > +#include "hw/pci/pcie.h"
> > > +#include "hw/pci/pci_bus.h"
> > > +#include "qemu/range.h"
> > > +
> > > +//#define DEBUG_SRIOV
> > > +#ifdef DEBUG_SRIOV
> > > +# define SRIOV_DPRINTF(fmt, ...)                                
> > >   
> > >        \
> > > +    fprintf(stderr, "%s:%d " fmt, __func__, __LINE__, ##
> > > __VA_ARGS__)
> > > +#else
> > > +# define SRIOV_DPRINTF(fmt, ...) do {} while (0)
> > > +#endif
> > > +#define SRIOV_DEV_PRINTF(dev, fmt, ...)                         
> > >   
> > >        \
> > > +    SRIOV_DPRINTF("%s:%x "fmt, (dev)->name, (dev)->devfn, ##
> > > __VA_ARGS__)
> > > +
> > > +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const
> > > char
> > > *name);
> > > +static void unregister_vfs(PCIDevice *dev);
> > > +
> > > +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > +                        const char *vfname, uint16_t vf_dev_id,
> > > +                        uint16_t init_vfs, uint16_t total_vfs,
> > > +                        uint16_t vf_offset, uint16_t vf_stride)
> > > +{
> > > +    uint8_t *cfg = dev->config + offset;
> > > +    uint8_t *wmask;
> > > +
> > > +    pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
> > > +                        offset, PCI_EXT_CAP_SRIOV_SIZEOF);
> > > +    dev->exp.sriov_cap = offset;
> > > +    dev->exp.sriov_pf.num_vfs = 0;
> > > +    dev->exp.sriov_pf.vfname = g_strdup(vfname);
> > > +    dev->exp.sriov_pf.vf = NULL;
> > > +
> > > +    pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> > > +    pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> > > +
> > > +    /* Minimal page size support values required by the SR/IOV
> > > standard:
> > > +     * 0x553 << 12 = 0x553000 = 4K + 8K + 64K + 256K + 1M + 4M
> > > +     * Hard coded for simplicity in this version
> > > +     */
> > 
> > A define (with the above very good explanation) would be preferred.
> 
> Will do
> 
> > > +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, 0x553);
> > > +
> > > +    /* Default is to use 4K pages, software can modify it
> > > +     * to any of the supported bits
> > > +     */
> > > +    pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
> > > +
> > > +    /* Set up device ID and initial/total number of VFs
> > > available
> > > */
> > > +    pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
> > > +    pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
> > > +    pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
> > > +    pci_set_word(cfg + PCI_SRIOV_NUM_VF, 0);
> > > +
> > > +    /* Write enable control bits */
> > > +    wmask = dev->wmask + offset;
> > > +    pci_set_word(wmask + PCI_SRIOV_CTRL,
> > > +                 PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE |
> > > PCI_SRIOV_CTRL_ARI);
> > > +    pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
> > > +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
> > > +
> > > +    qdev_prop_set_bit(&dev->qdev, "multifunction", true);
> > > +}
> > > +
> > > +void pcie_sriov_pf_exit(PCIDevice *dev)
> > > +{
> > > +    SRIOV_DPRINTF("\n");
> > > +    unregister_vfs(dev);
> > > +    g_free((char *)dev->exp.sriov_pf.vfname);
> > > +    dev->exp.sriov_pf.vfname = NULL;
> > > +}
> > > +
> > > +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> > > +                               uint8_t type, dma_addr_t size)
> > > +{
> > > +    uint32_t addr;
> > > +    uint64_t wmask;
> > > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > > +
> > > +    assert(sriov_cap > 0);
> > > +    assert(region_num >= 0);
> > > +    assert(region_num < PCI_NUM_REGIONS);
> > > +    assert(region_num != PCI_ROM_SLOT);
> > > +
> > > +    wmask = ~(size - 1);
> > > +    addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
> > > +
> > > +    pci_set_long(dev->config + addr, type);
> > > +    if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
> > > +        type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > > +        pci_set_quad(dev->wmask + addr, wmask);
> > > +        pci_set_quad(dev->cmask + addr, ~0ULL);
> > > +    } else {
> > > +        pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
> > > +        pci_set_long(dev->cmask + addr, 0xffffffff);
> > > +    }
> > > +    dev->exp.sriov_pf.vf_bar_type[region_num] = type;
> > > +}
> > > +
> > > +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> > > +                                MemoryRegion *memory)
> > > +{
> > > +    PCIIORegion *r;
> > > +    uint8_t type;
> > > +    pcibus_t size = memory_region_size(memory);
> > > +
> > > +    assert(pci_is_vf(dev)); /* PFs must use pci_register_bar */
> > > +    assert(region_num >= 0);
> > > +    assert(region_num < PCI_NUM_REGIONS);
> > > +    type = dev->exp.sriov_vf.pf
> > > ->exp.sriov_pf.vf_bar_type[region_num];
> > > +
> > > +    if (size & (size-1)) {
> > 
> > We already have a  is_power_of_2 func defined in include/qemu/host
> > -utils.h
> 
> will fix,
> 
> > > +        fprintf(stderr, "ERROR: PCI region size must be a power"
> > > +                " of two - type=0x%x, size=0x%"FMT_PCIBUS"\n",
> > > type, size);
> > 
> > Maybe we should error_report instead of fprintf
> 
> Leftover from far too many rebases, will fix,
> 
> > > +        exit(1);
> > > +    }
> > > +
> > > +    r = &dev->io_regions[region_num];
> > > +    r->memory = memory;
> > > +    r->address_space =
> > > +        type & PCI_BASE_ADDRESS_SPACE_IO
> > > +        ? dev->bus->address_space_io
> > > +        : dev->bus->address_space_mem;
> > > +    r->size = size;
> > > +    r->type = type;
> > > +
> > > +    r->addr = pci_bar_address(dev, region_num, r->type, r
> > > ->size);
> > > +    if (r->addr != PCI_BAR_UNMAPPED) {
> > > +        memory_region_add_subregion_overlap(r->address_space,
> > > +                                            r->addr, r->memory,
> > > 1);
> > > +    }
> > > +}
> > > +
> > > +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const
> > > char
> > > *name)
> > > +{
> > > +    PCIDevice *dev = pci_create(pf->bus, devfn, name);
> > > +    dev->exp.sriov_vf.pf = pf;
> > > +    Error *local_err = NULL;
> > > +
> > > +    object_property_set_bool(OBJECT(&dev->qdev), true,
> > > "realized",
> > > &local_err);
> > > +    if (local_err) {
> > > +        fprintf(stderr, "Failed to enable: %s\n",
> > > +                error_get_pretty(local_err));
> > > +        error_free(local_err);
> > 
> > and error_report_err here
> 
> ok
> 
> > > +        return NULL;
> > > +    }
> > > +
> > > +    /* set vid/did according to sr/iov spec - they are not used
> > > */
> > > +    pci_config_set_vendor_id(dev->config, 0xffff);
> > > +    pci_config_set_device_id(dev->config, 0xffff);
> > > +    return dev;
> > > +}
> > > +
> > > +static void register_vfs(PCIDevice *dev)
> > > +{
> > > +    uint16_t num_vfs;
> > > +    uint16_t i;
> > > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > > +    uint16_t vf_offset = pci_get_word(dev->config + sriov_cap +
> > > PCI_SRIOV_VF_OFFSET);
> > > +    uint16_t vf_stride = pci_get_word(dev->config + sriov_cap +
> > > PCI_SRIOV_VF_STRIDE);
> > > +    int32_t devfn = dev->devfn + vf_offset;
> > > +
> > > +    assert(sriov_cap > 0);
> > > +    num_vfs = pci_get_word(dev->config + sriov_cap +
> > > PCI_SRIOV_NUM_VF);
> > > +
> > > +    dev->exp.sriov_pf.vf = g_malloc(sizeof(PCIDevice *) *
> > > num_vfs);
> > > +    assert(dev->exp.sriov_pf.vf);
> > > +
> > > +    SRIOV_DEV_PRINTF(dev, "creating %d vf devs\n", num_vfs);
> > > +    for (i = 0; i < num_vfs; i++) {
> > > +        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn, dev
> > > ->exp.sriov_pf.vfname);
> > > +        if (!dev->exp.sriov_pf.vf[i]) {
> > > +            SRIOV_DEV_PRINTF(dev, "Failed to create VF %d at
> > > devfn
> > > %x.%x\n", i,
> > > +                             PCI_SLOT(devfn), PCI_FUNC(devfn));
> > > +            num_vfs = i;
> > > +            break;
> > > +        }
> > > +        devfn += vf_stride;
> > > +    }
> > > +    dev->exp.sriov_pf.num_vfs = num_vfs;
> > > +}
> > > +
> > > +static void unregister_vfs(PCIDevice *dev)
> > > +{
> > > +    Error *local_err = NULL;
> > > +    uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
> > > +    uint16_t i;
> > > +    SRIOV_DEV_PRINTF(dev, "Unregistering %d vf devs\n",
> > > num_vfs);
> > > +    for (i = 0; i < num_vfs; i++) {
> > > +        object_property_set_bool(OBJECT(&dev->exp.sriov_pf.vf[i]
> > > ->qdev), false, "realized", &local_err);
> > > +        if (local_err) {
> > > +            fprintf(stderr, "Failed to unplug: %s\n",
> > > +                    error_get_pretty(local_err));
> > > +            error_free(local_err);
> > 
> > error_report_err
> > 
> > > +        }
> > > +    }
> > > +    g_free(dev->exp.sriov_pf.vf);
> > > +    dev->exp.sriov_pf.vf = NULL;
> > > +    dev->exp.sriov_pf.num_vfs = 0;
> > > +}
> > > +
> > > +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> > > uint32_t val, int len)
> > > +{
> > > +    uint32_t off;
> > > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > > +
> > > +    if (!sriov_cap || address < sriov_cap) {
> > > +        return;
> > > +    }
> > > +    off = address - sriov_cap;
> > > +    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> > > +        return;
> > > +    }
> > > +
> > > +    SRIOV_DEV_PRINTF(dev, "cap at 0x%x sriov offset 0x%x val
> > > 0x%x
> > > len %d\n",
> > > +                 sriov_cap, off, val, len);
> > > +
> > > +    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> > > +        if (dev->exp.sriov_pf.num_vfs) {
> > > +            if (!(val & PCI_SRIOV_CTRL_VFE)) {
> > > +                unregister_vfs(dev);
> > > +            }
> > > +        } else {
> > > +            if (val & PCI_SRIOV_CTRL_VFE) {
> > > +                register_vfs(dev);
> > > +            }
> > > +        }
> > > +    }
> > > +}
> > > +
> > > +
> > > +/* Reset SR/IOV VF Enable bit to trigger an unregister of all
> > > VFs
> > > */
> > > +void pcie_sriov_pf_disable_vfs(PCIDevice *dev)
> > > +{
> > > +    uint16_t sriov_cap = dev->exp.sriov_cap;
> > > +    if (sriov_cap) {
> > > +        uint32_t val = pci_get_byte(dev->config + sriov_cap +
> > > PCI_SRIOV_CTRL);
> > > +        if (val & PCI_SRIOV_CTRL_VFE) {
> > > +            val &= ~PCI_SRIOV_CTRL_VFE;
> > > +            pcie_sriov_config_write(dev, sriov_cap +
> > > PCI_SRIOV_CTRL, val, 1);
> > > +        }
> > > +    }
> > > +}
> > > +
> > > +/* Add optional supported page sizes to the mask of supported
> > > page
> > > sizes */
> > > +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t
> > > opt_sup_pgsize)
> > > +{
> > > +    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
> > > +    uint8_t *wmask = dev->wmask + dev->exp.sriov_cap;
> > > +
> > > +    uint16_t sup_pgsize = pci_get_word(cfg +
> > > PCI_SRIOV_SUP_PGSIZE);
> > > +
> > > +    sup_pgsize |= opt_sup_pgsize;
> > > +
> > > +    /* Make sure the new bits are set, and that system page size
> > > +     * also can be set to any of the new values according to
> > > spec:
> > > +     */
> > > +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, sup_pgsize);
> > > +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, sup_pgsize);
> > > +}
> > > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > > index 551cb3d..2e9d8ba 100644
> > > --- a/include/hw/pci/pci.h
> > > +++ b/include/hw/pci/pci.h
> > > @@ -11,8 +11,6 @@
> > >   /* PCI includes legacy ISA access.  */
> > >   #include "hw/isa/isa.h"
> > > 
> > > -#include "hw/pci/pcie.h"
> > > -
> > >   /* PCI bus */
> > > 
> > >   #define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) |
> > > ((func)
> > > & 0x07))
> > > @@ -132,6 +130,7 @@ enum {
> > >   #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
> > > 
> > >   #include "hw/pci/pci_regs.h"
> > > +#include "hw/pci/pcie.h"
> > 
> > Why is it moved here?
> 
> Probably an unintended side effect of rebasing..

Actually it is due to the need for PCI_NUM_REGIONS (declared above in
pci.h now) to declare the PCIESriovPF data type in pcie_sriov.h, which
I think should naturally be included by pcie.h similar to pcie_aer.h

Knut

> 
> > >   /* PCI HEADER_TYPE */
> > >   #define  PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
> > > @@ -421,6 +420,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus
> > > *,
> > > void *, int);
> > >   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> > >   void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void
> > > *opaque);
> > > 
> > > +pcibus_t pci_bar_address(PCIDevice *d,
> > > +                         int reg, uint8_t type, pcibus_t size);
> > > +
> > >   static inline void
> > >   pci_set_byte(uint8_t *config, uint8_t val)
> > >   {
> > > @@ -672,6 +674,11 @@ static inline int pci_is_express(const
> > > PCIDevice *d)
> > >       return d->cap_present & QEMU_PCI_CAP_EXPRESS;
> > >   }
> > > 
> > > +static inline int pci_is_vf(const PCIDevice *d)
> > > +{
> > > +    return d->exp.sriov_vf.pf != NULL;
> > > +}
> > > +
> > >   static inline uint32_t pci_config_size(const PCIDevice *d)
> > >   {
> > >       return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE :
> > > PCI_CONFIG_SPACE_SIZE;
> > > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > > index b48a7a2..b09f79a 100644
> > > --- a/include/hw/pci/pcie.h
> > > +++ b/include/hw/pci/pcie.h
> > > @@ -25,6 +25,7 @@
> > >   #include "hw/pci/pci_regs.h"
> > >   #include "hw/pci/pcie_regs.h"
> > >   #include "hw/pci/pcie_aer.h"
> > > +#include "hw/pci/pcie_sriov.h"
> > >   #include "hw/hotplug.h"
> > > 
> > >   typedef enum {
> > > @@ -74,6 +75,11 @@ struct PCIExpressDevice {
> > >       /* AER */
> > >       uint16_t aer_cap;
> > >       PCIEAERLog aer_log;
> > > +
> > > +    /* SR/IOV */
> > > +    uint16_t sriov_cap;
> > > +    PCIESriovPF sriov_pf;
> > > +    PCIESriovVF sriov_vf;
> > >   };
> > > 
> > >   #define COMPAT_PROP_PCP "power_controller_present"
> > > diff --git a/include/hw/pci/pcie_sriov.h
> > > b/include/hw/pci/pcie_sriov.h
> > > new file mode 100644
> > > index 0000000..061f5cf
> > > --- /dev/null
> > > +++ b/include/hw/pci/pcie_sriov.h
> > > @@ -0,0 +1,55 @@
> > > +/*
> > > + * pcie_sriov.h:
> > > + *
> > > + * Implementation of SR/IOV emulation support.
> > > + *
> > > + * Copyright (c) 2014 Knut Omang <knut.omang@oracle.com>
> > 
> > 2015
> > 
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version
> > > 2
> > > or later.
> > > + * See the COPYING file in the top-level directory.
> > > + *
> > > + */
> > > +
> > > +#ifndef QEMU_PCIE_SRIOV_H
> > > +#define QEMU_PCIE_SRIOV_H
> > > +
> > > +
> > > +
> > > +struct PCIESriovPF {
> > > +    uint16_t num_vfs;           /* Number of virtual functions
> > > created */
> > > +    uint8_t vf_bar_type[PCI_NUM_REGIONS];  /* Store type for
> > > each
> > > VF bar */
> > > +    const char *vfname;         /* Reference to the device type
> > > used for the VFs */
> > > +    PCIDevice **vf;             /* Pointer to an array of
> > > num_vfs
> > > VF devices */
> > > +};
> > > +
> > > +struct PCIESriovVF {
> > > +    PCIDevice *pf;              /* Pointer back to owner
> > > physical
> > > function */
> > > +};
> > > +
> > > +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> > > +                        const char *vfname, uint16_t vf_dev_id,
> > > +                        uint16_t init_vfs, uint16_t total_vfs,
> > > +                        uint16_t vf_offset, uint16_t vf_stride);
> > > +void pcie_sriov_pf_exit(PCIDevice *dev);
> > > +
> > > +/* Set up a VF bar in the SR/IOV bar area */
> > > +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> > > +                               uint8_t type, dma_addr_t size);
> > > +
> > > +/* Instantiate a bar for a VF */
> > > +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> > > +                                MemoryRegion *memory);
> > > +
> > > +/* Add optional supported page sizes to the mask of supported
> > > page
> > > sizes
> > > + * Page size values are interpreted as opt_sup_pgsize << 12.
> > > + */
> > > +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t
> > > opt_sup_pgsize);
> > > +
> > > +/* SR/IOV capability config write handler */
> > > +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> > > +                             uint32_t val, int len);
> > > +
> > > +/* Reset SR/IOV VF Enable bit to unregister all VFs */
> > > +void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
> > > +
> > > +#endif /* QEMU_PCIE_SRIOV_H */
> > > diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> > > index f8a9dd6..fa36b8c 100644
> > > --- a/include/qemu/typedefs.h
> > > +++ b/include/qemu/typedefs.h
> > > @@ -53,6 +53,8 @@ typedef struct PCIDevice PCIDevice;
> > >   typedef struct PCIEAERErr PCIEAERErr;
> > >   typedef struct PCIEAERLog PCIEAERLog;
> > >   typedef struct PCIEAERMsg PCIEAERMsg;
> > > +typedef struct PCIESriovPF PCIESriovPF;
> > > +typedef struct PCIESriovVF PCIESriovVF;
> > >   typedef struct PCIEPort PCIEPort;
> > >   typedef struct PCIESlot PCIESlot;
> > >   typedef struct PCIExpressDevice PCIExpressDevice;
> > > 
> > 
> > 
> > I hope I helped,
> 
> Definitely, 
> thanks for the review!
> 
> Knut
> 
> > Thanks,
> > Marcel
> > 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-10-14 12:26 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-12 12:36 [Qemu-devel] [PATCH v4 0/3] pcie: Add support for Single Root I/O Virtualization Knut Omang
2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 1/3] pci: Make use of the devfn property when registering new devices Knut Omang
2015-09-17 11:11   ` Marcel Apfelbaum
2015-09-17 12:12     ` Knut Omang
2015-09-17 12:41       ` Marcel Apfelbaum
2015-10-14 10:27         ` Knut Omang
2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 2/3] pci: Update pci_regs header Knut Omang
2015-10-07 13:32   ` Marcel Apfelbaum
2015-10-10 10:46     ` Knut Omang
2015-09-12 12:36 ` [Qemu-devel] [PATCH v4 3/3] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Knut Omang
2015-09-17 11:49   ` Marcel Apfelbaum
2015-09-17 14:10     ` Knut Omang
2015-10-07 13:45       ` Marcel Apfelbaum
2015-10-14 12:25       ` Knut Omang
2015-10-07 15:06   ` Marcel Apfelbaum
2015-10-11 13:56     ` Knut Omang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.