* [PATCH v2 0/9] Introduce vfio-pci-core subsystem
@ 2021-02-01 16:28 Max Gurtovoy
  2021-02-01 16:28 ` [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c Max Gurtovoy
                   ` (9 more replies)
  0 siblings, 10 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

Hi Alex and Cornelia,

This series splits the vfio_pci driver into two parts: a PCI driver and a
subsystem driver that also acts as a library of code. The PCI driver,
vfio_pci.ko, is used as before and binds to the subsystem driver,
vfio_pci_core.ko, to register with the VFIO subsystem. This patchset is
fully backward compatible. This is typical Linux subsystem framework
behaviour, and the same framework can also be adopted by vfio_mdev
devices, as shown in the sketch below.

This series addresses the issues that were raised in the previous attempt
to extend vfio-pci for vendor specific functionality by Yan Zhao:
https://lkml.org/lkml/2020/5/17/376.

This solution is also deterministic, in the sense that when a user binds a
device to a vendor specific vfio-pci driver, it gets all the special
capabilities of the HW.
 
This subsystem framework will also ease adding vendor specific
functionality to VFIO devices in the future, by allowing another module to
provide the pci_driver that can set up a number of details (such as
injecting its own operations) before registering with the VFIO subsystem.
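
As a rough, illustrative sketch only, a vendor pci_driver would plug into
vfio_pci_core along these lines (the interface names come from patch 2/9;
the driver name and the omitted error paths are hypothetical):

static const struct vfio_pci_device_ops my_vendor_vfio_pci_ops = {
	.name	= "my-vendor-vfio-pci",	/* hypothetical driver name */
	.module	= THIS_MODULE,
};

static int my_vendor_vfio_pci_probe(struct pci_dev *pdev,
				    const struct pci_device_id *id)
{
	struct vfio_pci_core_device *vpdev;

	/* vfio_pci_core registers the device with the VFIO subsystem */
	vpdev = vfio_create_pci_device(pdev, &my_vendor_vfio_pci_ops, NULL);
	if (IS_ERR(vpdev))
		return PTR_ERR(vpdev);

	return 0;
}

static void my_vendor_vfio_pci_remove(struct pci_dev *pdev)
{
	vfio_destroy_pci_device(pdev);
}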

Below we can see the proposed changes (this patchset only deals with the
VFIO_PCI subsystem, but it can easily be extended to the VFIO_MDEV
subsystem as well):

+------------------------------------------------------------------+
|                                                                  |
|                               VFIO                               |
|                                                                  |
+------------------------------------------------------------------+

+------------------------------+    +------------------------------+
|                              |    |                              |
|        VFIO_PCI_CORE         |    |         VFIO_MDEV_CORE       |
|                              |    |                              |
+------------------------------+    +------------------------------+

+--------------+ +-------------+    +-------------+ +--------------+
|              | |             |    |             | |              |
|              | |             |    |             | |              |
|   VFIO_PCI   | |MLX5_VFIO_PCI|    |  VFIO_MDEV  | |MLX5_VFIO_MDEV|
|              | |             |    |             | |              |
|              | |             |    |             | |              |
+--------------+ +-------------+    +-------------+ +--------------+

The first three patches introduce the above changes for vfio_pci and
vfio_pci_core.

Patch 4/9 introduces a new mlx5 vfio-pci module that registers with the
VFIO subsystem using vfio_pci_core. It also registers with the auxiliary
bus in order to bind to mlx5_core, which is the parent of mlx5-vfio-pci
devices. This will allow extending mlx5-vfio-pci devices with HW specific
features such as Live Migration (the mlx5_core patches are not part of
this series, which only proposes the changes needed for the vfio-pci
subsystem).

For our testing and development we used two VirtIO-BLK physical functions
based on NVIDIA's BlueField-2 SNAP technology. These two PCI functions
were enumerated as 08:00.0 and 09:00.0 by the hypervisor.

The BlueField-2 device driver, mlx5_core, will create an auxiliary device
for each VirtIO-BLK SNAP PF (the "parent"/"manager" of each VirtIO-BLK PF
is what actually issues the auxiliary device creation).

These auxiliary devices will be seen on the auxiliary bus as:
mlx5_core.vfio_pci.2048 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/mlx5_core.vfio_pci.2048
mlx5_core.vfio_pci.2304 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/mlx5_core.vfio_pci.2304

2048 is the decimal representation of BDF 08:00.0 (its parent is
0000:07:00.0, BlueField-2 p0) and 2304 is the decimal representation of
BDF 09:00.0 (its parent is 0000:07:00.1, BlueField-2 p1). In this manner,
the administrator can locate the correct vfio-pci module to bind the
desired BDF to, by following the auxiliary driver of that BDF back to its
module.
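
For reference, the decimal suffix is just the bus/devfn encoding of the
BDF. A hypothetical helper, shown only to illustrate the math:

static unsigned int bdf_to_aux_id(unsigned int bus, unsigned int dev,
				  unsigned int fn)
{
	/* 08:00.0 -> (0x08 << 8) | (0x00 << 3) | 0 = 2048
	 * 09:00.0 -> (0x09 << 8) | (0x00 << 3) | 0 = 2304
	 */
	return (bus << 8) | (dev << 3) | fn;
}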

Note: The discovery mechanism we used for the RFC might need some
      improvements as mentioned in the TODO list.

In this way, we'll use the HW vendor driver core to manage the lifecycle
of these devices. This is reasonable since only the vendor driver knows
the exact status of its internal state and the capabilities of its
accelerators, for example.

changes from v1:
 - Create private and public vfio-pci structures (from Alex)
 - register with the vfio core directly from vfio-pci-core (for now, expose
   minimal public functionality to vfio-pci drivers; this removes the
   implicit behaviour from v1, and more functionality can be exposed to the
   drivers in the future)
 - Add patch 3/9 to emphasize the needed extension for LM feature (From
   Cornelia)
 - take/release refcount for the pci module during core open/release
 - rename the nvlink, igd and zdev code to PowerNV, x86 and s390 extensions
   of the vfio-pci core
 - fix segfault bugs in current vfio-pci zdev code

TODOs:
1. Create a subsystem module for VFIO_MDEV. This can be used for vendor
   specific Scalable Functions (SFs), for example.
2. Add Live migration functionality for mlx5 SNAP devices
   (NVMe/VirtIO-BLK).
3. Add Live migration functionality for mlx5 VFs.
4. Add the needed functionality for mlx5_core.
5. Improve the userspace interface and the discovery mechanism.
6. Move the VGA code to the x86 extension.

I would like to thank the great team that was involved in this
development, design and internal review:
Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy
and others.

This series applies cleanly on top of kernel 5.11-rc5+ commit 13391c60da33:
"Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6"
from Linus.

Note: Live migration for MLX5 SNAP devices is a WIP and can serve as the
      first example of adding a vendor extension to vfio-pci devices. As
      the changes to the subsystem must be defined as a pre-condition for
      this work, we've decided to split the submission for now.

Max Gurtovoy (9):
  vfio-pci: rename vfio_pci.c to vfio_pci_core.c
  vfio-pci: introduce vfio_pci_core subsystem driver
  vfio-pci-core: export vfio_pci_register_dev_region function
  mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
  vfio-pci/zdev: remove unused vdev argument
  vfio-pci/zdev: fix possible segmentation fault issue
  vfio/pci: use s390 naming instead of zdev
  vfio/pci: use x86 naming instead of igd
  vfio/pci: use powernv naming instead of nvlink2

 drivers/vfio/pci/Kconfig                      |   57 +-
 drivers/vfio/pci/Makefile                     |   16 +-
 drivers/vfio/pci/mlx5_vfio_pci.c              |  253 ++
 drivers/vfio/pci/vfio_pci.c                   | 2384 +----------------
 drivers/vfio/pci/vfio_pci_config.c            |   56 +-
 drivers/vfio/pci/vfio_pci_core.c              | 2326 ++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.h              |   73 +
 drivers/vfio/pci/vfio_pci_intrs.c             |   22 +-
 ...{vfio_pci_nvlink2.c => vfio_pci_powernv.c} |   47 +-
 drivers/vfio/pci/vfio_pci_private.h           |   44 +-
 drivers/vfio/pci/vfio_pci_rdwr.c              |   14 +-
 .../pci/{vfio_pci_zdev.c => vfio_pci_s390.c}  |   28 +-
 .../pci/{vfio_pci_igd.c => vfio_pci_x86.c}    |   18 +-
 include/linux/mlx5/vfio_pci.h                 |   36 +
 14 files changed, 2916 insertions(+), 2458 deletions(-)
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.h
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} (89%)
 rename drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} (80%)
 rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (89%)
 create mode 100644 include/linux/mlx5/vfio_pci.h

-- 
2.25.4



* [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
@ 2021-02-01 16:28 ` Max Gurtovoy
  2021-02-01 16:28 ` [PATCH 2/9] vfio-pci: introduce vfio_pci_core subsystem driver Max Gurtovoy
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

This is a preparation patch for splitting the vfio_pci driver into a
subsystem driver and a generic PCI driver. This patch doesn't change any
logic.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Makefile                        | 2 +-
 drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} (100%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 781e0809d6ee..d5555d350b9b 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci_core.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci.c
rename to drivers/vfio/pci/vfio_pci_core.c
-- 
2.25.4



* [PATCH 2/9] vfio-pci: introduce vfio_pci_core subsystem driver
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
  2021-02-01 16:28 ` [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c Max Gurtovoy
@ 2021-02-01 16:28 ` Max Gurtovoy
  2021-02-01 16:28 ` [PATCH 3/9] vfio-pci-core: export vfio_pci_register_dev_region function Max Gurtovoy
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

Split the vfio_pci driver into two parts, the 'struct pci_driver'
(vfio_pci) and a library of code (vfio_pci_core) that helps create a
VFIO device on top of a PCI device.

As before, vfio_pci.ko continues to present the same interface under
sysfs, and this change should have no functional impact.

vfio_pci_core exposes an interface that is similar to a typical Linux
subsystem, in that a pci_driver doing probe() can set up a number of
details and then create the VFIO char device.

Allowing another module to provide the pci_driver lets that module
customize how VFIO is set up, inject its own operations, and easily
extend vendor specific functionality.

This is complementary to how VFIO's mediated devices work. Instead of
custom device lifecycle management and a special bus, drivers using this
approach rely on the normal driver core lifecycle (e.g. bind/unbind)
management, and this is optimized to effectively support customizations
that make only small modifications to what vfio_pci would normally do.

This approach is also a pluggable alternative to the hard-wired
CONFIG_VFIO_PCI_IGD and CONFIG_VFIO_PCI_NVLINK2 "drivers" that are built
into vfio-pci. Using this work, all of that code can be moved to
dedicated device-specific modules and cleanly split out of the generic
vfio_pci driver.

Below is an example of adding a new driver to the vfio-pci subsystem:
	+----------------------------------+
	|                                  |
	|           VFIO                   |
	|                                  |
	+----------------------------------+

	+----------------------------------+
	|                                  |
	|           VFIO_PCI_CORE          |
	|                                  |
	+----------------------------------+

	+----------------+ +---------------+
	|                | |               |
	|  VFIO_PCI      | | MLX5_VFIO_PCI |
	|                | |               |
	+----------------+ +---------------+

In this way mlx5_vfio_pci will use vfio_pci_core to register with the
VFIO subsystem and also use the generic PCI functionality exported from
it. Additionally, it will add the vendor specific logic needed for HW
specific features such as Live Migration.
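
A hedged sketch of how such a driver's ops might look on top of this
interface (the ->init() hook matches the one called from
vfio_pci_enable() in the diff below; the mlx5 details here are invented
purely for illustration, the real driver is added in patch 4/9):

/* Sketch only: ->init() lets the vendor module run HW specific setup
 * while vfio_pci_core enables the device.
 */
static int mlx5_vfio_pci_init(struct vfio_pci_core_device *vpdev)
{
	/* e.g. discover and expose a vendor live-migration region here */
	return 0;
}

static const struct vfio_pci_device_ops mlx5_vfio_pci_ops = {
	.name	= "mlx5-vfio-pci",
	.module	= THIS_MODULE,
	.init	= mlx5_vfio_pci_init,
};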

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Kconfig            |  24 +-
 drivers/vfio/pci/Makefile           |  13 +-
 drivers/vfio/pci/vfio_pci.c         | 202 ++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c  |  56 ++--
 drivers/vfio/pci/vfio_pci_core.c    | 392 ++++++++++------------------
 drivers/vfio/pci/vfio_pci_core.h    |  45 ++++
 drivers/vfio/pci/vfio_pci_igd.c     |   2 +-
 drivers/vfio/pci/vfio_pci_intrs.c   |  22 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c |  24 +-
 drivers/vfio/pci/vfio_pci_private.h |   4 +-
 drivers/vfio/pci/vfio_pci_rdwr.c    |  14 +-
 drivers/vfio/pci/vfio_pci_zdev.c    |   2 +-
 12 files changed, 471 insertions(+), 329 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 40a223381ab6..b958a48f63a0 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-config VFIO_PCI
-	tristate "VFIO support for PCI devices"
+config VFIO_PCI_CORE
+	tristate "VFIO core support for PCI devices"
 	depends on VFIO && PCI && EVENTFD
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
@@ -10,9 +10,17 @@ config VFIO_PCI
 
 	  If you don't know what to do here, say N.
 
+config VFIO_PCI
+	tristate "VFIO support for PCI devices"
+	depends on VFIO_PCI_CORE
+	help
+	  This provides a generic PCI support using the VFIO framework.
+
+	  If you don't know what to do here, say N.
+
 config VFIO_PCI_VGA
 	bool "VFIO PCI support for VGA devices"
-	depends on VFIO_PCI && X86 && VGA_ARB
+	depends on VFIO_PCI_CORE && X86 && VGA_ARB
 	help
 	  Support for VGA extension to VFIO PCI.  This exposes an additional
 	  region on VGA devices for accessing legacy VGA addresses used by
@@ -21,16 +29,16 @@ config VFIO_PCI_VGA
 	  If you don't know what to do here, say N.
 
 config VFIO_PCI_MMAP
-	depends on VFIO_PCI
+	depends on VFIO_PCI_CORE
 	def_bool y if !S390
 
 config VFIO_PCI_INTX
-	depends on VFIO_PCI
+	depends on VFIO_PCI_CORE
 	def_bool y if !S390
 
 config VFIO_PCI_IGD
 	bool "VFIO PCI extensions for Intel graphics (GVT-d)"
-	depends on VFIO_PCI && X86
+	depends on VFIO_PCI_CORE && X86
 	default y
 	help
 	  Support for Intel IGD specific extensions to enable direct
@@ -42,13 +50,13 @@ config VFIO_PCI_IGD
 
 config VFIO_PCI_NVLINK2
 	def_bool y
-	depends on VFIO_PCI && PPC_POWERNV
+	depends on VFIO_PCI_CORE && PPC_POWERNV
 	help
 	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
 
 config VFIO_PCI_ZDEV
 	bool "VFIO PCI ZPCI device CLP support"
-	depends on VFIO_PCI && S390
+	depends on VFIO_PCI_CORE && S390
 	default y
 	help
 	  Enabling this option exposes VFIO capabilities containing hardware
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index d5555d350b9b..3f2a27e222cd 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,8 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
-vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
-
+obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+
+vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
+
+vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
new file mode 100644
index 000000000000..fa97420953d8
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+
+#include "vfio_pci_core.h"
+
+#define DRIVER_VERSION  "0.2"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool enable_sriov;
+#ifdef CONFIG_PCI_IOV
+module_param(enable_sriov, bool, 0644);
+MODULE_PARM_DESC(enable_sriov, "Enable support for SR-IOV configuration.  Enabling SR-IOV on a PF typically requires support of the userspace PF driver, enabling VFs without such support may result in non-functional VFs or PF.");
+#endif
+
+static bool disable_denylist;
+module_param(disable_denylist, bool, 0444);
+MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
+
+static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
+{
+	switch (pdev->vendor) {
+	case PCI_VENDOR_ID_INTEL:
+		switch (pdev->device) {
+		case PCI_DEVICE_ID_INTEL_QAT_C3XXX:
+		case PCI_DEVICE_ID_INTEL_QAT_C3XXX_VF:
+		case PCI_DEVICE_ID_INTEL_QAT_C62X:
+		case PCI_DEVICE_ID_INTEL_QAT_C62X_VF:
+		case PCI_DEVICE_ID_INTEL_QAT_DH895XCC:
+		case PCI_DEVICE_ID_INTEL_QAT_DH895XCC_VF:
+			return true;
+		default:
+			return false;
+		}
+	}
+
+	return false;
+}
+
+static bool vfio_pci_is_denylisted(struct pci_dev *pdev)
+{
+	if (!vfio_pci_dev_in_denylist(pdev))
+		return false;
+
+	if (disable_denylist) {
+		pci_warn(pdev,
+			 "device denylist disabled - allowing device %04x:%04x.\n",
+			 pdev->vendor, pdev->device);
+		return false;
+	}
+
+	pci_warn(pdev, "%04x:%04x exists in vfio-pci device denylist, driver probing disallowed.\n",
+		 pdev->vendor, pdev->device);
+
+	return true;
+}
+
+static const struct vfio_pci_device_ops vfio_pci_ops = {
+	.name		= "vfio-pci",
+	.module		= THIS_MODULE,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct vfio_pci_core_device *vpdev;
+
+	if (vfio_pci_is_denylisted(pdev))
+		return -EINVAL;
+
+	vpdev = vfio_create_pci_device(pdev, &vfio_pci_ops, NULL);
+	if (IS_ERR(vpdev))
+		return PTR_ERR(vpdev);
+
+	return 0;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+	vfio_destroy_pci_device(pdev);
+}
+
+static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
+{
+	might_sleep();
+
+	if (!enable_sriov)
+		return -ENOENT;
+
+	return vfio_pci_core_sriov_configure(pdev, nr_virtfn);
+}
+
+static const struct pci_error_handlers vfio_err_handlers = {
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
+static struct pci_driver vfio_pci_driver = {
+	.name			= "vfio-pci",
+	.id_table		= NULL, /* only dynamic ids */
+	.probe			= vfio_pci_probe,
+	.remove			= vfio_pci_remove,
+	.sriov_configure	= vfio_pci_sriov_configure,
+	.err_handler		= &vfio_err_handlers,
+};
+
+static void __exit vfio_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_pci_driver);
+}
+
+static void __init vfio_pci_fill_ids(void)
+{
+	char *p, *id;
+	int rc;
+
+	/* no ids passed actually */
+	if (ids[0] == '\0')
+		return;
+
+	/* add ids specified in the module parameter */
+	p = ids;
+	while ((id = strsep(&p, ","))) {
+		unsigned int vendor, device, subvendor = PCI_ANY_ID,
+			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
+		int fields;
+
+		if (!strlen(id))
+			continue;
+
+		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
+				&vendor, &device, &subvendor, &subdevice,
+				&class, &class_mask);
+
+		if (fields < 2) {
+			pr_warn("invalid id string \"%s\"\n", id);
+			continue;
+		}
+
+		rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
+				   subvendor, subdevice, class, class_mask, 0);
+		if (rc)
+			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask, rc);
+		else
+			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask);
+	}
+}
+
+static int __init vfio_pci_init(void)
+{
+	int ret;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_pci_driver);
+	if (ret)
+		return ret;
+
+	vfio_pci_fill_ids();
+
+	if (disable_denylist)
+		pr_warn("device denylist disabled.\n");
+
+	return 0;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a402adee8a21..105fbae9c035 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -183,7 +183,7 @@ static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos,
 
 	/* Any non-virtualized bits? */
 	if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) {
-		struct pci_dev *pdev = vdev->pdev;
+		struct pci_dev *pdev = vdev->vpdev.pdev;
 		__le32 phys_val = 0;
 		int ret;
 
@@ -224,7 +224,7 @@ static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos,
 
 	/* Non-virtualzed and writable bits go to hardware */
 	if (write & ~virt) {
-		struct pci_dev *pdev = vdev->pdev;
+		struct pci_dev *pdev = vdev->vpdev.pdev;
 		__le32 phys_val = 0;
 		int ret;
 
@@ -250,7 +250,7 @@ static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos,
 {
 	int ret;
 
-	ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+	ret = vfio_user_config_read(vdev->vpdev.pdev, pos, val, count);
 	if (ret)
 		return ret;
 
@@ -275,7 +275,7 @@ static int vfio_raw_config_write(struct vfio_pci_device *vdev, int pos,
 {
 	int ret;
 
-	ret = vfio_user_config_write(vdev->pdev, pos, val, count);
+	ret = vfio_user_config_write(vdev->vpdev.pdev, pos, val, count);
 	if (ret)
 		return ret;
 
@@ -288,7 +288,7 @@ static int vfio_raw_config_read(struct vfio_pci_device *vdev, int pos,
 {
 	int ret;
 
-	ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+	ret = vfio_user_config_read(vdev->vpdev.pdev, pos, val, count);
 	if (ret)
 		return ret;
 
@@ -398,7 +398,7 @@ static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
 /* Caller should hold memory_lock semaphore */
 bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
 
 	/*
@@ -415,7 +415,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
  */
 static void vfio_bar_restore(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	u32 *rbar = vdev->rbar;
 	u16 cmd;
 	int i;
@@ -462,7 +462,7 @@ static __le32 vfio_generate_bar_flags(struct pci_dev *pdev, int bar)
  */
 static void vfio_bar_fixup(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	int i;
 	__le32 *vbar;
 	u64 mask;
@@ -524,7 +524,7 @@ static int vfio_basic_config_read(struct vfio_pci_device *vdev, int pos,
 	count = vfio_default_config_read(vdev, pos, count, perm, offset, val);
 
 	/* Mask in virtual memory enable */
-	if (offset == PCI_COMMAND && vdev->pdev->no_command_memory) {
+	if (offset == PCI_COMMAND && vdev->vpdev.pdev->no_command_memory) {
 		u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
 		u32 tmp_val = le32_to_cpu(*val);
 
@@ -543,7 +543,7 @@ static bool vfio_need_bar_restore(struct vfio_pci_device *vdev)
 
 	for (; pos <= PCI_BASE_ADDRESS_5; i++, pos += 4) {
 		if (vdev->rbar[i]) {
-			ret = pci_user_read_config_dword(vdev->pdev, pos, &bar);
+			ret = pci_user_read_config_dword(vdev->vpdev.pdev, pos, &bar);
 			if (ret || vdev->rbar[i] != bar)
 				return true;
 		}
@@ -556,7 +556,7 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
 				   int count, struct perm_bits *perm,
 				   int offset, __le32 val)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	__le16 *virt_cmd;
 	u16 new_cmd = 0;
 	int ret;
@@ -751,7 +751,7 @@ static int vfio_vpd_config_write(struct vfio_pci_device *vdev, int pos,
 				 int count, struct perm_bits *perm,
 				 int offset, __le32 val)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	__le16 *paddr = (__le16 *)(vdev->vconfig + pos - offset + PCI_VPD_ADDR);
 	__le32 *pdata = (__le32 *)(vdev->vconfig + pos - offset + PCI_VPD_DATA);
 	u16 addr;
@@ -853,13 +853,13 @@ static int vfio_exp_config_write(struct vfio_pci_device *vdev, int pos,
 
 		*ctrl &= ~cpu_to_le16(PCI_EXP_DEVCTL_BCR_FLR);
 
-		ret = pci_user_read_config_dword(vdev->pdev,
+		ret = pci_user_read_config_dword(vdev->vpdev.pdev,
 						 pos - offset + PCI_EXP_DEVCAP,
 						 &cap);
 
 		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
-			pci_try_reset_function(vdev->pdev);
+			pci_try_reset_function(vdev->vpdev.pdev);
 			up_write(&vdev->memory_lock);
 		}
 	}
@@ -880,9 +880,9 @@ static int vfio_exp_config_write(struct vfio_pci_device *vdev, int pos,
 	if (readrq != (le16_to_cpu(*ctrl) & PCI_EXP_DEVCTL_READRQ)) {
 		readrq = 128 <<
 			((le16_to_cpu(*ctrl) & PCI_EXP_DEVCTL_READRQ) >> 12);
-		readrq = max(readrq, pcie_get_mps(vdev->pdev));
+		readrq = max(readrq, pcie_get_mps(vdev->vpdev.pdev));
 
-		pcie_set_readrq(vdev->pdev, readrq);
+		pcie_set_readrq(vdev->vpdev.pdev, readrq);
 	}
 
 	return count;
@@ -935,13 +935,13 @@ static int vfio_af_config_write(struct vfio_pci_device *vdev, int pos,
 
 		*ctrl &= ~PCI_AF_CTRL_FLR;
 
-		ret = pci_user_read_config_byte(vdev->pdev,
+		ret = pci_user_read_config_byte(vdev->vpdev.pdev,
 						pos - offset + PCI_AF_CAP,
 						&cap);
 
 		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
-			pci_try_reset_function(vdev->pdev);
+			pci_try_reset_function(vdev->vpdev.pdev);
 			up_write(&vdev->memory_lock);
 		}
 	}
@@ -1141,7 +1141,7 @@ static int vfio_msi_config_write(struct vfio_pci_device *vdev, int pos,
 
 		/* Write back to virt and to hardware */
 		*pflags = cpu_to_le16(flags);
-		ret = pci_user_write_config_word(vdev->pdev,
+		ret = pci_user_write_config_word(vdev->vpdev.pdev,
 						 start + PCI_MSI_FLAGS,
 						 flags);
 		if (ret)
@@ -1191,7 +1191,7 @@ static int init_pci_cap_msi_perm(struct perm_bits *perm, int len, u16 flags)
 /* Determine MSI CAP field length; initialize msi_perms on 1st call per vdev */
 static int vfio_msi_cap_len(struct vfio_pci_device *vdev, u8 pos)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	int len, ret;
 	u16 flags;
 
@@ -1224,7 +1224,7 @@ static int vfio_msi_cap_len(struct vfio_pci_device *vdev, u8 pos)
 /* Determine extended capability length for VC (2 & 9) and MFVC */
 static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	u32 tmp;
 	int ret, evcc, phases, vc_arb;
 	int len = PCI_CAP_VC_BASE_SIZEOF;
@@ -1265,7 +1265,7 @@ static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos)
 
 static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	u32 dword;
 	u16 word;
 	u8 byte;
@@ -1340,7 +1340,7 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
 
 static int vfio_ext_cap_len(struct vfio_pci_device *vdev, u16 ecap, u16 epos)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	u8 byte;
 	u32 dword;
 	int ret;
@@ -1415,7 +1415,7 @@ static int vfio_ext_cap_len(struct vfio_pci_device *vdev, u16 ecap, u16 epos)
 static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev,
 				   int offset, int size)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	int ret = 0;
 
 	/*
@@ -1461,7 +1461,7 @@ static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev,
 
 static int vfio_cap_init(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	u8 *map = vdev->pci_config_map;
 	u16 status;
 	u8 pos, *prev, cap;
@@ -1551,7 +1551,7 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
 
 static int vfio_ecap_init(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	u8 *map = vdev->pci_config_map;
 	u16 epos;
 	__le32 *prev = NULL;
@@ -1671,7 +1671,7 @@ static const struct pci_device_id known_bogus_vf_intx_pin[] = {
  */
 int vfio_config_init(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	u8 *map, *vconfig;
 	int ret;
 
@@ -1805,7 +1805,7 @@ static size_t vfio_pci_cap_remaining_dword(struct vfio_pci_device *vdev,
 static ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev, char __user *buf,
 				 size_t count, loff_t *ppos, bool iswrite)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	struct perm_bits *perm;
 	__le32 val = 0;
 	int cap_start = 0, offset;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 706de3ef94bb..d5bf01132c23 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -32,11 +32,7 @@
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
-#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
-
-static char ids[1024] __initdata;
-module_param_string(ids, ids, sizeof(ids), 0);
-MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+#define DRIVER_DESC "core driver for VFIO based PCI devices"
 
 static bool nointxmask;
 module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
@@ -54,16 +50,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-static bool enable_sriov;
-#ifdef CONFIG_PCI_IOV
-module_param(enable_sriov, bool, 0644);
-MODULE_PARM_DESC(enable_sriov, "Enable support for SR-IOV configuration.  Enabling SR-IOV on a PF typically requires support of the userspace PF driver, enabling VFs without such support may result in non-functional VFs or PF.");
-#endif
-
-static bool disable_denylist;
-module_param(disable_denylist, bool, 0444);
-MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
-
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -73,44 +59,6 @@ static inline bool vfio_vga_disabled(void)
 #endif
 }
 
-static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
-{
-	switch (pdev->vendor) {
-	case PCI_VENDOR_ID_INTEL:
-		switch (pdev->device) {
-		case PCI_DEVICE_ID_INTEL_QAT_C3XXX:
-		case PCI_DEVICE_ID_INTEL_QAT_C3XXX_VF:
-		case PCI_DEVICE_ID_INTEL_QAT_C62X:
-		case PCI_DEVICE_ID_INTEL_QAT_C62X_VF:
-		case PCI_DEVICE_ID_INTEL_QAT_DH895XCC:
-		case PCI_DEVICE_ID_INTEL_QAT_DH895XCC_VF:
-			return true;
-		default:
-			return false;
-		}
-	}
-
-	return false;
-}
-
-static bool vfio_pci_is_denylisted(struct pci_dev *pdev)
-{
-	if (!vfio_pci_dev_in_denylist(pdev))
-		return false;
-
-	if (disable_denylist) {
-		pci_warn(pdev,
-			 "device denylist disabled - allowing device %04x:%04x.\n",
-			 pdev->vendor, pdev->device);
-		return false;
-	}
-
-	pci_warn(pdev, "%04x:%04x exists in vfio-pci device denylist, driver probing disallowed.\n",
-		 pdev->vendor, pdev->device);
-
-	return true;
-}
-
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -122,7 +70,7 @@ static bool vfio_pci_is_denylisted(struct pci_dev *pdev)
 static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 {
 	struct vfio_pci_device *vdev = opaque;
-	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
+	struct pci_dev *tmp = NULL, *pdev = vdev->vpdev.pdev;
 	unsigned char max_busnr;
 	unsigned int decodes;
 
@@ -164,7 +112,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
 		int bar = i + PCI_STD_RESOURCES;
 
-		res = &vdev->pdev->resource[bar];
+		res = &vdev->vpdev.pdev->resource[bar];
 
 		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
 			goto no_mmap;
@@ -260,7 +208,7 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 
 static void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	u16 pmcsr;
 
 	if (!pdev->pm_cap)
@@ -280,7 +228,7 @@ static void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
  */
 int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	bool needs_restore = false, needs_save = false;
 	int ret;
 
@@ -311,7 +259,7 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
 
 static int vfio_pci_enable(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	int ret;
 	u16 cmd;
 	u8 msix_pos;
@@ -378,7 +326,6 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
 		vdev->has_vga = true;
 
-
 	if (vfio_pci_is_vga(pdev) &&
 	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
 	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
@@ -407,6 +354,12 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (vdev->vpdev.vfio_pci_ops->init) {
+		ret = vdev->vpdev.vfio_pci_ops->init(&vdev->vpdev);
+		if (ret)
+			goto disable_exit;
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
@@ -418,7 +371,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 
 static void vfio_pci_disable(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	struct vfio_pci_dummy_resource *dummy_res, *tmp;
 	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
 	int i, bar;
@@ -515,21 +468,19 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 }
 
-static struct pci_driver vfio_pci_driver;
-
 static struct vfio_pci_device *get_pf_vdev(struct vfio_pci_device *vdev,
 					   struct vfio_device **pf_dev)
 {
-	struct pci_dev *physfn = pci_physfn(vdev->pdev);
+	struct pci_dev *physfn = pci_physfn(vdev->vpdev.pdev);
 
-	if (!vdev->pdev->is_virtfn)
+	if (!vdev->vpdev.pdev->is_virtfn)
 		return NULL;
 
 	*pf_dev = vfio_device_get_from_dev(&physfn->dev);
 	if (!*pf_dev)
 		return NULL;
 
-	if (pci_dev_driver(physfn) != &vfio_pci_driver) {
+	if (pci_dev_driver(physfn) != pci_dev_driver(vdev->vpdev.pdev)) {
 		vfio_device_put(*pf_dev);
 		return NULL;
 	}
@@ -553,7 +504,7 @@ static void vfio_pci_vf_token_user_add(struct vfio_pci_device *vdev, int val)
 	vfio_device_put(pf_dev);
 }
 
-static void vfio_pci_release(void *device_data)
+static void vfio_pci_core_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
 
@@ -561,7 +512,7 @@ static void vfio_pci_release(void *device_data)
 
 	if (!(--vdev->refcnt)) {
 		vfio_pci_vf_token_user_add(vdev, -1);
-		vfio_spapr_pci_eeh_release(vdev->pdev);
+		vfio_spapr_pci_eeh_release(vdev->vpdev.pdev);
 		vfio_pci_disable(vdev);
 
 		mutex_lock(&vdev->igate);
@@ -578,15 +529,15 @@ static void vfio_pci_release(void *device_data)
 
 	mutex_unlock(&vdev->reflck->lock);
 
-	module_put(THIS_MODULE);
+	module_put(vdev->vpdev.vfio_pci_ops->module);
 }
 
-static int vfio_pci_open(void *device_data)
+static int vfio_pci_core_open(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
 	int ret = 0;
 
-	if (!try_module_get(THIS_MODULE))
+	if (!try_module_get(vdev->vpdev.vfio_pci_ops->module))
 		return -ENODEV;
 
 	mutex_lock(&vdev->reflck->lock);
@@ -596,14 +547,14 @@ static int vfio_pci_open(void *device_data)
 		if (ret)
 			goto error;
 
-		vfio_spapr_pci_eeh_open(vdev->pdev);
+		vfio_spapr_pci_eeh_open(vdev->vpdev.pdev);
 		vfio_pci_vf_token_user_add(vdev, 1);
 	}
 	vdev->refcnt++;
 error:
 	mutex_unlock(&vdev->reflck->lock);
 	if (ret)
-		module_put(THIS_MODULE);
+		module_put(vdev->vpdev.vfio_pci_ops->module);
 	return ret;
 }
 
@@ -613,19 +564,19 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
 		u8 pin;
 
 		if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) ||
-		    vdev->nointx || vdev->pdev->is_virtfn)
+		    vdev->nointx || vdev->vpdev.pdev->is_virtfn)
 			return 0;
 
-		pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, &pin);
+		pci_read_config_byte(vdev->vpdev.pdev, PCI_INTERRUPT_PIN, &pin);
 
 		return pin ? 1 : 0;
 	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
 		u8 pos;
 		u16 flags;
 
-		pos = vdev->pdev->msi_cap;
+		pos = vdev->vpdev.pdev->msi_cap;
 		if (pos) {
-			pci_read_config_word(vdev->pdev,
+			pci_read_config_word(vdev->vpdev.pdev,
 					     pos + PCI_MSI_FLAGS, &flags);
 			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
 		}
@@ -633,15 +584,15 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
 		u8 pos;
 		u16 flags;
 
-		pos = vdev->pdev->msix_cap;
+		pos = vdev->vpdev.pdev->msix_cap;
 		if (pos) {
-			pci_read_config_word(vdev->pdev,
+			pci_read_config_word(vdev->vpdev.pdev,
 					     pos + PCI_MSIX_FLAGS, &flags);
 
 			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
 		}
 	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
-		if (pci_is_pcie(vdev->pdev))
+		if (pci_is_pcie(vdev->vpdev.pdev))
 			return 1;
 	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
 		return 1;
@@ -797,8 +748,8 @@ struct vfio_devices {
 	int max_index;
 };
 
-static long vfio_pci_ioctl(void *device_data,
-			   unsigned int cmd, unsigned long arg)
+static long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
+		unsigned long arg)
 {
 	struct vfio_pci_device *vdev = device_data;
 	unsigned long minsz;
@@ -836,7 +787,7 @@ static long vfio_pci_ioctl(void *device_data,
 			int ret = vfio_pci_info_zdev_add_caps(vdev, &caps);
 
 			if (ret && ret != -ENODEV) {
-				pci_warn(vdev->pdev, "Failed to setup zPCI info capabilities\n");
+				pci_warn(vdev->vpdev.pdev, "Failed to setup zPCI info capabilities\n");
 				return ret;
 			}
 		}
@@ -863,7 +814,7 @@ static long vfio_pci_ioctl(void *device_data,
 			-EFAULT : 0;
 
 	} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
-		struct pci_dev *pdev = vdev->pdev;
+		struct pci_dev *pdev = vdev->vpdev.pdev;
 		struct vfio_region_info info;
 		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
 		int i, ret;
@@ -1023,7 +974,7 @@ static long vfio_pci_ioctl(void *device_data,
 		case VFIO_PCI_REQ_IRQ_INDEX:
 			break;
 		case VFIO_PCI_ERR_IRQ_INDEX:
-			if (pci_is_pcie(vdev->pdev))
+			if (pci_is_pcie(vdev->vpdev.pdev))
 				break;
 			fallthrough;
 		default:
@@ -1085,7 +1036,7 @@ static long vfio_pci_ioctl(void *device_data,
 			return -EINVAL;
 
 		vfio_pci_zap_and_down_write_memory_lock(vdev);
-		ret = pci_try_reset_function(vdev->pdev);
+		ret = pci_try_reset_function(vdev->vpdev.pdev);
 		up_write(&vdev->memory_lock);
 
 		return ret;
@@ -1108,13 +1059,13 @@ static long vfio_pci_ioctl(void *device_data,
 		hdr.flags = 0;
 
 		/* Can we do a slot or bus reset or neither? */
-		if (!pci_probe_reset_slot(vdev->pdev->slot))
+		if (!pci_probe_reset_slot(vdev->vpdev.pdev->slot))
 			slot = true;
-		else if (pci_probe_reset_bus(vdev->pdev->bus))
+		else if (pci_probe_reset_bus(vdev->vpdev.pdev->bus))
 			return -ENODEV;
 
 		/* How many devices are affected? */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+		ret = vfio_pci_for_each_slot_or_bus(vdev->vpdev.pdev,
 						    vfio_pci_count_devs,
 						    &fill.max, slot);
 		if (ret)
@@ -1138,7 +1089,7 @@ static long vfio_pci_ioctl(void *device_data,
 
 		fill.devices = devices;
 
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+		ret = vfio_pci_for_each_slot_or_bus(vdev->vpdev.pdev,
 						    vfio_pci_fill_devs,
 						    &fill, slot);
 
@@ -1181,9 +1132,9 @@ static long vfio_pci_ioctl(void *device_data,
 			return -EINVAL;
 
 		/* Can we do a slot or bus reset or neither? */
-		if (!pci_probe_reset_slot(vdev->pdev->slot))
+		if (!pci_probe_reset_slot(vdev->vpdev.pdev->slot))
 			slot = true;
-		else if (pci_probe_reset_bus(vdev->pdev->bus))
+		else if (pci_probe_reset_bus(vdev->vpdev.pdev->bus))
 			return -ENODEV;
 
 		/*
@@ -1192,7 +1143,7 @@ static long vfio_pci_ioctl(void *device_data,
 		 * could be.  Note groups can have multiple devices so
 		 * one group per device is the max.
 		 */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+		ret = vfio_pci_for_each_slot_or_bus(vdev->vpdev.pdev,
 						    vfio_pci_count_devs,
 						    &count, slot);
 		if (ret)
@@ -1255,7 +1206,7 @@ static long vfio_pci_ioctl(void *device_data,
 		 * Test whether all the affected devices are contained
 		 * by the set of groups provided by the user.
 		 */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+		ret = vfio_pci_for_each_slot_or_bus(vdev->vpdev.pdev,
 						    vfio_pci_validate_devs,
 						    &info, slot);
 		if (ret)
@@ -1275,7 +1226,7 @@ static long vfio_pci_ioctl(void *device_data,
 		 * the vma_lock for each device, and only then get each
 		 * memory_lock.
 		 */
-		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+		ret = vfio_pci_for_each_slot_or_bus(vdev->vpdev.pdev,
 					    vfio_pci_try_zap_and_vma_lock_cb,
 					    &devs, slot);
 		if (ret)
@@ -1295,7 +1246,7 @@ static long vfio_pci_ioctl(void *device_data,
 		}
 
 		/* User has access, do the reset */
-		ret = pci_reset_bus(vdev->pdev);
+		ret = pci_reset_bus(vdev->vpdev.pdev);
 
 hot_reset_release:
 		for (i = 0; i < devs.cur_index; i++) {
@@ -1404,11 +1355,10 @@ static long vfio_pci_ioctl(void *device_data,
 	return -ENOTTY;
 }
 
-static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
+static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
-	struct vfio_pci_device *vdev = device_data;
 
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
 		return -EINVAL;
@@ -1436,22 +1386,26 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 	return -EINVAL;
 }
 
-static ssize_t vfio_pci_read(void *device_data, char __user *buf,
-			     size_t count, loff_t *ppos)
+static ssize_t vfio_pci_core_read(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos)
 {
+	struct vfio_pci_device *vdev = device_data;
+
 	if (!count)
 		return 0;
 
-	return vfio_pci_rw(device_data, buf, count, ppos, false);
+	return vfio_pci_rw(vdev, buf, count, ppos, false);
 }
 
-static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
-			      size_t count, loff_t *ppos)
+static ssize_t vfio_pci_core_write(void *device_data,
+		const char __user *buf, size_t count, loff_t *ppos)
 {
+	struct vfio_pci_device *vdev = device_data;
+
 	if (!count)
 		return 0;
 
-	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
+	return vfio_pci_rw(vdev, (char __user *)buf, count, ppos, true);
 }
 
 /* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
@@ -1555,9 +1509,9 @@ u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev)
 	u16 cmd;
 
 	down_write(&vdev->memory_lock);
-	pci_read_config_word(vdev->pdev, PCI_COMMAND, &cmd);
+	pci_read_config_word(vdev->vpdev.pdev, PCI_COMMAND, &cmd);
 	if (!(cmd & PCI_COMMAND_MEMORY))
-		pci_write_config_word(vdev->pdev, PCI_COMMAND,
+		pci_write_config_word(vdev->vpdev.pdev, PCI_COMMAND,
 				      cmd | PCI_COMMAND_MEMORY);
 
 	return cmd;
@@ -1565,7 +1519,7 @@ u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev)
 
 void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev, u16 cmd)
 {
-	pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
+	pci_write_config_word(vdev->vpdev.pdev, PCI_COMMAND, cmd);
 	up_write(&vdev->memory_lock);
 }
 
@@ -1648,10 +1602,10 @@ static const struct vm_operations_struct vfio_pci_mmap_ops = {
 	.fault = vfio_pci_mmap_fault,
 };
 
-static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+static int vfio_pci_core_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_pci_device *vdev = device_data;
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	unsigned int index;
 	u64 phys_len, req_len, pgoff, req_start;
 	int ret;
@@ -1716,10 +1670,10 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	return 0;
 }
 
-static void vfio_pci_request(void *device_data, unsigned int count)
+static void vfio_pci_core_request(void *device_data, unsigned int count)
 {
 	struct vfio_pci_device *vdev = device_data;
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 
 	mutex_lock(&vdev->igate);
 
@@ -1740,6 +1694,7 @@ static void vfio_pci_request(void *device_data, unsigned int count)
 static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 				      bool vf_token, uuid_t *uuid)
 {
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	/*
 	 * There's always some degree of trust or collaboration between SR-IOV
 	 * PF and VFs, even if just that the PF hosts the SR-IOV capability and
@@ -1765,10 +1720,10 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 	 *
 	 * If the VF token is provided but unused, an error is generated.
 	 */
-	if (!vdev->pdev->is_virtfn && !vdev->vf_token && !vf_token)
+	if (!pdev->is_virtfn && !vdev->vf_token && !vf_token)
 		return 0; /* No VF token provided or required */
 
-	if (vdev->pdev->is_virtfn) {
+	if (pdev->is_virtfn) {
 		struct vfio_device *pf_dev;
 		struct vfio_pci_device *pf_vdev = get_pf_vdev(vdev, &pf_dev);
 		bool match;
@@ -1777,14 +1732,14 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 			if (!vf_token)
 				return 0; /* PF is not vfio-pci, no VF token */
 
-			pci_info_ratelimited(vdev->pdev,
+			pci_info_ratelimited(pdev,
 				"VF token incorrectly provided, PF not bound to vfio-pci\n");
 			return -EINVAL;
 		}
 
 		if (!vf_token) {
 			vfio_device_put(pf_dev);
-			pci_info_ratelimited(vdev->pdev,
+			pci_info_ratelimited(pdev,
 				"VF token required to access device\n");
 			return -EACCES;
 		}
@@ -1796,7 +1751,7 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 		vfio_device_put(pf_dev);
 
 		if (!match) {
-			pci_info_ratelimited(vdev->pdev,
+			pci_info_ratelimited(pdev,
 				"Incorrect VF token provided for device\n");
 			return -EACCES;
 		}
@@ -1805,14 +1760,14 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 		if (vdev->vf_token->users) {
 			if (!vf_token) {
 				mutex_unlock(&vdev->vf_token->lock);
-				pci_info_ratelimited(vdev->pdev,
+				pci_info_ratelimited(pdev,
 					"VF token required to access device\n");
 				return -EACCES;
 			}
 
 			if (!uuid_equal(uuid, &vdev->vf_token->uuid)) {
 				mutex_unlock(&vdev->vf_token->lock);
-				pci_info_ratelimited(vdev->pdev,
+				pci_info_ratelimited(pdev,
 					"Incorrect VF token provided for device\n");
 				return -EACCES;
 			}
@@ -1822,7 +1777,7 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 
 		mutex_unlock(&vdev->vf_token->lock);
 	} else if (vf_token) {
-		pci_info_ratelimited(vdev->pdev,
+		pci_info_ratelimited(pdev,
 			"VF token incorrectly provided, not a PF or VF\n");
 		return -EINVAL;
 	}
@@ -1832,18 +1787,19 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 
 #define VF_TOKEN_ARG "vf_token="
 
-static int vfio_pci_match(void *device_data, char *buf)
+static int vfio_pci_core_match(void *device_data, char *buf)
 {
 	struct vfio_pci_device *vdev = device_data;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	bool vf_token = false;
 	uuid_t uuid;
 	int ret;
 
-	if (strncmp(pci_name(vdev->pdev), buf, strlen(pci_name(vdev->pdev))))
+	if (strncmp(pci_name(pdev), buf, strlen(pci_name(pdev))))
 		return 0; /* No match */
 
-	if (strlen(buf) > strlen(pci_name(vdev->pdev))) {
-		buf += strlen(pci_name(vdev->pdev));
+	if (strlen(buf) > strlen(pci_name(pdev))) {
+		buf += strlen(pci_name(pdev));
 
 		if (*buf != ' ')
 			return 0; /* No match: non-whitespace after name */
@@ -1881,16 +1837,16 @@ static int vfio_pci_match(void *device_data, char *buf)
 	return 1; /* Match */
 }
 
-static const struct vfio_device_ops vfio_pci_ops = {
-	.name		= "vfio-pci",
-	.open		= vfio_pci_open,
-	.release	= vfio_pci_release,
-	.ioctl		= vfio_pci_ioctl,
-	.read		= vfio_pci_read,
-	.write		= vfio_pci_write,
-	.mmap		= vfio_pci_mmap,
-	.request	= vfio_pci_request,
-	.match		= vfio_pci_match,
+static const struct vfio_device_ops vfio_pci_core_ops = {
+	.name		= "vfio-pci-core",
+	.open		= vfio_pci_core_open,
+	.release	= vfio_pci_core_release,
+	.ioctl		= vfio_pci_core_ioctl,
+	.read		= vfio_pci_core_read,
+	.write		= vfio_pci_core_write,
+	.mmap		= vfio_pci_core_mmap,
+	.request	= vfio_pci_core_request,
+	.match		= vfio_pci_core_match,
 };
 
 static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
@@ -1906,17 +1862,18 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
 	struct pci_dev *physfn = pci_physfn(pdev);
 
 	if (action == BUS_NOTIFY_ADD_DEVICE &&
-	    pdev->is_virtfn && physfn == vdev->pdev) {
-		pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
+	    pdev->is_virtfn && physfn == vdev->vpdev.pdev) {
+		pci_info(vdev->vpdev.pdev,
+			 "Captured SR-IOV VF %s driver_override\n",
 			 pci_name(pdev));
 		pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
-						  vfio_pci_ops.name);
+						  vdev->vpdev.vfio_pci_ops->name);
 	} else if (action == BUS_NOTIFY_BOUND_DRIVER &&
-		   pdev->is_virtfn && physfn == vdev->pdev) {
+		   pdev->is_virtfn && physfn == vdev->vpdev.pdev) {
 		struct pci_driver *drv = pci_dev_driver(pdev);
 
-		if (drv && drv != &vfio_pci_driver)
-			pci_warn(vdev->pdev,
+		if (drv && drv != pci_dev_driver(vdev->vpdev.pdev))
+			pci_warn(vdev->vpdev.pdev,
 				 "VF %s bound to driver %s while PF bound to vfio-pci\n",
 				 pci_name(pdev), drv->name);
 	}
@@ -1924,17 +1881,16 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
 	return 0;
 }
 
-static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
+		const struct vfio_pci_device_ops *vfio_pci_ops,
+		void *dd_data)
 {
 	struct vfio_pci_device *vdev;
 	struct iommu_group *group;
 	int ret;
 
-	if (vfio_pci_is_denylisted(pdev))
-		return -EINVAL;
-
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 
 	/*
 	 * Prevent binding to PFs with VFs enabled, the VFs might be in use
@@ -1946,12 +1902,12 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	 */
 	if (pci_num_vf(pdev)) {
 		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
-		return -EBUSY;
+		return ERR_PTR(-EBUSY);
 	}
 
 	group = vfio_iommu_group_get(&pdev->dev);
 	if (!group)
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 
 	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
 	if (!vdev) {
@@ -1959,7 +1915,9 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto out_group_put;
 	}
 
-	vdev->pdev = pdev;
+	vdev->vpdev.pdev = pdev;
+	vdev->vpdev.vfio_pci_ops = vfio_pci_ops;
+	vdev->vpdev.dd_data = dd_data;
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	mutex_init(&vdev->igate);
 	spin_lock_init(&vdev->irqlock);
@@ -1970,7 +1928,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	INIT_LIST_HEAD(&vdev->vma_list);
 	init_rwsem(&vdev->memory_lock);
 
-	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_core_ops, vdev);
 	if (ret)
 		goto out_free;
 
@@ -2016,7 +1974,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 	}
 
-	return ret;
+	return &vdev->vpdev;
 
 out_vf_token:
 	kfree(vdev->vf_token);
@@ -2028,10 +1986,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	kfree(vdev);
 out_group_put:
 	vfio_iommu_group_put(group, &pdev->dev);
-	return ret;
+	return ERR_PTR(ret);
 }
+EXPORT_SYMBOL_GPL(vfio_create_pci_device);
 
-static void vfio_pci_remove(struct pci_dev *pdev)
+void vfio_destroy_pci_device(struct pci_dev *pdev)
 {
 	struct vfio_pci_device *vdev;
 
@@ -2069,9 +2028,10 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
 	}
 }
+EXPORT_SYMBOL_GPL(vfio_destroy_pci_device);
 
-static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
-						  pci_channel_state_t state)
+pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
+		pci_channel_state_t state)
 {
 	struct vfio_pci_device *vdev;
 	struct vfio_device *device;
@@ -2097,18 +2057,14 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected);
 
-static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
+int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 {
 	struct vfio_pci_device *vdev;
 	struct vfio_device *device;
 	int ret = 0;
 
-	might_sleep();
-
-	if (!enable_sriov)
-		return -ENOENT;
-
 	device = vfio_device_get_from_dev(&pdev->dev);
 	if (!device)
 		return -ENODEV;
@@ -2128,19 +2084,7 @@ static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 
 	return ret < 0 ? ret : nr_virtfn;
 }
-
-static const struct pci_error_handlers vfio_err_handlers = {
-	.error_detected = vfio_pci_aer_err_detected,
-};
-
-static struct pci_driver vfio_pci_driver = {
-	.name			= "vfio-pci",
-	.id_table		= NULL, /* only dynamic ids */
-	.probe			= vfio_pci_probe,
-	.remove			= vfio_pci_remove,
-	.sriov_configure	= vfio_pci_sriov_configure,
-	.err_handler		= &vfio_err_handlers,
-};
+EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
 
 static DEFINE_MUTEX(reflck_lock);
 
@@ -2173,13 +2117,13 @@ static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 	if (!device)
 		return 0;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	vdev = vfio_device_data(device);
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->vpdev.pdev)) {
 		vfio_device_put(device);
 		return 0;
 	}
 
-	vdev = vfio_device_data(device);
-
 	if (vdev->reflck) {
 		vfio_pci_reflck_get(vdev->reflck);
 		*preflck = vdev->reflck;
@@ -2193,12 +2137,12 @@ static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 
 static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
 {
-	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
+	bool slot = !pci_probe_reset_slot(vdev->vpdev.pdev->slot);
 
 	mutex_lock(&reflck_lock);
 
-	if (pci_is_root_bus(vdev->pdev->bus) ||
-	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
+	if (pci_is_root_bus(vdev->vpdev.pdev->bus) ||
+	    vfio_pci_for_each_slot_or_bus(vdev->vpdev.pdev, vfio_pci_reflck_find,
 					  &vdev->reflck, slot) <= 0)
 		vdev->reflck = vfio_pci_reflck_alloc();
 
@@ -2235,13 +2179,13 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 	if (!device)
 		return -EINVAL;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	vdev = vfio_device_data(device);
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->vpdev.pdev)) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
 
-	vdev = vfio_device_data(device);
-
 	/* Fault if the device is not unused */
 	if (vdev->refcnt) {
 		vfio_device_put(device);
@@ -2265,13 +2209,13 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
 	if (!device)
 		return -EINVAL;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	vdev = vfio_device_data(device);
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->vpdev.pdev)) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
 
-	vdev = vfio_device_data(device);
-
 	/*
 	 * Locking multiple devices is prone to deadlock, runaway and
 	 * unwind if we hit contention.
@@ -2308,12 +2252,12 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	bool slot = false;
 	struct vfio_pci_device *tmp;
 
-	if (!pci_probe_reset_slot(vdev->pdev->slot))
+	if (!pci_probe_reset_slot(vdev->vpdev.pdev->slot))
 		slot = true;
-	else if (pci_probe_reset_bus(vdev->pdev->bus))
+	else if (pci_probe_reset_bus(vdev->vpdev.pdev->bus))
 		return;
 
-	if (vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
+	if (vfio_pci_for_each_slot_or_bus(vdev->vpdev.pdev, vfio_pci_count_devs,
 					  &i, slot) || !i)
 		return;
 
@@ -2322,7 +2266,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	if (!devs.devices)
 		return;
 
-	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
+	if (vfio_pci_for_each_slot_or_bus(vdev->vpdev.pdev,
 					  vfio_pci_get_unused_devs,
 					  &devs, slot))
 		goto put_devs;
@@ -2331,7 +2275,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	for (i = 0; i < devs.cur_index; i++) {
 		tmp = vfio_device_data(devs.devices[i]);
 		if (tmp->needs_reset) {
-			ret = pci_reset_bus(vdev->pdev);
+			ret = pci_reset_bus(vdev->vpdev.pdev);
 			break;
 		}
 	}
@@ -2360,81 +2304,19 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 	kfree(devs.devices);
 }
 
-static void __exit vfio_pci_cleanup(void)
+static void __exit vfio_pci_core_cleanup(void)
 {
-	pci_unregister_driver(&vfio_pci_driver);
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(void)
+static int __init vfio_pci_core_init(void)
 {
-	char *p, *id;
-	int rc;
-
-	/* no ids passed actually */
-	if (ids[0] == '\0')
-		return;
-
-	/* add ids specified in the module parameter */
-	p = ids;
-	while ((id = strsep(&p, ","))) {
-		unsigned int vendor, device, subvendor = PCI_ANY_ID,
-			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
-		int fields;
-
-		if (!strlen(id))
-			continue;
-
-		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
-				&vendor, &device, &subvendor, &subdevice,
-				&class, &class_mask);
-
-		if (fields < 2) {
-			pr_warn("invalid id string \"%s\"\n", id);
-			continue;
-		}
-
-		rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
-				   subvendor, subdevice, class, class_mask, 0);
-		if (rc)
-			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask, rc);
-		else
-			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask);
-	}
-}
-
-static int __init vfio_pci_init(void)
-{
-	int ret;
-
 	/* Allocate shared config space permision data used by all devices */
-	ret = vfio_pci_init_perm_bits();
-	if (ret)
-		return ret;
-
-	/* Register and scan for devices */
-	ret = pci_register_driver(&vfio_pci_driver);
-	if (ret)
-		goto out_driver;
-
-	vfio_pci_fill_ids();
-
-	if (disable_denylist)
-		pr_warn("device denylist disabled.\n");
-
-	return 0;
-
-out_driver:
-	vfio_pci_uninit_perm_bits();
-	return ret;
+	return vfio_pci_init_perm_bits();
 }
 
-module_init(vfio_pci_init);
-module_exit(vfio_pci_cleanup);
+module_init(vfio_pci_core_init);
+module_exit(vfio_pci_core_cleanup);
 
 MODULE_VERSION(DRIVER_VERSION);
 MODULE_LICENSE("GPL v2");
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
new file mode 100644
index 000000000000..9833935af735
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
+ *
+ * Derived from original vfio:
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ */
+
+#include <linux/pci.h>
+#include <linux/vfio.h>
+
+#ifndef VFIO_PCI_CORE_H
+#define VFIO_PCI_CORE_H
+
+struct vfio_pci_device_ops;
+
+struct vfio_pci_core_device {
+	struct pci_dev				*pdev;
+	const struct vfio_pci_device_ops	*vfio_pci_ops;
+	void					*dd_data;
+};
+
+/**
+ * struct vfio_pci_device_ops - VFIO PCI device callbacks
+ *
+ * @init: Called when userspace creates new file descriptor for device
+ */
+struct vfio_pci_device_ops {
+	struct module	*module;
+	char		*name;
+	int		(*init)(struct vfio_pci_core_device *vpdev);
+};
+
+/* Exported functions */
+struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
+		const struct vfio_pci_device_ops *vfio_pci_ops,
+		void *dd_data);
+void vfio_destroy_pci_device(struct pci_dev *pdev);
+int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
+pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
+		pci_channel_state_t state);
+
+#endif /* VFIO_PCI_CORE_H */
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 53d97f459252..0cab3c2d35f6 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -59,7 +59,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 	void *base;
 	int ret;
 
-	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
+	ret = pci_read_config_dword(vdev->vpdev.pdev, OPREGION_PCI_ADDR, &addr);
 	if (ret)
 		return ret;
 
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 869dce5f134d..8222a0c7cc32 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -35,7 +35,7 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
 
 void vfio_pci_intx_mask(struct vfio_pci_device *vdev)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	unsigned long flags;
 
 	spin_lock_irqsave(&vdev->irqlock, flags);
@@ -74,7 +74,7 @@ void vfio_pci_intx_mask(struct vfio_pci_device *vdev)
 static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 {
 	struct vfio_pci_device *vdev = opaque;
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	unsigned long flags;
 	int ret = 0;
 
@@ -122,11 +122,11 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
 	if (!vdev->pci_2_3) {
-		disable_irq_nosync(vdev->pdev->irq);
+		disable_irq_nosync(vdev->vpdev.pdev->irq);
 		vdev->ctx[0].masked = true;
 		ret = IRQ_HANDLED;
 	} else if (!vdev->ctx[0].masked &&  /* may be shared */
-		   pci_check_and_mask_intx(vdev->pdev)) {
+		   pci_check_and_mask_intx(vdev->vpdev.pdev)) {
 		vdev->ctx[0].masked = true;
 		ret = IRQ_HANDLED;
 	}
@@ -144,7 +144,7 @@ static int vfio_intx_enable(struct vfio_pci_device *vdev)
 	if (!is_irq_none(vdev))
 		return -EINVAL;
 
-	if (!vdev->pdev->irq)
+	if (!vdev->vpdev.pdev->irq)
 		return -ENODEV;
 
 	vdev->ctx = kzalloc(sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL);
@@ -161,7 +161,7 @@ static int vfio_intx_enable(struct vfio_pci_device *vdev)
 	 */
 	vdev->ctx[0].masked = vdev->virq_disabled;
 	if (vdev->pci_2_3)
-		pci_intx(vdev->pdev, !vdev->ctx[0].masked);
+		pci_intx(vdev->vpdev.pdev, !vdev->ctx[0].masked);
 
 	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
 
@@ -170,7 +170,7 @@ static int vfio_intx_enable(struct vfio_pci_device *vdev)
 
 static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	unsigned long irqflags = IRQF_SHARED;
 	struct eventfd_ctx *trigger;
 	unsigned long flags;
@@ -246,7 +246,7 @@ static irqreturn_t vfio_msihandler(int irq, void *arg)
 
 static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	unsigned int flag = msix ? PCI_IRQ_MSIX : PCI_IRQ_MSI;
 	int ret;
 	u16 cmd;
@@ -288,7 +288,7 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 				      int vector, int fd, bool msix)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	struct eventfd_ctx *trigger;
 	int irq, ret;
 	u16 cmd;
@@ -387,7 +387,7 @@ static int vfio_msi_set_block(struct vfio_pci_device *vdev, unsigned start,
 
 static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	int i;
 	u16 cmd;
 
@@ -672,7 +672,7 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
 	case VFIO_PCI_ERR_IRQ_INDEX:
 		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
 		case VFIO_IRQ_SET_ACTION_TRIGGER:
-			if (pci_is_pcie(vdev->pdev))
+			if (pci_is_pcie(vdev->vpdev.pdev))
 				func = vfio_pci_set_err_trigger;
 			break;
 		}
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 9adcf6a8f888..80f0de332338 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -165,7 +165,7 @@ static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev,
 	ret = (int) mm_iommu_newdev(data->mm, data->useraddr,
 			vma_pages(vma), data->gpu_hpa, &data->mem);
 
-	trace_vfio_pci_nvgpu_mmap(vdev->pdev, data->gpu_hpa, data->useraddr,
+	trace_vfio_pci_nvgpu_mmap(vdev->vpdev.pdev, data->gpu_hpa, data->useraddr,
 			vma->vm_end - vma->vm_start, ret);
 
 	return ret;
@@ -222,7 +222,7 @@ int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
 	 * PCI config space does not tell us about NVLink presense but
 	 * platform does, use this.
 	 */
-	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
+	npu_dev = pnv_pci_get_npu_dev(vdev->vpdev.pdev, 0);
 	if (!npu_dev)
 		return -ENODEV;
 
@@ -243,7 +243,7 @@ int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
 		return -EINVAL;
 
 	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
-		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
+		dev_warn(&vdev->vpdev.pdev->dev, "No ibm,device-tgt-addr found\n");
 		return -EFAULT;
 	}
 
@@ -255,10 +255,10 @@ int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
 	data->gpu_tgt = tgt;
 	data->size = reg[1];
 
-	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
+	dev_dbg(&vdev->vpdev.pdev->dev, "%lx..%lx\n", data->gpu_hpa,
 			data->gpu_hpa + data->size - 1);
 
-	data->gpdev = vdev->pdev;
+	data->gpdev = vdev->vpdev.pdev;
 	data->group_notifier.notifier_call = vfio_pci_nvgpu_group_notifier;
 
 	ret = vfio_register_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
@@ -343,7 +343,7 @@ static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev,
 
 	ret = remap_pfn_range(vma, vma->vm_start, data->mmio_atsd >> PAGE_SHIFT,
 			req_len, vma->vm_page_prot);
-	trace_vfio_pci_npu2_mmap(vdev->pdev, data->mmio_atsd, vma->vm_start,
+	trace_vfio_pci_npu2_mmap(vdev->vpdev.pdev, data->mmio_atsd, vma->vm_start,
 			vma->vm_end - vma->vm_start, ret);
 
 	return ret;
@@ -394,7 +394,7 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 	struct vfio_pci_npu2_data *data;
 	struct device_node *nvlink_dn;
 	u32 nvlink_index = 0, mem_phandle = 0;
-	struct pci_dev *npdev = vdev->pdev;
+	struct pci_dev *npdev = vdev->vpdev.pdev;
 	struct device_node *npu_node = pci_device_to_OF_node(npdev);
 	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
 	u64 mmio_atsd = 0;
@@ -405,7 +405,7 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 	 * PCI config space does not tell us about NVLink presense but
 	 * platform does, use this.
 	 */
-	if (!pnv_pci_get_gpu_dev(vdev->pdev))
+	if (!pnv_pci_get_gpu_dev(vdev->vpdev.pdev))
 		return -ENODEV;
 
 	if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
@@ -427,21 +427,21 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 			&mmio_atsd)) {
 		if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", 0,
 				&mmio_atsd)) {
-			dev_warn(&vdev->pdev->dev, "No available ATSD found\n");
+			dev_warn(&vdev->vpdev.pdev->dev, "No available ATSD found\n");
 			mmio_atsd = 0;
 		} else {
-			dev_warn(&vdev->pdev->dev,
+			dev_warn(&vdev->vpdev.pdev->dev,
 				 "Using fallback ibm,mmio-atsd[0] for ATSD.\n");
 		}
 	}
 
 	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
-		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
+		dev_warn(&vdev->vpdev.pdev->dev, "No ibm,device-tgt-addr found\n");
 		return -EFAULT;
 	}
 
 	if (of_property_read_u32(npu_node, "ibm,nvlink-speed", &link_speed)) {
-		dev_warn(&vdev->pdev->dev, "No ibm,nvlink-speed found\n");
+		dev_warn(&vdev->vpdev.pdev->dev, "No ibm,nvlink-speed found\n");
 		return -EFAULT;
 	}
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5c90e560c5c7..82de00508377 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -15,6 +15,8 @@
 #include <linux/uuid.h>
 #include <linux/notifier.h>
 
+#include "vfio_pci_core.h"
+
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
 
@@ -100,7 +102,7 @@ struct vfio_pci_mmap_vma {
 };
 
 struct vfio_pci_device {
-	struct pci_dev		*pdev;
+	struct vfio_pci_core_device vpdev;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
 	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
 	u8			*pci_config_map;
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index a0b5fc8e46f4..d58dc3e863b0 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -202,7 +202,7 @@ static ssize_t do_io_rw(struct vfio_pci_device *vdev, bool test_mem,
 
 static int vfio_pci_setup_barmap(struct vfio_pci_device *vdev, int bar)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	int ret;
 	void __iomem *io;
 
@@ -227,13 +227,13 @@ static int vfio_pci_setup_barmap(struct vfio_pci_device *vdev, int bar)
 ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 			size_t count, loff_t *ppos, bool iswrite)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
 	int bar = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
 	size_t x_start = 0, x_end = 0;
 	resource_size_t end;
 	void __iomem *io;
-	struct resource *res = &vdev->pdev->resource[bar];
+	struct resource *res = &vdev->vpdev.pdev->resource[bar];
 	ssize_t done;
 
 	if (pci_resource_start(pdev, bar))
@@ -333,7 +333,7 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
 	if (!iomem)
 		return -ENOMEM;
 
-	ret = vga_get_interruptible(vdev->pdev, rsrc);
+	ret = vga_get_interruptible(vdev->vpdev.pdev, rsrc);
 	if (ret) {
 		is_ioport ? ioport_unmap(iomem) : iounmap(iomem);
 		return ret;
@@ -346,7 +346,7 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
 	 */
 	done = do_io_rw(vdev, false, iomem, buf, off, count, 0, 0, iswrite);
 
-	vga_put(vdev->pdev, rsrc);
+	vga_put(vdev->vpdev.pdev, rsrc);
 
 	is_ioport ? ioport_unmap(iomem) : iounmap(iomem);
 
@@ -413,7 +413,7 @@ static void vfio_pci_ioeventfd_thread(void *opaque, void *unused)
 long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
 			uint64_t data, int count, int fd)
 {
-	struct pci_dev *pdev = vdev->pdev;
+	struct pci_dev *pdev = vdev->vpdev.pdev;
 	loff_t pos = offset & VFIO_PCI_OFFSET_MASK;
 	int ret, bar = VFIO_PCI_OFFSET_TO_INDEX(offset);
 	struct vfio_pci_ioeventfd *ioeventfd;
@@ -480,7 +480,7 @@ long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
 	ioeventfd->pos = pos;
 	ioeventfd->bar = bar;
 	ioeventfd->count = count;
-	ioeventfd->test_mem = vdev->pdev->resource[bar].flags & IORESOURCE_MEM;
+	ioeventfd->test_mem = vdev->vpdev.pdev->resource[bar].flags & IORESOURCE_MEM;
 
 	ret = vfio_virqfd_enable(ioeventfd, vfio_pci_ioeventfd_handler,
 				 vfio_pci_ioeventfd_thread, NULL,
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 229685634031..7b20b34b1034 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -117,7 +117,7 @@ static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 				struct vfio_info_cap *caps)
 {
-	struct zpci_dev *zdev = to_zpci(vdev->pdev);
+	struct zpci_dev *zdev = to_zpci(vdev->vpdev.pdev);
 	int ret;
 
 	if (!zdev)
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 3/9] vfio-pci-core: export vfio_pci_register_dev_region function
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
  2021-02-01 16:28 ` [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c Max Gurtovoy
  2021-02-01 16:28 ` [PATCH 2/9] vfio-pci: introduce vfio_pci_core subsystem driver Max Gurtovoy
@ 2021-02-01 16:28 ` Max Gurtovoy
  2021-02-01 16:28 ` [PATCH 4/9] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices Max Gurtovoy
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

This function will be used to allow vendor drivers to register regions
that will be used and accessed by the core subsystem driver. This way,
the core will use region ops that are vendor specific and managed by
the vendor vfio-pci driver.

A possible next step is to move the igd and nvlink logic into dedicated
modules instead of managing their functionality in the core driver.
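
As a rough usage illustration (not part of this patch), a vendor driver
that holds a struct vfio_pci_core_device could register a private region
with the newly exported function roughly as below. The my_* names, the
vendor id, the subtype and the 4K payload are placeholders invented for
this sketch, not definitions from the series:

#include <linux/slab.h>
#include <linux/sizes.h>
#include <linux/vfio.h>
#include "vfio_pci_core.h"

#define MY_VENDOR_ID		0x1234	/* placeholder vendor id */
#define MY_REGION_SUBTYPE	1	/* placeholder subtype */

/* Vendor-specific handlers invoked by vfio-pci-core for this region */
static size_t my_region_rw(struct vfio_pci_core_device *vpdev,
			   char __user *buf, size_t count, loff_t *ppos,
			   bool iswrite)
{
	/* vendor-specific read/write of the region data would go here */
	return -EINVAL;
}

static void my_region_release(struct vfio_pci_core_device *vpdev,
			      struct vfio_pci_region *region)
{
	kfree(region->data);
}

static const struct vfio_pci_regops my_regops = {
	.rw		= my_region_rw,
	.release	= my_region_release,
};

static int my_register_region(struct vfio_pci_core_device *vpdev)
{
	void *data = kzalloc(SZ_4K, GFP_KERNEL);
	int ret;

	if (!data)
		return -ENOMEM;

	/* region is exposed to userspace after VFIO_PCI_NUM_REGIONS */
	ret = vfio_pci_register_dev_region(vpdev,
			MY_VENDOR_ID | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
			MY_REGION_SUBTYPE, &my_regops, SZ_4K,
			VFIO_REGION_INFO_FLAG_READ |
			VFIO_REGION_INFO_FLAG_WRITE, data);
	if (ret)
		kfree(data);

	return ret;
}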

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c    | 12 +++++-----
 drivers/vfio/pci/vfio_pci_core.h    | 28 ++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_igd.c     | 16 ++++++++------
 drivers/vfio/pci/vfio_pci_nvlink2.c | 25 +++++++++++----------
 drivers/vfio/pci/vfio_pci_private.h | 34 +++++------------------------
 5 files changed, 64 insertions(+), 51 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d5bf01132c23..a0a91331f575 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -395,7 +395,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 	vdev->virq_disabled = false;
 
 	for (i = 0; i < vdev->num_regions; i++)
-		vdev->region[i].ops->release(vdev, &vdev->region[i]);
+		vdev->region[i].ops->release(&vdev->vpdev, &vdev->region[i]);
 
 	vdev->num_regions = 0;
 	kfree(vdev->region);
@@ -716,11 +716,12 @@ static int msix_mmappable_cap(struct vfio_pci_device *vdev,
 	return vfio_info_add_capability(caps, &header, sizeof(header));
 }
 
-int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
+int vfio_pci_register_dev_region(struct vfio_pci_core_device *vpdev,
 				 unsigned int type, unsigned int subtype,
 				 const struct vfio_pci_regops *ops,
 				 size_t size, u32 flags, void *data)
 {
+	struct vfio_pci_device *vdev = vpdev_to_vdev(vpdev);
 	struct vfio_pci_region *region;
 
 	region = krealloc(vdev->region,
@@ -741,6 +742,7 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
 
 struct vfio_devices {
 	struct vfio_device **devices;
@@ -928,7 +930,7 @@ static long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
 				return ret;
 
 			if (vdev->region[i].ops->add_capability) {
-				ret = vdev->region[i].ops->add_capability(vdev,
+				ret = vdev->region[i].ops->add_capability(&vdev->vpdev,
 						&vdev->region[i], &caps);
 				if (ret)
 					return ret;
@@ -1379,7 +1381,7 @@ static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, char __user *buf,
 		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
 	default:
 		index -= VFIO_PCI_NUM_REGIONS;
-		return vdev->region[index].ops->rw(vdev, buf,
+		return vdev->region[index].ops->rw(&vdev->vpdev, buf,
 						   count, ppos, iswrite);
 	}
 
@@ -1622,7 +1624,7 @@ static int vfio_pci_core_mmap(void *device_data, struct vm_area_struct *vma)
 
 		if (region && region->ops && region->ops->mmap &&
 		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
-			return region->ops->mmap(vdev, region, vma);
+			return region->ops->mmap(&vdev->vpdev, region, vma);
 		return -EINVAL;
 	}
 	if (index >= VFIO_PCI_ROM_REGION_INDEX)
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index 9833935af735..0b227ee3f377 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -15,6 +15,7 @@
 #define VFIO_PCI_CORE_H
 
 struct vfio_pci_device_ops;
+struct vfio_pci_region;
 
 struct vfio_pci_core_device {
 	struct pci_dev				*pdev;
@@ -22,6 +23,29 @@ struct vfio_pci_core_device {
 	void					*dd_data;
 };
 
+struct vfio_pci_regops {
+	size_t	(*rw)(struct vfio_pci_core_device *vpdev, char __user *buf,
+		      size_t count, loff_t *ppos, bool iswrite);
+	void	(*release)(struct vfio_pci_core_device *vpdev,
+			   struct vfio_pci_region *region);
+	int	(*mmap)(struct vfio_pci_core_device *vpdev,
+			struct vfio_pci_region *region,
+			struct vm_area_struct *vma);
+	int	(*add_capability)(struct vfio_pci_core_device *vpdev,
+				  struct vfio_pci_region *region,
+				  struct vfio_info_cap *caps);
+};
+
+struct vfio_pci_region {
+	u32				type;
+	u32				subtype;
+	const struct vfio_pci_regops	*ops;
+	void				*data;
+	size_t				size;
+	u32				flags;
+};
+
+
 /**
  * struct vfio_pci_device_ops - VFIO PCI device callbacks
  *
@@ -41,5 +65,9 @@ void vfio_destroy_pci_device(struct pci_dev *pdev);
 int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
 pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
 		pci_channel_state_t state);
+int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
+		unsigned int type, unsigned int subtype,
+		const struct vfio_pci_regops *ops,
+		size_t size, u32 flags, void *data);
 
 #endif /* VFIO_PCI_CORE_H */
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 0cab3c2d35f6..0a9e0edbb0ac 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -21,9 +21,10 @@
 #define OPREGION_SIZE		(8 * 1024)
 #define OPREGION_PCI_ADDR	0xfc
 
-static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
+static size_t vfio_pci_igd_rw(struct vfio_pci_core_device *vpdev, char __user *buf,
 			      size_t count, loff_t *ppos, bool iswrite)
 {
+	struct vfio_pci_device *vdev = vpdev_to_vdev(vpdev);
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
 	void *base = vdev->region[i].data;
 	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
@@ -41,7 +42,7 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 	return count;
 }
 
-static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
+static void vfio_pci_igd_release(struct vfio_pci_core_device *vpdev,
 				 struct vfio_pci_region *region)
 {
 	memunmap(region->data);
@@ -90,7 +91,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 			return -ENOMEM;
 	}
 
-	ret = vfio_pci_register_dev_region(vdev,
+	ret = vfio_pci_register_dev_region(&vdev->vpdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
 		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
 		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
@@ -107,10 +108,11 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 	return ret;
 }
 
-static size_t vfio_pci_igd_cfg_rw(struct vfio_pci_device *vdev,
+static size_t vfio_pci_igd_cfg_rw(struct vfio_pci_core_device *vpdev,
 				  char __user *buf, size_t count, loff_t *ppos,
 				  bool iswrite)
 {
+	struct vfio_pci_device *vdev = vpdev_to_vdev(vpdev);
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
 	struct pci_dev *pdev = vdev->region[i].data;
 	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
@@ -200,7 +202,7 @@ static size_t vfio_pci_igd_cfg_rw(struct vfio_pci_device *vdev,
 	return count;
 }
 
-static void vfio_pci_igd_cfg_release(struct vfio_pci_device *vdev,
+static void vfio_pci_igd_cfg_release(struct vfio_pci_core_device *vpdev,
 				     struct vfio_pci_region *region)
 {
 	struct pci_dev *pdev = region->data;
@@ -228,7 +230,7 @@ static int vfio_pci_igd_cfg_init(struct vfio_pci_device *vdev)
 		return -EINVAL;
 	}
 
-	ret = vfio_pci_register_dev_region(vdev,
+	ret = vfio_pci_register_dev_region(&vdev->vpdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
 		VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG,
 		&vfio_pci_igd_cfg_regops, host_bridge->cfg_size,
@@ -248,7 +250,7 @@ static int vfio_pci_igd_cfg_init(struct vfio_pci_device *vdev)
 		return -EINVAL;
 	}
 
-	ret = vfio_pci_register_dev_region(vdev,
+	ret = vfio_pci_register_dev_region(&vdev->vpdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
 		VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG,
 		&vfio_pci_igd_cfg_regops, lpc_bridge->cfg_size,
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 80f0de332338..a682e2bc9175 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -39,9 +39,10 @@ struct vfio_pci_nvgpu_data {
 	struct notifier_block group_notifier;
 };
 
-static size_t vfio_pci_nvgpu_rw(struct vfio_pci_device *vdev,
+static size_t vfio_pci_nvgpu_rw(struct vfio_pci_core_device *vpdev,
 		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
 {
+	struct vfio_pci_device *vdev = vpdev_to_vdev(vpdev);
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
 	struct vfio_pci_nvgpu_data *data = vdev->region[i].data;
 	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
@@ -89,7 +90,7 @@ static size_t vfio_pci_nvgpu_rw(struct vfio_pci_device *vdev,
 	return count;
 }
 
-static void vfio_pci_nvgpu_release(struct vfio_pci_device *vdev,
+static void vfio_pci_nvgpu_release(struct vfio_pci_core_device *vpdev,
 		struct vfio_pci_region *region)
 {
 	struct vfio_pci_nvgpu_data *data = region->data;
@@ -136,7 +137,7 @@ static const struct vm_operations_struct vfio_pci_nvgpu_mmap_vmops = {
 	.fault = vfio_pci_nvgpu_mmap_fault,
 };
 
-static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev,
+static int vfio_pci_nvgpu_mmap(struct vfio_pci_core_device *vpdev,
 		struct vfio_pci_region *region, struct vm_area_struct *vma)
 {
 	int ret;
@@ -165,13 +166,13 @@ static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev,
 	ret = (int) mm_iommu_newdev(data->mm, data->useraddr,
 			vma_pages(vma), data->gpu_hpa, &data->mem);
 
-	trace_vfio_pci_nvgpu_mmap(vdev->vpdev.pdev, data->gpu_hpa, data->useraddr,
+	trace_vfio_pci_nvgpu_mmap(vpdev->pdev, data->gpu_hpa, data->useraddr,
 			vma->vm_end - vma->vm_start, ret);
 
 	return ret;
 }
 
-static int vfio_pci_nvgpu_add_capability(struct vfio_pci_device *vdev,
+static int vfio_pci_nvgpu_add_capability(struct vfio_pci_core_device *vpdev,
 		struct vfio_pci_region *region, struct vfio_info_cap *caps)
 {
 	struct vfio_pci_nvgpu_data *data = region->data;
@@ -275,7 +276,7 @@ int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
 	vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
 			&data->group_notifier);
 
-	ret = vfio_pci_register_dev_region(vdev,
+	ret = vfio_pci_register_dev_region(&vdev->vpdev,
 			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
 			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
 			&vfio_pci_nvgpu_regops,
@@ -304,9 +305,10 @@ struct vfio_pci_npu2_data {
 	unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */
 };
 
-static size_t vfio_pci_npu2_rw(struct vfio_pci_device *vdev,
+static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vpdev,
 		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
 {
+	struct vfio_pci_device *vdev = vpdev_to_vdev(vpdev);
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
 	struct vfio_pci_npu2_data *data = vdev->region[i].data;
 	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
@@ -328,9 +330,10 @@ static size_t vfio_pci_npu2_rw(struct vfio_pci_device *vdev,
 	return count;
 }
 
-static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev,
+static int vfio_pci_npu2_mmap(struct vfio_pci_core_device *vpdev,
 		struct vfio_pci_region *region, struct vm_area_struct *vma)
 {
+	struct vfio_pci_device *vdev = vpdev_to_vdev(vpdev);
 	int ret;
 	struct vfio_pci_npu2_data *data = region->data;
 	unsigned long req_len = vma->vm_end - vma->vm_start;
@@ -349,7 +352,7 @@ static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev,
 	return ret;
 }
 
-static void vfio_pci_npu2_release(struct vfio_pci_device *vdev,
+static void vfio_pci_npu2_release(struct vfio_pci_core_device *vpdev,
 		struct vfio_pci_region *region)
 {
 	struct vfio_pci_npu2_data *data = region->data;
@@ -358,7 +361,7 @@ static void vfio_pci_npu2_release(struct vfio_pci_device *vdev,
 	kfree(data);
 }
 
-static int vfio_pci_npu2_add_capability(struct vfio_pci_device *vdev,
+static int vfio_pci_npu2_add_capability(struct vfio_pci_core_device *vpdev,
 		struct vfio_pci_region *region, struct vfio_info_cap *caps)
 {
 	struct vfio_pci_npu2_data *data = region->data;
@@ -466,7 +469,7 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 	 * belong to VFIO regions and normally there will be ATSD register
 	 * assigned to the NVLink bridge.
 	 */
-	ret = vfio_pci_register_dev_region(vdev,
+	ret = vfio_pci_register_dev_region(&vdev->vpdev,
 			PCI_VENDOR_ID_IBM |
 			VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
 			VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 82de00508377..1c3bb809b5c0 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -55,29 +55,6 @@ struct vfio_pci_irq_ctx {
 };
 
 struct vfio_pci_device;
-struct vfio_pci_region;
-
-struct vfio_pci_regops {
-	size_t	(*rw)(struct vfio_pci_device *vdev, char __user *buf,
-		      size_t count, loff_t *ppos, bool iswrite);
-	void	(*release)(struct vfio_pci_device *vdev,
-			   struct vfio_pci_region *region);
-	int	(*mmap)(struct vfio_pci_device *vdev,
-			struct vfio_pci_region *region,
-			struct vm_area_struct *vma);
-	int	(*add_capability)(struct vfio_pci_device *vdev,
-				  struct vfio_pci_region *region,
-				  struct vfio_info_cap *caps);
-};
-
-struct vfio_pci_region {
-	u32				type;
-	u32				subtype;
-	const struct vfio_pci_regops	*ops;
-	void				*data;
-	size_t				size;
-	u32				flags;
-};
 
 struct vfio_pci_dummy_resource {
 	struct resource		resource;
@@ -178,11 +155,6 @@ extern void vfio_pci_uninit_perm_bits(void);
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
 
-extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
-					unsigned int type, unsigned int subtype,
-					const struct vfio_pci_regops *ops,
-					size_t size, u32 flags, void *data);
-
 extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
 				    pci_power_t state);
 
@@ -227,4 +199,10 @@ static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 }
 #endif
 
+static inline struct vfio_pci_device*
+vpdev_to_vdev(struct vfio_pci_core_device *vpdev)
+{
+	return container_of(vpdev, struct vfio_pci_device, vpdev);
+}
+
 #endif /* VFIO_PCI_PRIVATE_H */
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 4/9] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (2 preceding siblings ...)
  2021-02-01 16:28 ` [PATCH 3/9] vfio-pci-core: export vfio_pci_register_dev_region function Max Gurtovoy
@ 2021-02-01 16:28 ` Max Gurtovoy
  2021-02-01 16:28 ` [PATCH 5/9] vfio-pci/zdev: remove unused vdev argument Max Gurtovoy
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

This driver registers to both the PCI bus and the Auxiliary bus. If the
probe of both devices succeeds, we'll have a vendor specific VFIO PCI
device. mlx5_vfio_pci uses vfio_pci_core to register and create a VFIO
device and uses an auxiliary_device to get the needed extension from the
vendor device driver. If one of the probe() functions fails, the VFIO
char device will not be created. For now, the auxiliary_device is only
registered and bound to the pci_device when the auxiliary_device id
matches the pci_device BDF. Later, vendor specific features such as live
migration will be added and made available to the virtualization
software.
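
The mlx5_core side that creates the matching auxiliary device is not
part of this patch. A minimal sketch of what that registration might
look like, assuming the standard auxiliary bus API and that mlx5_core
sets the auxiliary device id to the PCI BDF via pci_dev_id() (the helper
name and error handling are illustrative only):

static int mlx5_core_register_vfio_pci_adev(struct mlx5_core_dev *mdev,
					    struct mlx5_vfio_pci_adev *mvadev)
{
	struct auxiliary_device *adev = &mvadev->madev.adev;
	int ret;

	mvadev->madev.mdev = mdev;
	adev->name = "vfio_pci";	/* matched as MLX5_ADEV_NAME ".vfio_pci" */
	adev->id = pci_dev_id(mdev->pdev); /* BDF compared in mlx5_vfio_pci_probe() */
	adev->dev.parent = mdev->device;

	ret = auxiliary_device_init(adev);
	if (ret)
		return ret;

	ret = auxiliary_device_add(adev);
	if (ret)
		auxiliary_device_uninit(adev);

	return ret;
}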

Note: Although we've created mlx5-vfio-pci.ko, binding to vfio-pci.ko
still works as before; it's fully backward compatible. Of course, the
extended vendor functionality will not be available if the device is
bound to the generic vfio-pci.ko.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Kconfig         |  10 ++
 drivers/vfio/pci/Makefile        |   3 +
 drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++++++++++++++++++++++++++++++
 include/linux/mlx5/vfio_pci.h    |  36 +++++
 4 files changed, 302 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
 create mode 100644 include/linux/mlx5/vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index b958a48f63a0..dcb164d7d641 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -65,3 +65,13 @@ config VFIO_PCI_ZDEV
 	  for zPCI devices passed through via VFIO on s390.
 
 	  Say Y here.
+
+config MLX5_VFIO_PCI
+	tristate "VFIO support for MLX5 PCI devices"
+	depends on VFIO_PCI_CORE && MLX5_CORE
+	select AUXILIARY_BUS
+	help
+	  This provides generic PCI support for MLX5 devices using the VFIO
+	  framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 3f2a27e222cd..9f67edca31c5 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@
 
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
@@ -9,3 +10,5 @@ vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
+
+mlx5-vfio-pci-y := mlx5_vfio_pci.o
diff --git a/drivers/vfio/pci/mlx5_vfio_pci.c b/drivers/vfio/pci/mlx5_vfio_pci.c
new file mode 100644
index 000000000000..4e6b256c74bf
--- /dev/null
+++ b/drivers/vfio/pci/mlx5_vfio_pci.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/mlx5/vfio_pci.h>
+
+#include "vfio_pci_core.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Max Gurtovoy <mgurtovoy@nvidia.com>"
+#define DRIVER_DESC     "MLX5 VFIO PCI - User Level meta-driver for NVIDIA MLX5 device family"
+
+/* 16k migration data size */
+#define MLX5_MIGRATION_REGION_DATA_SIZE	SZ_16K
+/* Data section offset from migration region */
+#define MLX5_MIGRATION_REGION_DATA_OFFSET  (sizeof(struct vfio_device_migration_info))
+
+struct mlx5_vfio_pci_migration_info {
+	struct vfio_device_migration_info mig;
+	char data[MLX5_MIGRATION_REGION_DATA_SIZE];
+};
+
+static LIST_HEAD(aux_devs_list);
+static DEFINE_MUTEX(aux_devs_lock);
+
+static struct mlx5_vfio_pci_adev *mlx5_vfio_pci_find_adev(struct pci_dev *pdev)
+{
+	struct mlx5_vfio_pci_adev *mvadev, *found = NULL;
+
+	mutex_lock(&aux_devs_lock);
+	list_for_each_entry(mvadev, &aux_devs_list, entry) {
+		if (mvadev->madev.adev.id == pci_dev_id(pdev)) {
+			found = mvadev;
+			break;
+		}
+	}
+	mutex_unlock(&aux_devs_lock);
+
+	return found;
+}
+
+static int mlx5_vfio_pci_aux_probe(struct auxiliary_device *adev,
+		const struct auxiliary_device_id *id)
+{
+	struct mlx5_vfio_pci_adev *mvadev;
+
+	mvadev = adev_to_mvadev(adev);
+
+	pr_info("%s aux probing bdf %02x:%02x.%d mdev is %s\n",
+		adev->name,
+		PCI_BUS_NUM(adev->id & 0xffff),
+		PCI_SLOT(adev->id & 0xff),
+		PCI_FUNC(adev->id & 0xff), dev_name(mvadev->madev.mdev->device));
+
+	mutex_lock(&aux_devs_lock);
+	list_add(&mvadev->entry, &aux_devs_list);
+	mutex_unlock(&aux_devs_lock);
+
+	return 0;
+}
+
+static void mlx5_vfio_pci_aux_remove(struct auxiliary_device *adev)
+{
+	struct mlx5_vfio_pci_adev *mvadev = adev_to_mvadev(adev);
+	struct vfio_pci_core_device *vpdev = dev_get_drvdata(&adev->dev);
+
+	/* TODO: is this the right thing to do ? maybe FLR ? */
+	if (vpdev)
+		pci_reset_function(vpdev->pdev);
+
+	mutex_lock(&aux_devs_lock);
+	list_del(&mvadev->entry);
+	mutex_unlock(&aux_devs_lock);
+}
+
+static const struct auxiliary_device_id mlx5_vfio_pci_aux_id_table[] = {
+	{ .name = MLX5_ADEV_NAME ".vfio_pci", },
+	{},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mlx5_vfio_pci_aux_id_table);
+
+static struct auxiliary_driver mlx5_vfio_pci_aux_driver = {
+	.name = "vfio_pci_ex",
+	.probe = mlx5_vfio_pci_aux_probe,
+	.remove = mlx5_vfio_pci_aux_remove,
+	.id_table = mlx5_vfio_pci_aux_id_table,
+};
+
+static void mlx5_vfio_pci_mig_release(struct vfio_pci_core_device *vpdev,
+		struct vfio_pci_region *region)
+{
+	kfree(region->data);
+}
+
+static size_t mlx5_vfio_pci_mig_rw(struct vfio_pci_core_device *vpdev,
+		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+	/* TODO: add all migration logic here */
+
+	return -EINVAL;
+}
+
+static struct vfio_pci_regops migration_ops = {
+	.rw = mlx5_vfio_pci_mig_rw,
+	.release = mlx5_vfio_pci_mig_release,
+};
+
+static int mlx5_vfio_pci_op_init(struct vfio_pci_core_device *vpdev)
+{
+	struct mlx5_vfio_pci_migration_info *vmig;
+	int ret;
+
+	vmig = kzalloc(sizeof(*vmig), GFP_KERNEL);
+	if (!vmig)
+		return -ENOMEM;
+
+	ret = vfio_pci_register_dev_region(vpdev,
+			VFIO_REGION_TYPE_MIGRATION,
+			VFIO_REGION_SUBTYPE_MIGRATION,
+			&migration_ops, sizeof(*vmig),
+			VFIO_REGION_INFO_FLAG_READ |
+			VFIO_REGION_INFO_FLAG_WRITE, vmig);
+	if (ret)
+		goto out_free;
+
+	return 0;
+
+out_free:
+	kfree(vmig);
+	return ret;
+}
+
+static const struct vfio_pci_device_ops mlx5_vfio_pci_ops = {
+	.name		= "mlx5-vfio-pci",
+	.module		= THIS_MODULE,
+	.init		= mlx5_vfio_pci_op_init,
+};
+
+static int mlx5_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct vfio_pci_core_device *vpdev;
+	struct mlx5_vfio_pci_adev *mvadev;
+
+	mvadev = mlx5_vfio_pci_find_adev(pdev);
+	if (!mvadev) {
+		pr_err("failed to find aux device for %s\n",
+		       dev_name(&pdev->dev));
+		return -ENODEV;
+	}
+
+	vpdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);
+	if (IS_ERR(vpdev))
+		return PTR_ERR(vpdev);
+
+	dev_set_drvdata(&mvadev->madev.adev.dev, vpdev);
+	return 0;
+}
+
+static void mlx5_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct mlx5_vfio_pci_adev *mvadev;
+
+	mvadev = mlx5_vfio_pci_find_adev(pdev);
+	if (mvadev)
+		dev_set_drvdata(&mvadev->madev.adev.dev, NULL);
+
+	vfio_destroy_pci_device(pdev);
+}
+
+#ifdef CONFIG_PCI_IOV
+static int mlx5_vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
+{
+	might_sleep();
+
+	/* DO vendor specific stuff here */
+
+	return vfio_pci_core_sriov_configure(pdev, nr_virtfn);
+}
+#endif
+
+static const struct pci_error_handlers mlx5_vfio_err_handlers = {
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
+static const struct pci_device_id mlx5_vfio_pci_table[] = {
+	{ PCI_VDEVICE(MELLANOX, 0x6001) }, /* NVMe SNAP controllers */
+	{ PCI_DEVICE_SUB(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042,
+			 PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID) }, /* Virtio SNAP controllers */
+	{ 0, }
+};
+
+static struct pci_driver mlx5_vfio_pci_driver = {
+	.name			= "mlx5-vfio-pci",
+	.id_table		= mlx5_vfio_pci_table,
+	.probe			= mlx5_vfio_pci_probe,
+	.remove			= mlx5_vfio_pci_remove,
+#ifdef CONFIG_PCI_IOV
+	.sriov_configure	= mlx5_vfio_pci_sriov_configure,
+#endif
+	.err_handler		= &mlx5_vfio_err_handlers,
+};
+
+static void __exit mlx5_vfio_pci_cleanup(void)
+{
+	auxiliary_driver_unregister(&mlx5_vfio_pci_aux_driver);
+	pci_unregister_driver(&mlx5_vfio_pci_driver);
+}
+
+static int __init mlx5_vfio_pci_init(void)
+{
+	int ret;
+
+	ret = pci_register_driver(&mlx5_vfio_pci_driver);
+	if (ret)
+		return ret;
+
+	ret = auxiliary_driver_register(&mlx5_vfio_pci_aux_driver);
+	if (ret)
+		goto out_unregister;
+
+	return 0;
+
+out_unregister:
+	pci_unregister_driver(&mlx5_vfio_pci_driver);
+	return ret;
+}
+
+module_init(mlx5_vfio_pci_init);
+module_exit(mlx5_vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/mlx5/vfio_pci.h b/include/linux/mlx5/vfio_pci.h
new file mode 100644
index 000000000000..c1e7b4d6da30
--- /dev/null
+++ b/include/linux/mlx5/vfio_pci.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2020 NVIDIA Corporation
+ */
+
+#ifndef _VFIO_PCI_H
+#define _VFIO_PCI_H
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/auxiliary_bus.h>
+#include <linux/mlx5/device.h>
+#include <linux/mlx5/driver.h>
+
+struct mlx5_vfio_pci_adev {
+	struct mlx5_adev	madev;
+
+	/* These fields should not be used outside mlx5_vfio_pci.ko */
+	struct list_head		entry;
+};
+
+static inline struct mlx5_vfio_pci_adev*
+madev_to_mvadev(struct mlx5_adev *madev)
+{
+	return container_of(madev, struct mlx5_vfio_pci_adev, madev);
+}
+
+static inline struct mlx5_vfio_pci_adev*
+adev_to_mvadev(struct auxiliary_device *adev)
+{
+	struct mlx5_adev *madev = container_of(adev, struct mlx5_adev, adev);
+
+	return madev_to_mvadev(madev);
+}
+
+#endif
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 5/9] vfio-pci/zdev: remove unused vdev argument
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (3 preceding siblings ...)
  2021-02-01 16:28 ` [PATCH 4/9] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices Max Gurtovoy
@ 2021-02-01 16:28 ` Max Gurtovoy
  2021-02-01 17:27   ` Matthew Rosato
  2021-02-02  7:57   ` Cornelia Huck
  2021-02-01 16:28 ` [PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue Max Gurtovoy
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

The zdev static functions do not use the vdev argument. Remove it.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_zdev.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 7b20b34b1034..175096fcd902 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -24,8 +24,7 @@
 /*
  * Add the Base PCI Function information to the device info region.
  */
-static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-			 struct vfio_info_cap *caps)
+static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_base cap = {
 		.header.id = VFIO_DEVICE_INFO_CAP_ZPCI_BASE,
@@ -45,8 +44,7 @@ static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add the Base PCI Function Group information to the device info region.
  */
-static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-			  struct vfio_info_cap *caps)
+static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_group cap = {
 		.header.id = VFIO_DEVICE_INFO_CAP_ZPCI_GROUP,
@@ -66,8 +64,7 @@ static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add the device utility string to the device info region.
  */
-static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-			 struct vfio_info_cap *caps)
+static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_util *cap;
 	int cap_size = sizeof(*cap) + CLP_UTIL_STR_LEN;
@@ -90,8 +87,7 @@ static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add the function path string to the device info region.
  */
-static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-			 struct vfio_info_cap *caps)
+static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_pfip *cap;
 	int cap_size = sizeof(*cap) + CLP_PFIP_NR_SEGMENTS;
@@ -123,21 +119,21 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 	if (!zdev)
 		return -ENODEV;
 
-	ret = zpci_base_cap(zdev, vdev, caps);
+	ret = zpci_base_cap(zdev, caps);
 	if (ret)
 		return ret;
 
-	ret = zpci_group_cap(zdev, vdev, caps);
+	ret = zpci_group_cap(zdev, caps);
 	if (ret)
 		return ret;
 
 	if (zdev->util_str_avail) {
-		ret = zpci_util_cap(zdev, vdev, caps);
+		ret = zpci_util_cap(zdev, caps);
 		if (ret)
 			return ret;
 	}
 
-	ret = zpci_pfip_cap(zdev, vdev, caps);
+	ret = zpci_pfip_cap(zdev, caps);
 
 	return ret;
 }
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (4 preceding siblings ...)
  2021-02-01 16:28 ` [PATCH 5/9] vfio-pci/zdev: remove unused vdev argument Max Gurtovoy
@ 2021-02-01 16:28 ` Max Gurtovoy
  2021-02-01 16:52   ` Cornelia Huck
  2021-02-01 16:28 ` [PATCH 7/9] vfio/pci: use s390 naming instead of zdev Max Gurtovoy
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

If the allocation fails, we must handle it correctly and return an error.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_zdev.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 175096fcd902..e9ef4239ef7a 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -71,6 +71,8 @@ static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 	int ret;
 
 	cap = kmalloc(cap_size, GFP_KERNEL);
+	if (!cap)
+		return -ENOMEM;
 
 	cap->header.id = VFIO_DEVICE_INFO_CAP_ZPCI_UTIL;
 	cap->header.version = 1;
@@ -94,6 +96,8 @@ static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 	int ret;
 
 	cap = kmalloc(cap_size, GFP_KERNEL);
+	if (!cap)
+		return -ENOMEM;
 
 	cap->header.id = VFIO_DEVICE_INFO_CAP_ZPCI_PFIP;
 	cap->header.version = 1;
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 7/9] vfio/pci: use s390 naming instead of zdev
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (5 preceding siblings ...)
  2021-02-01 16:28 ` [PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue Max Gurtovoy
@ 2021-02-01 16:28 ` Max Gurtovoy
  2021-02-01 16:28 ` [PATCH 8/9] vfio/pci: use x86 naming instead of igd Max Gurtovoy
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

This patch doesn't change any logic; it only aligns the code with the
concept of vfio_pci_core extensions. Extensions that are related to a
platform, and not to a specific vendor of PCI devices, should be part
of the core driver. Extensions that are specific to a PCI device vendor
should go into a dedicated vendor vfio-pci driver.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Kconfig                              | 4 ++--
 drivers/vfio/pci/Makefile                             | 2 +-
 drivers/vfio/pci/vfio_pci_core.c                      | 2 +-
 drivers/vfio/pci/vfio_pci_private.h                   | 2 +-
 drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} (98%)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index dcb164d7d641..6e6c976b9512 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -54,8 +54,8 @@ config VFIO_PCI_NVLINK2
 	help
 	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
 
-config VFIO_PCI_ZDEV
-	bool "VFIO PCI ZPCI device CLP support"
+config VFIO_PCI_S390
+	bool "VFIO PCI extensions for S390 platform"
 	depends on VFIO_PCI_CORE && S390
 	default y
 	help
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 9f67edca31c5..90962a495eba 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_S390) += vfio_pci_s390.o
 
 vfio-pci-y := vfio_pci.o
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a0a91331f575..c559027def2d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -785,7 +785,7 @@ static long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
 		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
 		info.num_irqs = VFIO_PCI_NUM_IRQS;
 
-		if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) {
+		if (IS_ENABLED(CONFIG_VFIO_PCI_S390)) {
 			int ret = vfio_pci_info_zdev_add_caps(vdev, &caps);
 
 			if (ret && ret != -ENODEV) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 1c3bb809b5c0..7c3c2cd575f8 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -188,7 +188,7 @@ static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 }
 #endif
 
-#ifdef CONFIG_VFIO_PCI_ZDEV
+#ifdef CONFIG_VFIO_PCI_S390
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 				       struct vfio_info_cap *caps);
 #else
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_s390.c
similarity index 98%
rename from drivers/vfio/pci/vfio_pci_zdev.c
rename to drivers/vfio/pci/vfio_pci_s390.c
index e9ef4239ef7a..b6421a7f767f 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_s390.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0+
 /*
- * VFIO ZPCI devices support
+ * VFIO PCI support for S390 platform (a.k.a ZPCI devices)
  *
  * Copyright (C) IBM Corp. 2020.  All rights reserved.
  *	Author(s): Pierre Morel <pmorel@linux.ibm.com>
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (6 preceding siblings ...)
  2021-02-01 16:28 ` [PATCH 7/9] vfio/pci: use s390 naming instead of zdev Max Gurtovoy
@ 2021-02-01 16:28 ` Max Gurtovoy
  2021-02-01 17:14   ` Cornelia Huck
  2021-02-01 16:28 ` [PATCH 9/9] vfio/pci: use powernv naming instead of nvlink2 Max Gurtovoy
  2021-02-10  7:52 ` [PATCH v2 0/9] Introduce vfio-pci-core subsystem Tian, Kevin
  9 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

This patch doesn't change any logic; it only aligns the code with the
concept of vfio_pci_core extensions. Extensions that are related to a
platform, and not to a specific vendor of PCI devices, should be part
of the core driver. Extensions that are specific to a PCI device vendor
should go into a dedicated vendor vfio-pci driver.

For now, the x86 extensions include only igd.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Kconfig                            | 13 ++++++-------
 drivers/vfio/pci/Makefile                           |  2 +-
 drivers/vfio/pci/vfio_pci_core.c                    |  2 +-
 drivers/vfio/pci/vfio_pci_private.h                 |  2 +-
 drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
 5 files changed, 9 insertions(+), 10 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 6e6c976b9512..c98f2df01a60 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -36,17 +36,16 @@ config VFIO_PCI_INTX
 	depends on VFIO_PCI_CORE
 	def_bool y if !S390
 
-config VFIO_PCI_IGD
-	bool "VFIO PCI extensions for Intel graphics (GVT-d)"
+config VFIO_PCI_X86
+	bool "VFIO PCI extensions for Intel X86 platform"
 	depends on VFIO_PCI_CORE && X86
 	default y
 	help
-	  Support for Intel IGD specific extensions to enable direct
-	  assignment to virtual machines.  This includes exposing an IGD
-	  specific firmware table and read-only copies of the host bridge
-	  and LPC bridge config space.
+	  Support for Intel X86 specific extensions for VFIO PCI devices.
+	  This includes exposing an IGD specific firmware table and
+	  read-only copies of the host bridge and LPC bridge config space.
 
-	  To enable Intel IGD assignment through vfio-pci, say Y.
+	  To enable Intel X86 extensions for vfio-pci-core, say Y.
 
 config VFIO_PCI_NVLINK2
 	def_bool y
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 90962a495eba..d8ccb70e015a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -5,7 +5,7 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_X86) += vfio_pci_x86.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_s390.o
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c559027def2d..e0e258c37fb5 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -328,7 +328,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 
 	if (vfio_pci_is_vga(pdev) &&
 	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
+	    IS_ENABLED(CONFIG_VFIO_PCI_X86)) {
 		ret = vfio_pci_igd_init(vdev);
 		if (ret && ret != -ENODEV) {
 			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 7c3c2cd575f8..efc688525784 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -165,7 +165,7 @@ extern u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev);
 extern void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev,
 					       u16 cmd);
 
-#ifdef CONFIG_VFIO_PCI_IGD
+#ifdef CONFIG_VFIO_PCI_X86
 extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
 #else
 static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_x86.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci_igd.c
rename to drivers/vfio/pci/vfio_pci_x86.c
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 9/9] vfio/pci: use powernv naming instead of nvlink2
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (7 preceding siblings ...)
  2021-02-01 16:28 ` [PATCH 8/9] vfio/pci: use x86 naming instead of igd Max Gurtovoy
@ 2021-02-01 16:28 ` Max Gurtovoy
  2021-02-01 18:35   ` Jason Gunthorpe
  2021-02-10  7:52 ` [PATCH v2 0/9] Introduce vfio-pci-core subsystem Tian, Kevin
  9 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-01 16:28 UTC (permalink / raw)
  To: jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik,
	Max Gurtovoy

This patch doesn't change any logic; it only aligns the code with the
concept of vfio_pci_core extensions. Extensions that are related to a
platform and not to a specific vendor of PCI devices should be part of
the core driver. Extensions that are specific to a PCI device vendor
should go into a dedicated vendor vfio-pci driver.

For now, powernv extensions will include only nvlink2.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Kconfig                                    | 6 ++++--
 drivers/vfio/pci/Makefile                                   | 2 +-
 drivers/vfio/pci/vfio_pci_core.c                            | 4 ++--
 drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} | 0
 drivers/vfio/pci/vfio_pci_private.h                         | 2 +-
 5 files changed, 8 insertions(+), 6 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} (100%)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index c98f2df01a60..fe0264b3d02f 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -47,11 +47,13 @@ config VFIO_PCI_X86
 
 	  To enable Intel X86 extensions for vfio-pci-core, say Y.
 
-config VFIO_PCI_NVLINK2
+config VFIO_PCI_POWERNV
 	def_bool y
 	depends on VFIO_PCI_CORE && PPC_POWERNV
 	help
-	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+	  VFIO PCI extensions for IBM PowerNV (Non-Virtualized) platform
+
+	  To enable POWERNV extensions for vfio-pci-core, say Y.
 
 config VFIO_PCI_S390
 	bool "VFIO PCI extensions for S390 platform"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index d8ccb70e015a..442b7c78de4c 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_X86) += vfio_pci_x86.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_POWERNV) += vfio_pci_powernv.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_s390.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index e0e258c37fb5..90cc728fffc7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -337,7 +337,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	}
 
 	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+	    IS_ENABLED(CONFIG_VFIO_PCI_POWERNV)) {
 		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
 		if (ret && ret != -ENODEV) {
 			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
@@ -346,7 +346,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	}
 
 	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+	    IS_ENABLED(CONFIG_VFIO_PCI_POWERNV)) {
 		ret = vfio_pci_ibm_npu2_init(vdev);
 		if (ret && ret != -ENODEV) {
 			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_powernv.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci_nvlink2.c
rename to drivers/vfio/pci/vfio_pci_powernv.c
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index efc688525784..dc6a9191a704 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -173,7 +173,7 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
 	return -ENODEV;
 }
 #endif
-#ifdef CONFIG_VFIO_PCI_NVLINK2
+#ifdef CONFIG_VFIO_PCI_POWERNV
 extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
 extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
 #else
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue
  2021-02-01 16:28 ` [PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue Max Gurtovoy
@ 2021-02-01 16:52   ` Cornelia Huck
  2021-02-01 17:08     ` Matthew Rosato
  0 siblings, 1 reply; 57+ messages in thread
From: Cornelia Huck @ 2021-02-01 16:52 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: jgg, kvm, linux-kernel, alex.williamson, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, mjrosato, yishaih, aik

On Mon, 1 Feb 2021 16:28:25 +0000
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> In case allocation fails, we must behave correctly and exit with error.
> 
> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>

Fixes: e6b817d4b821 ("vfio-pci/zdev: Add zPCI capabilities to VFIO_DEVICE_GET_INFO")

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

I think this should go in independently of this series.

> ---
>  drivers/vfio/pci/vfio_pci_zdev.c | 4 ++++
>  1 file changed, 4 insertions(+)


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue
  2021-02-01 16:52   ` Cornelia Huck
@ 2021-02-01 17:08     ` Matthew Rosato
  2021-02-01 20:47       ` Alex Williamson
  0 siblings, 1 reply; 57+ messages in thread
From: Matthew Rosato @ 2021-02-01 17:08 UTC (permalink / raw)
  To: Cornelia Huck, Max Gurtovoy
  Cc: jgg, kvm, linux-kernel, alex.williamson, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, yishaih, aik

On 2/1/21 11:52 AM, Cornelia Huck wrote:
> On Mon, 1 Feb 2021 16:28:25 +0000
> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> 
>> In case allocation fails, we must behave correctly and exit with error.
>>
>> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> 
> Fixes: e6b817d4b821 ("vfio-pci/zdev: Add zPCI capabilities to VFIO_DEVICE_GET_INFO")
> 
> Reviewed-by: Cornelia Huck <cohuck@redhat.com>
> 
> I think this should go in independently of this series.

Agreed, makes sense to me -- thanks for finding.

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>

>> ---
>>   drivers/vfio/pci/vfio_pci_zdev.c | 4 ++++
>>   1 file changed, 4 insertions(+)
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-01 16:28 ` [PATCH 8/9] vfio/pci: use x86 naming instead of igd Max Gurtovoy
@ 2021-02-01 17:14   ` Cornelia Huck
  2021-02-01 17:49     ` Matthew Rosato
  0 siblings, 1 reply; 57+ messages in thread
From: Cornelia Huck @ 2021-02-01 17:14 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: jgg, kvm, linux-kernel, alex.williamson, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, mjrosato, yishaih, aik

On Mon, 1 Feb 2021 16:28:27 +0000
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> This patch doesn't change any logic but only align to the concept of
> vfio_pci_core extensions. Extensions that are related to a platform
> and not to a specific vendor of PCI devices should be part of the core
> driver. Extensions that are specific for PCI device vendor should go
> to a dedicated vendor vfio-pci driver.

My understanding is that igd means support for Intel graphics, i.e. a
strict subset of x86. If there are other future extensions that e.g.
only make sense for some devices found only on AMD systems, I don't
think they should all be included under the same x86 umbrella.

Similar reasoning for nvlink, that only seems to cover support for some
GPUs under Power, and is not a general platform-specific extension IIUC.

We can arguably do the zdev -> s390 rename (as zpci appears only on
s390, and all PCI devices will be zpci on that platform), although I'm
not sure about the benefit.

> 
> For now, x86 extensions will include only igd.
> 
> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> ---
>  drivers/vfio/pci/Kconfig                            | 13 ++++++-------
>  drivers/vfio/pci/Makefile                           |  2 +-
>  drivers/vfio/pci/vfio_pci_core.c                    |  2 +-
>  drivers/vfio/pci/vfio_pci_private.h                 |  2 +-
>  drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
>  5 files changed, 9 insertions(+), 10 deletions(-)
>  rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)

(...)

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index c559027def2d..e0e258c37fb5 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -328,7 +328,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  
>  	if (vfio_pci_is_vga(pdev) &&
>  	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
> -	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
> +	    IS_ENABLED(CONFIG_VFIO_PCI_X86)) {
>  		ret = vfio_pci_igd_init(vdev);

This one explicitly checks for Intel devices, so I'm not sure why you
want to generalize this to x86?

>  		if (ret && ret != -ENODEV) {
>  			pci_warn(pdev, "Failed to setup Intel IGD regions\n");


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/9] vfio-pci/zdev: remove unused vdev argument
  2021-02-01 16:28 ` [PATCH 5/9] vfio-pci/zdev: remove unused vdev argument Max Gurtovoy
@ 2021-02-01 17:27   ` Matthew Rosato
  2021-02-02  7:57   ` Cornelia Huck
  1 sibling, 0 replies; 57+ messages in thread
From: Matthew Rosato @ 2021-02-01 17:27 UTC (permalink / raw)
  To: Max Gurtovoy, jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On 2/1/21 11:28 AM, Max Gurtovoy wrote:
> Zdev static functions does not use vdev argument. Remove it.
> 
> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>

Huh.  I must have dropped the use of vdev somewhere during review 
versions.  Thanks!

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>

@Alex/@Connie This one is just a cleanup and could also go separately 
from this set if it makes sense.

> ---
>   drivers/vfio/pci/vfio_pci_zdev.c | 20 ++++++++------------
>   1 file changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index 7b20b34b1034..175096fcd902 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -24,8 +24,7 @@
>   /*
>    * Add the Base PCI Function information to the device info region.
>    */
> -static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
> -			 struct vfio_info_cap *caps)
> +static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
>   {
>   	struct vfio_device_info_cap_zpci_base cap = {
>   		.header.id = VFIO_DEVICE_INFO_CAP_ZPCI_BASE,
> @@ -45,8 +44,7 @@ static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
>   /*
>    * Add the Base PCI Function Group information to the device info region.
>    */
> -static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
> -			  struct vfio_info_cap *caps)
> +static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
>   {
>   	struct vfio_device_info_cap_zpci_group cap = {
>   		.header.id = VFIO_DEVICE_INFO_CAP_ZPCI_GROUP,
> @@ -66,8 +64,7 @@ static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
>   /*
>    * Add the device utility string to the device info region.
>    */
> -static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
> -			 struct vfio_info_cap *caps)
> +static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
>   {
>   	struct vfio_device_info_cap_zpci_util *cap;
>   	int cap_size = sizeof(*cap) + CLP_UTIL_STR_LEN;
> @@ -90,8 +87,7 @@ static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
>   /*
>    * Add the function path string to the device info region.
>    */
> -static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
> -			 struct vfio_info_cap *caps)
> +static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
>   {
>   	struct vfio_device_info_cap_zpci_pfip *cap;
>   	int cap_size = sizeof(*cap) + CLP_PFIP_NR_SEGMENTS;
> @@ -123,21 +119,21 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
>   	if (!zdev)
>   		return -ENODEV;
>   
> -	ret = zpci_base_cap(zdev, vdev, caps);
> +	ret = zpci_base_cap(zdev, caps);
>   	if (ret)
>   		return ret;
>   
> -	ret = zpci_group_cap(zdev, vdev, caps);
> +	ret = zpci_group_cap(zdev, caps);
>   	if (ret)
>   		return ret;
>   
>   	if (zdev->util_str_avail) {
> -		ret = zpci_util_cap(zdev, vdev, caps);
> +		ret = zpci_util_cap(zdev, caps);
>   		if (ret)
>   			return ret;
>   	}
>   
> -	ret = zpci_pfip_cap(zdev, vdev, caps);
> +	ret = zpci_pfip_cap(zdev, caps);
>   
>   	return ret;
>   }
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-01 17:14   ` Cornelia Huck
@ 2021-02-01 17:49     ` Matthew Rosato
  2021-02-01 18:42       ` Alex Williamson
  0 siblings, 1 reply; 57+ messages in thread
From: Matthew Rosato @ 2021-02-01 17:49 UTC (permalink / raw)
  To: Cornelia Huck, Max Gurtovoy
  Cc: jgg, kvm, linux-kernel, alex.williamson, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, yishaih, aik

On 2/1/21 12:14 PM, Cornelia Huck wrote:
> On Mon, 1 Feb 2021 16:28:27 +0000
> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> 
>> This patch doesn't change any logic but only align to the concept of
>> vfio_pci_core extensions. Extensions that are related to a platform
>> and not to a specific vendor of PCI devices should be part of the core
>> driver. Extensions that are specific for PCI device vendor should go
>> to a dedicated vendor vfio-pci driver.
> 
> My understanding is that igd means support for Intel graphics, i.e. a
> strict subset of x86. If there are other future extensions that e.g.
> only make sense for some devices found only on AMD systems, I don't
> think they should all be included under the same x86 umbrella.
> 
> Similar reasoning for nvlink, that only seems to cover support for some
> GPUs under Power, and is not a general platform-specific extension IIUC.
> 
> We can arguably do the zdev -> s390 rename (as zpci appears only on
> s390, and all PCI devices will be zpci on that platform), although I'm
> not sure about the benefit.

As far as I can tell, there isn't any benefit for s390; it's just 
"re-branding" to match the platform name rather than the zdev moniker, 
which admittedly perhaps makes it more clear to someone outside of s390 
that any PCI device on s390 is a zdev/zpci type, and thus will use this 
extension to vfio_pci(_core).  This would still be true even if we added 
something later that builds atop it (e.g. a platform-specific device 
like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses 
these zdev extensions today and would need to continue using them in a 
world where mlx5-vfio-pci.ko exists.

I guess all that to say: if such a rename matches the 'grand scheme' of 
this design where we treat arch-level extensions to vfio_pci(_core) as 
"vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But 
by itself it's not very exciting :)

> 
>>
>> For now, x86 extensions will include only igd.
>>
>> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
>> ---
>>   drivers/vfio/pci/Kconfig                            | 13 ++++++-------
>>   drivers/vfio/pci/Makefile                           |  2 +-
>>   drivers/vfio/pci/vfio_pci_core.c                    |  2 +-
>>   drivers/vfio/pci/vfio_pci_private.h                 |  2 +-
>>   drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
>>   5 files changed, 9 insertions(+), 10 deletions(-)
>>   rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)
> 
> (...)
> 
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index c559027def2d..e0e258c37fb5 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -328,7 +328,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>   
>>   	if (vfio_pci_is_vga(pdev) &&
>>   	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
>> -	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
>> +	    IS_ENABLED(CONFIG_VFIO_PCI_X86)) {
>>   		ret = vfio_pci_igd_init(vdev);
> 
> This one explicitly checks for Intel devices, so I'm not sure why you
> want to generalize this to x86?
> 
>>   		if (ret && ret != -ENODEV) {
>>   			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 9/9] vfio/pci: use powernv naming instead of nvlink2
  2021-02-01 16:28 ` [PATCH 9/9] vfio/pci: use powernv naming instead of nvlink2 Max Gurtovoy
@ 2021-02-01 18:35   ` Jason Gunthorpe
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-01 18:35 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: cohuck, kvm, linux-kernel, alex.williamson, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, mjrosato, yishaih, aik

On Mon, Feb 01, 2021 at 04:28:28PM +0000, Max Gurtovoy wrote:
> This patch doesn't change any logic but only align to the concept of
> vfio_pci_core extensions. Extensions that are related to a platform
> and not to a specific vendor of PCI devices should be part of the
> core driver. Extensions that are specific for PCI device vendor should go
> to a dedicated vendor vfio-pci driver.
> 
> For now, powernv extensions will include only nvlink2.
> 
> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
>  drivers/vfio/pci/Kconfig                                    | 6 ++++--
>  drivers/vfio/pci/Makefile                                   | 2 +-
>  drivers/vfio/pci/vfio_pci_core.c                            | 4 ++--
>  drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} | 0
>  drivers/vfio/pci/vfio_pci_private.h                         | 2 +-
>  5 files changed, 8 insertions(+), 6 deletions(-)
>  rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} (100%)

This really has nothing to do with PPC. "nvlink" is a PCI device that
shows the entire GPU memory space on these special Power systems, and
this driver changes the normal vfio-pci behavior to match that single
device.

This is probably the best existing example of something that could be
a vendor PCI driver because of how single-device specific it really
is.

Read 7f92891778dff62303c070ac81de7b7d80de331a to get some sense of how
very special a device it is.

This could be like mlx5, with the single PCI ID pre-populated in a
match table.

That is probably the key test for vfio_pci_core vs vfio_pci - if the
modification is triggered by a single PCI ID that can be matched it is
vfio_pci side, not core.  Compared to the s390 stuff which applies to
all PCI devices in the system.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-01 17:49     ` Matthew Rosato
@ 2021-02-01 18:42       ` Alex Williamson
  2021-02-02 16:06         ` Cornelia Huck
  0 siblings, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-02-01 18:42 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Cornelia Huck, Max Gurtovoy, jgg, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Mon, 1 Feb 2021 12:49:12 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 2/1/21 12:14 PM, Cornelia Huck wrote:
> > On Mon, 1 Feb 2021 16:28:27 +0000
> > Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >   
> >> This patch doesn't change any logic but only align to the concept of
> >> vfio_pci_core extensions. Extensions that are related to a platform
> >> and not to a specific vendor of PCI devices should be part of the core
> >> driver. Extensions that are specific for PCI device vendor should go
> >> to a dedicated vendor vfio-pci driver.  
> > 
> > My understanding is that igd means support for Intel graphics, i.e. a
> > strict subset of x86. If there are other future extensions that e.g.
> > only make sense for some devices found only on AMD systems, I don't
> > think they should all be included under the same x86 umbrella.
> > 
> > Similar reasoning for nvlink, that only seems to cover support for some
> > GPUs under Power, and is not a general platform-specific extension IIUC.
> > 
> > We can arguably do the zdev -> s390 rename (as zpci appears only on
> > s390, and all PCI devices will be zpci on that platform), although I'm
> > not sure about the benefit.  
> 
> As far as I can tell, there isn't any benefit for s390 it's just 
> "re-branding" to match the platform name rather than the zdev moniker, 
> which admittedly perhaps makes it more clear to someone outside of s390 
> that any PCI device on s390 is a zdev/zpci type, and thus will use this 
> extension to vfio_pci(_core).  This would still be true even if we added 
> something later that builds atop it (e.g. a platform-specific device 
> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses 
> these zdev extensions today and would need to continue using them in a 
> world where mlx5-vfio-pci.ko exists.
> 
> I guess all that to say: if such a rename matches the 'grand scheme' of 
> this design where we treat arch-level extensions to vfio_pci(_core) as 
> "vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But 
> by itself it's not very exciting :)

This all seems like the wrong direction to me.  The goal here is to
modularize vfio-pci into a core library and derived vendor modules that
make use of that core library.  If existing device specific extensions
within vfio-pci cannot be turned into vendor modules through this
support and are instead redefined as platform specific features of the
new core library, that feels like we're already admitting failure of
this core library to support known devices, let alone future devices.

IGD is a specific set of devices.  They happen to rely on some platform
specific support, whose availability should be determined via the
vendor module probe callback.  Packing that support into an "x86"
component as part of the core feels not only short sighted, but also
avoids addressing the issues around how userspace determines an optimal
module to use for a device.  Thanks,

Alex

> >> For now, x86 extensions will include only igd.
> >>
> >> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> >> ---
> >>   drivers/vfio/pci/Kconfig                            | 13 ++++++-------
> >>   drivers/vfio/pci/Makefile                           |  2 +-
> >>   drivers/vfio/pci/vfio_pci_core.c                    |  2 +-
> >>   drivers/vfio/pci/vfio_pci_private.h                 |  2 +-
> >>   drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
> >>   5 files changed, 9 insertions(+), 10 deletions(-)
> >>   rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)  
> > 
> > (...)
> >   
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index c559027def2d..e0e258c37fb5 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -328,7 +328,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>   
> >>   	if (vfio_pci_is_vga(pdev) &&
> >>   	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
> >> -	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
> >> +	    IS_ENABLED(CONFIG_VFIO_PCI_X86)) {
> >>   		ret = vfio_pci_igd_init(vdev);  
> > 
> > This one explicitly checks for Intel devices, so I'm not sure why you
> > want to generalize this to x86?
> >   
> >>   		if (ret && ret != -ENODEV) {
> >>   			pci_warn(pdev, "Failed to setup Intel IGD regions\n");  
> >   
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue
  2021-02-01 17:08     ` Matthew Rosato
@ 2021-02-01 20:47       ` Alex Williamson
  2021-02-02  7:58         ` Cornelia Huck
  0 siblings, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-02-01 20:47 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Cornelia Huck, Max Gurtovoy, jgg, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Mon, 1 Feb 2021 12:08:45 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 2/1/21 11:52 AM, Cornelia Huck wrote:
> > On Mon, 1 Feb 2021 16:28:25 +0000
> > Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >   
> >> In case allocation fails, we must behave correctly and exit with error.
> >>
> >> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>  
> > 
> > Fixes: e6b817d4b821 ("vfio-pci/zdev: Add zPCI capabilities to VFIO_DEVICE_GET_INFO")
> > 
> > Reviewed-by: Cornelia Huck <cohuck@redhat.com>
> > 
> > I think this should go in independently of this series.
> 
> Agreed, makes sense to me -- thanks for finding.
> 
> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>

I can grab this one, and 5/9.  Connie do you want to toss an R-b at
5/9?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/9] vfio-pci/zdev: remove unused vdev argument
  2021-02-01 16:28 ` [PATCH 5/9] vfio-pci/zdev: remove unused vdev argument Max Gurtovoy
  2021-02-01 17:27   ` Matthew Rosato
@ 2021-02-02  7:57   ` Cornelia Huck
  2021-02-02 17:21     ` Alex Williamson
  1 sibling, 1 reply; 57+ messages in thread
From: Cornelia Huck @ 2021-02-02  7:57 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: jgg, kvm, linux-kernel, alex.williamson, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, mjrosato, yishaih, aik

On Mon, 1 Feb 2021 16:28:24 +0000
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> Zdev static functions does not use vdev argument. Remove it.

s/does not use/do not use the/

> 
> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_zdev.c | 20 ++++++++------------
>  1 file changed, 8 insertions(+), 12 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue
  2021-02-01 20:47       ` Alex Williamson
@ 2021-02-02  7:58         ` Cornelia Huck
  0 siblings, 0 replies; 57+ messages in thread
From: Cornelia Huck @ 2021-02-02  7:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Matthew Rosato, Max Gurtovoy, jgg, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Mon, 1 Feb 2021 13:47:57 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 1 Feb 2021 12:08:45 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
> > On 2/1/21 11:52 AM, Cornelia Huck wrote:  
> > > On Mon, 1 Feb 2021 16:28:25 +0000
> > > Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > >     
> > >> In case allocation fails, we must behave correctly and exit with error.
> > >>
> > >> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>    
> > > 
> > > Fixes: e6b817d4b821 ("vfio-pci/zdev: Add zPCI capabilities to VFIO_DEVICE_GET_INFO")
> > > 
> > > Reviewed-by: Cornelia Huck <cohuck@redhat.com>
> > > 
> > > I think this should go in independently of this series.
> > 
> > Agreed, makes sense to me -- thanks for finding.
> > 
> > Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>  
> 
> I can grab this one, and 5/9.  Connie do you want to toss an R-b at
> 5/9?  Thanks,
> 
> Alex

Yes, makes sense to grab these two. R-b added.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-01 18:42       ` Alex Williamson
@ 2021-02-02 16:06         ` Cornelia Huck
  2021-02-02 17:10           ` Jason Gunthorpe
  2021-02-02 17:41           ` Max Gurtovoy
  0 siblings, 2 replies; 57+ messages in thread
From: Cornelia Huck @ 2021-02-02 16:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Matthew Rosato, Max Gurtovoy, jgg, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Mon, 1 Feb 2021 11:42:30 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 1 Feb 2021 12:49:12 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
> > On 2/1/21 12:14 PM, Cornelia Huck wrote:  
> > > On Mon, 1 Feb 2021 16:28:27 +0000
> > > Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > >     
> > >> This patch doesn't change any logic but only align to the concept of
> > >> vfio_pci_core extensions. Extensions that are related to a platform
> > >> and not to a specific vendor of PCI devices should be part of the core
> > >> driver. Extensions that are specific for PCI device vendor should go
> > >> to a dedicated vendor vfio-pci driver.    
> > > 
> > > My understanding is that igd means support for Intel graphics, i.e. a
> > > strict subset of x86. If there are other future extensions that e.g.
> > > only make sense for some devices found only on AMD systems, I don't
> > > think they should all be included under the same x86 umbrella.
> > > 
> > > Similar reasoning for nvlink, that only seems to cover support for some
> > > GPUs under Power, and is not a general platform-specific extension IIUC.
> > > 
> > > We can arguably do the zdev -> s390 rename (as zpci appears only on
> > > s390, and all PCI devices will be zpci on that platform), although I'm
> > > not sure about the benefit.    
> > 
> > As far as I can tell, there isn't any benefit for s390 it's just 
> > "re-branding" to match the platform name rather than the zdev moniker, 
> > which admittedly perhaps makes it more clear to someone outside of s390 
> > that any PCI device on s390 is a zdev/zpci type, and thus will use this 
> > extension to vfio_pci(_core).  This would still be true even if we added 
> > something later that builds atop it (e.g. a platform-specific device 
> > like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses 
> > these zdev extensions today and would need to continue using them in a 
> > world where mlx5-vfio-pci.ko exists.
> > 
> > I guess all that to say: if such a rename matches the 'grand scheme' of 
> > this design where we treat arch-level extensions to vfio_pci(_core) as 
> > "vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But 
> > by itself it's not very exciting :)  
> 
> This all seems like the wrong direction to me.  The goal here is to
> modularize vfio-pci into a core library and derived vendor modules that
> make use of that core library.  If existing device specific extensions
> within vfio-pci cannot be turned into vendor modules through this
> support and are instead redefined as platform specific features of the
> new core library, that feels like we're already admitting failure of
> this core library to support known devices, let alone future devices.
> 
> IGD is a specific set of devices.  They happen to rely on some platform
> specific support, whose availability should be determined via the
> vendor module probe callback.  Packing that support into an "x86"
> component as part of the core feels not only short sighted, but also
> avoids addressing the issues around how userspace determines an optimal
> module to use for a device.

Hm, it seems that not all current extensions to the vfio-pci code are
created equal.

IIUC, we have igd and nvlink, which are sets of devices that only show
up on x86 or ppc, respectively, and may rely on some special features
of those architectures/platforms. The important point is that you have
a device identifier that you can match a driver against.

On the other side, we have the zdev support, which both requires s390
and applies to any pci device on s390. If we added special handling for
ISM on s390, ISM would be in a category similar to igd/nvlink.

Now, if somebody plugs a mlx5 device into an s390, we would want both
the zdev support and the specialized mlx5 driver. Maybe zdev should be
some kind of library that can be linked into normal vfio-pci, into
vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on
s390 (if configured into the kernel).
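
Something like the following could work as the shared entry point from
any of those modules (a rough sketch only; my_vfio_pci_add_arch_caps()
is a made-up name, vfio_pci_info_zdev_add_caps() is the existing
helper):

static int my_vfio_pci_add_arch_caps(struct vfio_pci_device *vdev,
				     struct vfio_info_cap *caps)
{
	int ret = 0;

	/* the zdev library fills in the s390 capability chain entries */
	if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV))
		ret = vfio_pci_info_zdev_add_caps(vdev, caps);

	/* nothing to add off s390 (or when no zdev is present) */
	return (ret && ret != -ENODEV) ? ret : 0;
}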


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 16:06         ` Cornelia Huck
@ 2021-02-02 17:10           ` Jason Gunthorpe
  2021-02-11 15:47             ` Max Gurtovoy
  2021-02-02 17:41           ` Max Gurtovoy
  1 sibling, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-02 17:10 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Matthew Rosato, Max Gurtovoy, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:

> On the other side, we have the zdev support, which both requires s390
> and applies to any pci device on s390.

Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
return the s390 specific data in VFIO_DEVICE_GET_INFO if running on
s390?

It would be like returning data from ACPI on other platforms.

It really seems like part of vfio-pci-core
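
i.e. the core VFIO_DEVICE_GET_INFO path could do something like this
(sketch only, reusing the existing vfio_pci_info_zdev_add_caps()):

	if (IS_ENABLED(CONFIG_S390)) {
		ret = vfio_pci_info_zdev_add_caps(vdev, &caps);
		if (ret && ret != -ENODEV)
			return ret;
	}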

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/9] vfio-pci/zdev: remove unused vdev argument
  2021-02-02  7:57   ` Cornelia Huck
@ 2021-02-02 17:21     ` Alex Williamson
  0 siblings, 0 replies; 57+ messages in thread
From: Alex Williamson @ 2021-02-02 17:21 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Max Gurtovoy, jgg, kvm, linux-kernel, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, mjrosato, yishaih, aik

On Tue, 2 Feb 2021 08:57:55 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Mon, 1 Feb 2021 16:28:24 +0000
> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> 
> > Zdev static functions does not use vdev argument. Remove it.  
> 
> s/does not use/do not use the/
> 
> > 
> > Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_zdev.c | 20 ++++++++------------
> >  1 file changed, 8 insertions(+), 12 deletions(-)  
> 
> Reviewed-by: Cornelia Huck <cohuck@redhat.com>

Applied 5&6 to vfio next branch for v5.12 w/ Matt and Connie's R-b and
trivial fix above.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 16:06         ` Cornelia Huck
  2021-02-02 17:10           ` Jason Gunthorpe
@ 2021-02-02 17:41           ` Max Gurtovoy
  2021-02-02 17:54             ` Alex Williamson
       [not found]             ` <806c138e-685c-0955-7c15-93cb1d4fe0d9@ozlabs.ru>
  1 sibling, 2 replies; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-02 17:41 UTC (permalink / raw)
  To: Cornelia Huck, Alex Williamson
  Cc: Matthew Rosato, jgg, kvm, linux-kernel, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, yishaih, aik


On 2/2/2021 6:06 PM, Cornelia Huck wrote:
> On Mon, 1 Feb 2021 11:42:30 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
>> On Mon, 1 Feb 2021 12:49:12 -0500
>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>
>>> On 2/1/21 12:14 PM, Cornelia Huck wrote:
>>>> On Mon, 1 Feb 2021 16:28:27 +0000
>>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>      
>>>>> This patch doesn't change any logic but only align to the concept of
>>>>> vfio_pci_core extensions. Extensions that are related to a platform
>>>>> and not to a specific vendor of PCI devices should be part of the core
>>>>> driver. Extensions that are specific for PCI device vendor should go
>>>>> to a dedicated vendor vfio-pci driver.
>>>> My understanding is that igd means support for Intel graphics, i.e. a
>>>> strict subset of x86. If there are other future extensions that e.g.
>>>> only make sense for some devices found only on AMD systems, I don't
>>>> think they should all be included under the same x86 umbrella.
>>>>
>>>> Similar reasoning for nvlink, that only seems to cover support for some
>>>> GPUs under Power, and is not a general platform-specific extension IIUC.
>>>>
>>>> We can arguably do the zdev -> s390 rename (as zpci appears only on
>>>> s390, and all PCI devices will be zpci on that platform), although I'm
>>>> not sure about the benefit.
>>> As far as I can tell, there isn't any benefit for s390 it's just
>>> "re-branding" to match the platform name rather than the zdev moniker,
>>> which admittedly perhaps makes it more clear to someone outside of s390
>>> that any PCI device on s390 is a zdev/zpci type, and thus will use this
>>> extension to vfio_pci(_core).  This would still be true even if we added
>>> something later that builds atop it (e.g. a platform-specific device
>>> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses
>>> these zdev extensions today and would need to continue using them in a
>>> world where mlx5-vfio-pci.ko exists.
>>>
>>> I guess all that to say: if such a rename matches the 'grand scheme' of
>>> this design where we treat arch-level extensions to vfio_pci(_core) as
>>> "vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But
>>> by itself it's not very exciting :)
>> This all seems like the wrong direction to me.  The goal here is to
>> modularize vfio-pci into a core library and derived vendor modules that
>> make use of that core library.  If existing device specific extensions
>> within vfio-pci cannot be turned into vendor modules through this
>> support and are instead redefined as platform specific features of the
>> new core library, that feels like we're already admitting failure of
>> this core library to support known devices, let alone future devices.
>>
>> IGD is a specific set of devices.  They happen to rely on some platform
>> specific support, whose availability should be determined via the
>> vendor module probe callback.  Packing that support into an "x86"
>> component as part of the core feels not only short sighted, but also
>> avoids addressing the issues around how userspace determines an optimal
>> module to use for a device.
> Hm, it seems that not all current extensions to the vfio-pci code are
> created equal.
>
> IIUC, we have igd and nvlink, which are sets of devices that only show
> up on x86 or ppc, respectively, and may rely on some special features
> of those architectures/platforms. The important point is that you have
> a device identifier that you can match a driver against.

maybe you can supply the ids ?

Alexey K, I saw you've been working on the NVLINK2 support for P9. Can
you supply the exact ids that should be bound to this driver?

I'll add it to V3.

>
> On the other side, we have the zdev support, which both requires s390
> and applies to any pci device on s390. If we added special handling for
> ISM on s390, ISM would be in a category similar to igd/nvlink.
>
> Now, if somebody plugs a mlx5 device into an s390, we would want both
> the zdev support and the specialized mlx5 driver. Maybe zdev should be
> some kind of library that can be linked into normal vfio-pci, into
> vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on
> s390 (if configured into the kernel).
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 17:41           ` Max Gurtovoy
@ 2021-02-02 17:54             ` Alex Williamson
  2021-02-02 18:50               ` Jason Gunthorpe
       [not found]             ` <806c138e-685c-0955-7c15-93cb1d4fe0d9@ozlabs.ru>
  1 sibling, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-02-02 17:54 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Cornelia Huck, Matthew Rosato, jgg, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, 2 Feb 2021 19:41:16 +0200
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> On 2/2/2021 6:06 PM, Cornelia Huck wrote:
> > On Mon, 1 Feb 2021 11:42:30 -0700
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >  
> >> On Mon, 1 Feb 2021 12:49:12 -0500
> >> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> >>  
> >>> On 2/1/21 12:14 PM, Cornelia Huck wrote:  
> >>>> On Mon, 1 Feb 2021 16:28:27 +0000
> >>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >>>>        
> >>>>> This patch doesn't change any logic but only align to the concept of
> >>>>> vfio_pci_core extensions. Extensions that are related to a platform
> >>>>> and not to a specific vendor of PCI devices should be part of the core
> >>>>> driver. Extensions that are specific for PCI device vendor should go
> >>>>> to a dedicated vendor vfio-pci driver.  
> >>>> My understanding is that igd means support for Intel graphics, i.e. a
> >>>> strict subset of x86. If there are other future extensions that e.g.
> >>>> only make sense for some devices found only on AMD systems, I don't
> >>>> think they should all be included under the same x86 umbrella.
> >>>>
> >>>> Similar reasoning for nvlink, that only seems to cover support for some
> >>>> GPUs under Power, and is not a general platform-specific extension IIUC.
> >>>>
> >>>> We can arguably do the zdev -> s390 rename (as zpci appears only on
> >>>> s390, and all PCI devices will be zpci on that platform), although I'm
> >>>> not sure about the benefit.  
> >>> As far as I can tell, there isn't any benefit for s390 it's just
> >>> "re-branding" to match the platform name rather than the zdev moniker,
> >>> which admittedly perhaps makes it more clear to someone outside of s390
> >>> that any PCI device on s390 is a zdev/zpci type, and thus will use this
> >>> extension to vfio_pci(_core).  This would still be true even if we added
> >>> something later that builds atop it (e.g. a platform-specific device
> >>> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses
> >>> these zdev extensions today and would need to continue using them in a
> >>> world where mlx5-vfio-pci.ko exists.
> >>>
> >>> I guess all that to say: if such a rename matches the 'grand scheme' of
> >>> this design where we treat arch-level extensions to vfio_pci(_core) as
> >>> "vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But
> >>> by itself it's not very exciting :)  
> >> This all seems like the wrong direction to me.  The goal here is to
> >> modularize vfio-pci into a core library and derived vendor modules that
> >> make use of that core library.  If existing device specific extensions
> >> within vfio-pci cannot be turned into vendor modules through this
> >> support and are instead redefined as platform specific features of the
> >> new core library, that feels like we're already admitting failure of
> >> this core library to support known devices, let alone future devices.
> >>
> >> IGD is a specific set of devices.  They happen to rely on some platform
> >> specific support, whose availability should be determined via the
> >> vendor module probe callback.  Packing that support into an "x86"
> >> component as part of the core feels not only short sighted, but also
> >> avoids addressing the issues around how userspace determines an optimal
> >> module to use for a device.  
> > Hm, it seems that not all current extensions to the vfio-pci code are
> > created equal.
> >
> > IIUC, we have igd and nvlink, which are sets of devices that only show
> > up on x86 or ppc, respectively, and may rely on some special features
> > of those architectures/platforms. The important point is that you have
> > a device identifier that you can match a driver against.  
> 
> maybe you can supply the ids ?
> 
> Alexey K, I saw you've been working on the NVLINK2 for P9. can you 
> supply the exact ids for that should be bounded to this driver ?
> 
> I'll add it to V3.

As noted previously, if we start adding ids for vfio drivers then we
create conflicts with the native host driver.  We cannot register a
vfio PCI driver that automatically claims devices.  At best, this
NVLink driver and an IGD driver could reject devices that they don't
support, i.e. NVIDIA GPUs without the correct platform-provided support
or Intel GPUs without an OpRegion.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 17:54             ` Alex Williamson
@ 2021-02-02 18:50               ` Jason Gunthorpe
  2021-02-02 18:55                 ` Christoph Hellwig
  2021-02-02 19:37                 ` Alex Williamson
  0 siblings, 2 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-02 18:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Max Gurtovoy, Cornelia Huck, Matthew Rosato, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, Feb 02, 2021 at 10:54:55AM -0700, Alex Williamson wrote:

> As noted previously, if we start adding ids for vfio drivers then we
> create conflicts with the native host driver.  We cannot register a
> vfio PCI driver that automatically claims devices.

We can't do that in vfio_pci.ko, but a nvlink_vfio_pci.ko can, just
like the RFC showed with the mlx5 example. The key thing is the module
is not autoloadable and there is no modules.alias data for the PCI
IDs.

The admin must explicitly load the module, just like the admin must
explicitly do "cat > new_id". "modprobe nvlink_vfio_pci" replaces
"newid", and preloading the correct IDs into the module's driver makes
the entire admin experience much more natural and safe.

This could be improved with some simple work in the driver core:

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 2f32f38a11ed0b..dc3b088ad44d69 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -828,6 +828,9 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
 	bool async_allowed;
 	int ret;
 
+	if (drv->flags & DRIVER_EXPLICIT_BIND_ONLY)
+		return 0; /* no match: this driver binds only explicitly */
+
 	ret = driver_match_device(drv, dev);
 	if (ret == 0) {
 		/* no match */

Thus the match table could be properly specified, but only explicit
manual bind would attach the driver. This would cleanly resolve the
duplicate ID problem, and we could even set a wildcard PCI match table
for vfio_pci and eliminate the new_id part of the sequence.
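
For instance, a device specific module could then carry a real match
table and still never auto-bind. A rough sketch (driver name, probe and
remove bodies and the device ID are placeholders, and
DRIVER_EXPLICIT_BIND_ONLY is only the flag proposed above):

#include <linux/module.h>
#include <linux/pci.h>

static const struct pci_device_id nvlink_vfio_pci_table[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x1db1) },	/* placeholder ID */
	{ 0, }
};

static int nvlink_vfio_pci_probe(struct pci_dev *pdev,
				 const struct pci_device_id *id)
{
	/* would register the device with vfio_pci_core here */
	return 0;
}

static void nvlink_vfio_pci_remove(struct pci_dev *pdev)
{
	/* would unregister the device from vfio_pci_core here */
}

static struct pci_driver nvlink_vfio_pci_driver = {
	.name		= "nvlink-vfio-pci",
	.id_table	= nvlink_vfio_pci_table,
	.probe		= nvlink_vfio_pci_probe,
	.remove		= nvlink_vfio_pci_remove,
	/* .driver.flags	= DRIVER_EXPLICIT_BIND_ONLY,	proposed */
};
module_pci_driver(nvlink_vfio_pci_driver);

MODULE_LICENSE("GPL");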

However, I'd prefer to split any driver core work from VFIO parts - so
I'd propose starting by splitting to vfio_pci_core.ko, vfio_pci.ko,
nvlink_vfio_pci.ko, and igd_vfio_pci.ko working as above.

For uAPI compatability vfio_pci.ko would need some
request_module()/symbol_get() trick to pass control over to the device
specific module.

Jason

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 18:50               ` Jason Gunthorpe
@ 2021-02-02 18:55                 ` Christoph Hellwig
  2021-02-02 19:05                   ` Jason Gunthorpe
  2021-02-02 19:37                 ` Alex Williamson
  1 sibling, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2021-02-02 18:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Max Gurtovoy, Cornelia Huck, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, Feb 02, 2021 at 02:50:17PM -0400, Jason Gunthorpe wrote:
> For uAPI compatability vfio_pci.ko would need some
> request_module()/symbol_get() trick to pass control over to the device
> specific module.

Err, don't go there.

Please do the DRIVER_EXPLICIT_BIND_ONLY thing first, which avoids the
need for such hacks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 18:55                 ` Christoph Hellwig
@ 2021-02-02 19:05                   ` Jason Gunthorpe
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-02 19:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Max Gurtovoy, Cornelia Huck, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, Feb 02, 2021 at 06:55:18PM +0000, Christoph Hellwig wrote:
> On Tue, Feb 02, 2021 at 02:50:17PM -0400, Jason Gunthorpe wrote:
> > For uAPI compatability vfio_pci.ko would need some
> > request_module()/symbol_get() trick to pass control over to the device
> > specific module.
> 
> Err, don't go there.
> 
> Please do the DRIVER_EXPLICIT_BIND_ONLY thing first, which avoids the
> need for such hacks.

Ah, right, good point. We don't need to avoid loading nvlink.ko if
loading it is harmless.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 18:50               ` Jason Gunthorpe
  2021-02-02 18:55                 ` Christoph Hellwig
@ 2021-02-02 19:37                 ` Alex Williamson
  2021-02-02 20:44                   ` Jason Gunthorpe
  1 sibling, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-02-02 19:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Max Gurtovoy, Cornelia Huck, Matthew Rosato, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, 2 Feb 2021 14:50:17 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 02, 2021 at 10:54:55AM -0700, Alex Williamson wrote:
> 
> > As noted previously, if we start adding ids for vfio drivers then we
> > create conflicts with the native host driver.  We cannot register a
> > vfio PCI driver that automatically claims devices.  
> 
> We can't do that in vfio_pci.ko, but a nvlink_vfio_pci.ko can, just
> like the RFC showed with the mlx5 example. The key thing is the module
> is not autoloadable and there is no modules.alias data for the PCI
> IDs.
> 
> The admin must explicitly load the module, just like the admin must
> explicitly do "cat > new_id". "modprobe nvlink_vfio_pci" replaces
> "newid", and preloading the correct IDs into the module's driver makes
> the entire admin experience much more natural and safe.
> 
> This could be improved with some simple work in the driver core:
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 2f32f38a11ed0b..dc3b088ad44d69 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -828,6 +828,9 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
>  	bool async_allowed;
>  	int ret;
>  
> +	if (drv->flags & DRIVER_EXPLICIT_BIND_ONLY)
> +		return 0; /* no match: this driver binds only explicitly */
> +
>  	ret = driver_match_device(drv, dev);
>  	if (ret == 0) {
>  		/* no match */
> 
> Thus the match table could be properly specified, but only explicit
> manual bind would attach the driver. This would cleanly resolve the
> duplicate ID problem, and we could even set a wildcard PCI match table
> for vfio_pci and eliminate the new_id part of the sequence.
> 
> However, I'd prefer to split any driver core work from VFIO parts - so
> I'd propose starting by splitting to vfio_pci_core.ko, vfio_pci.ko,
> nvlink_vfio_pci.ko, and igd_vfio_pci.ko working as above.

For the most part, this explicit bind interface is redundant to
driver_override, which already avoids the duplicate ID issue.  A user
specifies a driver to use for a given device, which automatically makes
the driver match accept the device and there are no conflicts with
native drivers.  The problem is still how the user knows to choose
vfio-pci-igd for a device versus vfio-pci-nvlink, other vendor drivers,
or vfio-pci.

A driver id table doesn't really help for binding the device,
ultimately even if a device is in the id table it might fail to probe
due to the missing platform support that each of these igd and nvlink
drivers expose, at which point the user would need to pick a next best
options.  Are you somehow proposing the driver id table for the user to
understand possible drivers, even if that doesn't prioritize them?  I
don't see that there's anything new here otherwise.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 19:37                 ` Alex Williamson
@ 2021-02-02 20:44                   ` Jason Gunthorpe
  2021-02-02 20:59                     ` Max Gurtovoy
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-02 20:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Max Gurtovoy, Cornelia Huck, Matthew Rosato, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, Feb 02, 2021 at 12:37:23PM -0700, Alex Williamson wrote:

> For the most part, this explicit bind interface is redundant to
> driver_override, which already avoids the duplicate ID issue.  

No, the point here is to have the ID tables in the PCI drivers because
they fundamentally only work with their supported IDs. The normal
driver core ID tables are a replacement for all the hardwired if's in
vfio_pci.

driver_override completely disables all the ID checking, it seems only
useful for vfio_pci which works with everything. It should not be used
with something like nvlink_vfio_pci.ko that needs ID checking.

Yes, this DRIVER_EXPLICIT_BIND_ONLY idea somewhat replaces
driver_override because we could set the PCI any match on vfio_pci and
manage the driver binding explicitly instead.

> A driver id table doesn't really help for binding the device,
> ultimately even if a device is in the id table it might fail to
> probe due to the missing platform support that each of these igd and
> nvlink drivers expose,

What happens depends on what makes sense for the driver, some missing
optional support could continue without it, or it could fail.

IGD and nvlink can trivially go onwards and work if they don't find
the platform support.
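
e.g. with the same pattern vfio_pci_enable() already uses today, just
living in the vendor module's setup path instead (sketch only):

	ret = vfio_pci_igd_init(vdev);
	if (ret && ret != -ENODEV)
		return ret;	/* real failure */
	/* -ENODEV: no OpRegion etc., continue without the extra regions */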

Or they might want to fail, I think the mlx5 and probably nvlink
drivers should fail as they are intended to be coupled with userspace
that expects to use their extended features.

In those cases failing is a feature because it prevents the whole
system from going into an unexpected state.
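
As a concrete illustration of that fail-versus-continue policy choice, a minimal probe sketch (the function and helper names here are hypothetical):

static int nvlink_vfio_pci_probe(struct pci_dev *pdev,
				 const struct pci_device_id *id)
{
	/* hypothetical helper: does the platform expose the nvlink bits? */
	if (!nvlink_platform_supported(pdev)) {
		/*
		 * Fail hard: userspace picked this driver because it wants
		 * the extended features, so don't silently run degraded.
		 * A driver that prefers to degrade would just continue here.
		 */
		return -ENODEV;
	}

	/* hypothetical: register with vfio-pci-core as usual */
	return nvlink_vfio_pci_register(pdev);
}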

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 20:44                   ` Jason Gunthorpe
@ 2021-02-02 20:59                     ` Max Gurtovoy
  2021-02-02 21:30                       ` Alex Williamson
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-02 20:59 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Cornelia Huck, Matthew Rosato, kvm, linux-kernel, liranl, oren,
	tzahio, leonro, yarong, aviadye, shahafs, artemp, kwankhede,
	ACurrid, gmataev, cjia, yishaih, aik


On 2/2/2021 10:44 PM, Jason Gunthorpe wrote:
> On Tue, Feb 02, 2021 at 12:37:23PM -0700, Alex Williamson wrote:
>
>> For the most part, this explicit bind interface is redundant to
>> driver_override, which already avoids the duplicate ID issue.
> No, the point here is to have the ID tables in the PCI drivers because
> they fundamentally only work with their supported IDs. The normal
> driver core ID tables are a replacement for all the hardwired if's in
> vfio_pci.
>
> driver_override completely disables all the ID checking, it seems only
> useful for vfio_pci which works with everything. It should not be used
> with something like nvlink_vfio_pci.ko that needs ID checking.

This mechanism of driver_override seems weird to me. In the case of hotplug, 
when both capable drivers (the native device driver and vfio-pci) are loaded, 
both will compete for the device.

I think the proposed flag is very powerful and it does fix the original 
concern Alex had ("if we start adding ids for vfio drivers then we 
create conflicts with the native host driver"), and it's very deterministic.

In this way we'll bind explicitly to a driver.

And the way we'll choose a vfio-pci driver is by device_id + vendor_id + 
subsystem_device + subsystem_vendor.

There shouldn't be 2 vfio-pci drivers that support a device with the same 
4 ids above.

If you don't find a suitable vendor-vfio-pci.ko, you'll try binding 
vfio-pci.ko.

Each driver will publish its supported ids in sysfs to help the user to 
decide.
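
To illustrate that four-ID match, a vendor driver's table could look like the sketch below; the numbers are placeholders rather than real mlx5 IDs, and MODULE_DEVICE_TABLE() is what ends up exposing these matches as module aliases for the user to inspect:

static const struct pci_device_id mlx5_vfio_pci_table[] = {
	/* vendor, device, subsystem vendor, subsystem device */
	{ PCI_DEVICE_SUB(0x15b3, 0x101e, 0x15b3, 0x0021) },	/* placeholder IDs */
	{ 0, }
};
MODULE_DEVICE_TABLE(pci, mlx5_vfio_pci_table);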

>
> Yes, this DRIVER_EXPLICIT_BIND_ONLY idea somewhat replaces
> driver_override because we could set the PCI any match on vfio_pci and
> manage the driver binding explicitly instead.
>
>> A driver id table doesn't really help for binding the device,
>> ultimately even if a device is in the id table it might fail to
>> probe due to the missing platform support that each of these igd and
>> nvlink drivers expose,
> What happens depends on what makes sense for the driver, some missing
> optional support could continue without it, or it could fail.
>
> IGD and nvlink can trivially go onwards and work if they don't find
> the platform support.
>
> Or they might want to fail, I think the mlx5 and probably nvlink
> drivers should fail as they are intended to be coupled with userspace
> that expects to use their extended features.
>
> In those cases failing is a feature because it prevents the whole
> system from going into an unexpected state.
>
> Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 20:59                     ` Max Gurtovoy
@ 2021-02-02 21:30                       ` Alex Williamson
  2021-02-02 23:06                         ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-02-02 21:30 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Gunthorpe, Cornelia Huck, Matthew Rosato, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, 2 Feb 2021 22:59:27 +0200
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> On 2/2/2021 10:44 PM, Jason Gunthorpe wrote:
> > On Tue, Feb 02, 2021 at 12:37:23PM -0700, Alex Williamson wrote:
> >  
> >> For the most part, this explicit bind interface is redundant to
> >> driver_override, which already avoids the duplicate ID issue.  
> > No, the point here is to have the ID tables in the PCI drivers because
> > they fundamentally only work with their supported IDs. The normal
> > driver core ID tables are a replacement for all the hardwired if's in
> > vfio_pci.
> >
> > driver_override completely disables all the ID checking, it seems only
> > useful for vfio_pci which works with everything. It should not be used
> > with something like nvlink_vfio_pci.ko that needs ID checking.  
> 
> This mechanism of driver_override seems weird to me. In the case of hotplug, 
> when both capable drivers (the native device driver and vfio-pci) are loaded, 
> both will compete for the device.

How would the hot-added device have driver_override set?  There's no
competition: the native device driver would claim the device, and the
user could set driver_override, unbind and re-probe to get their
specified driver.  Once a driver_override is set, there cannot be any
competition; driver_override is used exclusively for matching if set.

> I think the proposed flag is very powerful and it does fix the original 
> concern Alex had ("if we start adding ids for vfio drivers then we 
> create conflicts with the native host driver"), and it's very deterministic.
> 
> In this way we'll bind explicitly to a driver.
> 
> And the way we'll choose a vfio-pci driver is by device_id + vendor_id + 
> subsystem_device + subsystem_vendor.
> 
> There shouldn't be 2 vfio-pci drivers that support a device with the same 
> 4 ids above.

It's entirely possible there could be, but without neural implant
devices to interpret the user's intentions, I think we'll have to
accept there could be non-determinism here.

The first set of users already fail this specification though, we can't
base it strictly on device and vendor IDs, we need wildcards, class
codes, revision IDs, etc., just like any other PCI driver.  We're not
going to maintain a set of specific device IDs for the IGD extension,
nor I suspect the NVLINK support as that would require a kernel update
every time a new GPU is released that makes use of the same interface.

As I understand Jason's reply, these vendor drivers would have an ids
table and a user could look at modalias for the device to compare to
the driver supported aliases for a match.  Does kmod already have this
as a utility outside of modprobe?

> If you don't find a suitable vendor-vfio-pci.ko, you'll try binding 
> vfio-pci.ko.
> 
> Each driver will publish its supported ids in sysfs to help the user to 
> decide.

Seems like it would be embedded in the aliases for the module, with
this explicit binding flag being the significant difference that
prevents auto loading the device.  We still have one of the races that
driver_override resolves though, the proposed explicit bind flag is on
the driver not the device, so a native host driver being loaded due to
a hotplug operation or independent actions of different admins could
usurp the device between unbind of old driver and bind to new driver.

> > Yes, this DRIVER_EXPLICIT_BIND_ONLY idea somewhat replaces
> > driver_override because we could set the PCI any match on vfio_pci and
> > manage the driver binding explicitly instead.
> >  
> >> A driver id table doesn't really help for binding the device,
> >> ultimately even if a device is in the id table it might fail to
> >> probe due to the missing platform support that each of these igd and
> >> nvlink drivers expose,  
> > What happens depends on what makes sense for the driver, some missing
> > optional support could continue without it, or it could fail.
> >
> > IGD and nvlink can trivially go onwards and work if they don't find
> > the platform support.

This seems unpredictable from a user perspective.  In either the igd or
nvlink cases, if the platform features aren't available, the feature
set of the device is reduced.  That's not apparent until the user tries
to start interacting with the device if the device specific driver
doesn't fail the probe.  Userspace policy would need to decide if a
fallback driver is acceptable or the vendor specific driver failure is
fatal. Thanks,

Alex


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 21:30                       ` Alex Williamson
@ 2021-02-02 23:06                         ` Jason Gunthorpe
  2021-02-02 23:59                           ` Alex Williamson
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-02 23:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Max Gurtovoy, Cornelia Huck, Matthew Rosato, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, Feb 02, 2021 at 02:30:13PM -0700, Alex Williamson wrote:

> The first set of users already fail this specification though, we can't
> base it strictly on device and vendor IDs, we need wildcards, class
> codes, revision IDs, etc., just like any other PCI driver.  We're not
> going to maintain a set of specific device IDs for the IGD
> extension,

The Intel GPU driver already has a include/drm/i915_pciids.h that
organizes all the PCI match table entries, no reason why VFIO IGD
couldn't include that too and use the same match table as the real GPU
driver. Same HW right?
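
For example, assuming the grouped ID helpers in include/drm/i915_pciids.h, a vfio-igd match table reusing that list might look roughly like the sketch below (illustrative only; the driver_data argument is unused here):

#include <drm/i915_pciids.h>

static const struct pci_device_id igd_vfio_pci_table[] = {
	INTEL_SKL_IDS(NULL),	/* Skylake parts, as one example */
	INTEL_KBL_IDS(NULL),	/* Kaby Lake parts */
	/* ... further generations as they are vetted ... */
	{ 0, }
};
MODULE_DEVICE_TABLE(pci, igd_vfio_pci_table);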

Also how sure are you that this loose detection is going to work with
future Intel discrete GPUs that likely won't need vfio_igd?

> nor I suspect the NVLINK support as that would require a kernel update
> every time a new GPU is released that makes use of the same interface.

The nvlink device that required this special vfio code was a one
off. Current devices do not use it. Not having an exact PCI ID match
in this case is a bug.

> As I understand Jason's reply, these vendor drivers would have an ids
> table and a user could look at modalias for the device to compare to
> the driver supported aliases for a match.  Does kmod already have this
> as a utility outside of modprobe?

I think this is worth exploring.

One idea that fits nicely with the existing infrastructure is to add
to driver core a 'device mode' string. It would be "default" or "vfio"

devices in vfio mode only match vfio mode device_drivers.

devices in vfio mode generate a unique modalias string that includes
some additional 'mode=vfio' identifier

drivers that run in vfio mode generate a module table string that
includes the same mode=vfio

The driver core can trigger driver auto loading solely based on the
mode string; it happens naturally.

All the existing udev, depmod/etc tooling will transparently work.

Like driver_override, but doesn't bypass all the ID and module loading
parts of the driver core.

(But let's not get too far down this path until we can agree that
embracing the driver core like the RFC contemplates is the agreed
direction)

> Seems like it would be embedded in the aliases for the module, with
> this explicit binding flag being the significant difference that
> prevents auto loading the device.  We still have one of the races that
> driver_override resolves though, the proposed explicit bind flag is on
> the driver not the device, so a native host driver being loaded due to
> a hotplug operation or independent actions of different admins could
> usurp the device between unbind of old driver and bind to new driver.

This is because sysfs doesn't have an atomic way to bind and
rebind a device; teaching 'bind' how to do that would also solve
this problem.

> This seems unpredictable from a user perspective.  In either the igd or
> nvlink cases, if the platform features aren't available, the feature
> set of the device is reduced.  That's not apparent until the user tries
> to start interacting with the device if the device specific driver
> doesn't fail the probe.  Userspace policy would need to decide if a
> fallback driver is acceptable or the vendor specific driver failure is
> fatal. Thanks,

It matches today's behavior; if it is a good idea to preserve it, then
it can be so without much effort.

I do prefer the explicitness because I think most use cases have a
user that requires the special driver to run. Explicitly binding the
required driver seems preferable.

Certainly nvlink and mlx5 should fail probe and not fall back to plain
vfio-pci. If the user wants plain vfio-pci, the user should ask explicitly. At
least for the mlx5 cases this is a completely reasonable thing to
do. I like that we can support this choice.

I'm not so clear on IGD, especially how it would interact with future
discrete cards that probably don't need it. IMHO, it would be fine if
it were different for some good reason.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 23:06                         ` Jason Gunthorpe
@ 2021-02-02 23:59                           ` Alex Williamson
  2021-02-03 13:54                             ` Jason Gunthorpe
  2021-02-11  8:44                             ` Christoph Hellwig
  0 siblings, 2 replies; 57+ messages in thread
From: Alex Williamson @ 2021-02-02 23:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Max Gurtovoy, Cornelia Huck, Matthew Rosato, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, 2 Feb 2021 19:06:04 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 02, 2021 at 02:30:13PM -0700, Alex Williamson wrote:
> 
> > The first set of users already fail this specification though, we can't
> > base it strictly on device and vendor IDs, we need wildcards, class
> > codes, revision IDs, etc., just like any other PCI driver.  We're not
> > going to maintain a set of specific device IDs for the IGD
> > extension,  
> 
> The Intel GPU driver already has a include/drm/i915_pciids.h that
> organizes all the PCI match table entries, no reason why VFIO IGD
> couldn't include that too and use the same match table as the real GPU
> driver. Same HW right?

vfio-pci-igd support knows very little about the device, we're
effectively just exposing a firmware table and some of the host bridge
config space (read-only).  So the idea that the host kernel needs to
have updated i915 support in order to expose the device to userspace
with these extra regions is a bit silly.

> Also how sure are you that this loose detection is going to work with
> future Intel discrete GPUs that likely won't need vfio_igd?

Not at all, which is one more reason we don't want to rely on i915's
device table, which would likely support those as well.  We might only
want to bind to GPUs on the root complex, or at address 0000:00:02.0.
Our "what to reject" algorithm might need to evolve as those arrive,
but I don't think that means we need to explicitly list every device ID
either.
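
A loose match of that sort could be expressed as a class-code entry plus a probe-time check, something like the sketch below (illustrative only, not the actual vfio-igd policy):

static const struct pci_device_id igd_vfio_pci_table[] = {
	{ .vendor = PCI_VENDOR_ID_INTEL, .device = PCI_ANY_ID,
	  .subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID,
	  .class = PCI_CLASS_DISPLAY_VGA << 8, .class_mask = 0xffff00 },
	{ 0, }
};

static int igd_vfio_pci_probe(struct pci_dev *pdev,
			      const struct pci_device_id *id)
{
	/* only the integrated GPU on the root bus at 00:02.0 */
	if (!pci_is_root_bus(pdev->bus) || pdev->devfn != PCI_DEVFN(0x2, 0))
		return -ENODEV;
	/* OpRegion / host bridge availability checks would follow here */
	return 0;
}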

> > nor I suspect the NVLINK support as that would require a kernel update
> > every time a new GPU is released that makes use of the same interface.  
> 
> The nvlink device that required this special vfio code was a one
> off. Current devices do not use it. Not having an exact PCI ID match
> in this case is a bug.

AIUI, the quirk is only activated when there's a firmware table to
support it.  No firmware table, no driver bind, no need to use
explicit IDs.  Vendor and class code should be enough.
 
> > As I understand Jason's reply, these vendor drivers would have an ids
> > table and a user could look at modalias for the device to compare to
> > the driver supported aliases for a match.  Does kmod already have this
> > as a utility outside of modprobe?  
> 
> I think this is worth exploring.
> 
> One idea that fits nicely with the existing infrastructure is to add
> to driver core a 'device mode' string. It would be "default" or "vfio"
> 
> devices in vfio mode only match vfio mode device_drivers.
> 
> devices in vfio mode generate a unique modalias string that includes
> some additional 'mode=vfio' identifier
> 
> drivers that run in vfio mode generate a module table string that
> includes the same mode=vfio
> 
> The driver core can trigger driver auto loading solely based on the
> mode string; it happens naturally.
> 
> All the existing udev, depmod/etc tooling will transparently work.
> 
> Like driver_override, but doesn't bypass all the ID and module loading
> parts of the driver core.
> 
> (But let's not get too far down this path until we can agree that
> embracing the driver core like the RFC contemplates is the agreed
> direction)

I'm not sure I fully follow the mechanics of this.  I'm interpreting
this as something like a sub-class of drivers where for example
vfio-pci class drivers would have a vfio-pci: alias prefix rather than
pci:.  There might be some sysfs attribute for the device that would
allow the user to write an alias prefix and would that trigger the
(ex.) pci-core to send remove uevents for the pci: modalias device and
add uevents for the vfio-pci: modalias device?  Some ordering rules
would then allow vendor/device modules to precede vfio-pci, which would
have only a wildcard id table?

I need to churn on that for a while, but if driver-core folks are
interested, maybe it could be a good idea...  Thanks,

Alex


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 23:59                           ` Alex Williamson
@ 2021-02-03 13:54                             ` Jason Gunthorpe
  2021-02-11  8:47                               ` Christoph Hellwig
  2021-02-11  8:44                             ` Christoph Hellwig
  1 sibling, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-03 13:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Max Gurtovoy, Cornelia Huck, Matthew Rosato, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, Feb 02, 2021 at 04:59:23PM -0700, Alex Williamson wrote:
> On Tue, 2 Feb 2021 19:06:04 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Feb 02, 2021 at 02:30:13PM -0700, Alex Williamson wrote:
> > 
> > > The first set of users already fail this specification though, we can't
> > > base it strictly on device and vendor IDs, we need wildcards, class
> > > codes, revision IDs, etc., just like any other PCI driver.  We're not
> > > going to maintain a set of specific device IDs for the IGD
> > > extension,  
> > 
> > The Intel GPU driver already has a include/drm/i915_pciids.h that
> > organizes all the PCI match table entries, no reason why VFIO IGD
> > couldn't include that too and use the same match table as the real GPU
> > driver. Same HW right?
> 
> vfio-pci-igd support knows very little about the device, we're
> effectively just exposing a firmware table and some of the host bridge
> config space (read-only).  So the idea that the host kernel needs to
> have updated i915 support in order to expose the device to userspace
> with these extra regions is a bit silly.

It is our standard driver process in Linux; we can use it here in vfio.

My point is we don't have to preserve the currently loose matching
semantics and we can have explicit ID lists for these drivers. It is
not a blocker to this direction.

We can argue if it is better/worse, but the rest of the kernel works
just fine on an 'update the ID match table when the HW is vetted'
principle. It is not an important enough reason to reject this
direction.

> I'm not sure I fully follow the mechanics of this.  I'm interpreting
> this as something like a sub-class of drivers where for example
> vfio-pci class drivers would have a vfio-pci: alias prefix rather than
> pci:.  There might be some sysfs attribute for the device that would
> allow the user to write an alias prefix and would that trigger the
> (ex.) pci-core to send remove uevents for the pci: modalias device and
> add uevents for the vfio-pci: modalias device?  Some ordering rules
> would then allow vendor/device modules to precede vfio-pci, which would
> have only a wildcard id table?

Yes, I think you have the general idea

> I need to churn on that for a while, but if driver-core folks are
> interested, maybe it could be a good idea...  Thanks,

My intention is to show there are options here; we are not limited to
just what we have today.

If people are accepting that these device-specific drivers are
required then we need to come to a community consensus to decide what
direction to pursue:

* Do we embrace the driver core and use it to load VFIO modules like a
  normal subsystem (this RFC)

OR 

* Do we make a driver-core-like thing inside the VFIO bus drivers and
  have them run their own special driver matching, binding, and loading
  scheme. (May RFC)

Haven't heard a 3rd option yet.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
       [not found]             ` <806c138e-685c-0955-7c15-93cb1d4fe0d9@ozlabs.ru>
@ 2021-02-03 16:07               ` Max Gurtovoy
       [not found]                 ` <83ef0164-6291-c3d1-0ce5-2c9d6c97469e@ozlabs.ru>
  2021-02-04  9:12               ` Max Gurtovoy
  1 sibling, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-03 16:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Cornelia Huck, Alex Williamson
  Cc: Matthew Rosato, jgg, kvm, linux-kernel, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, yishaih


On 2/3/2021 5:24 AM, Alexey Kardashevskiy wrote:
>
>
> On 03/02/2021 04:41, Max Gurtovoy wrote:
>>
>> On 2/2/2021 6:06 PM, Cornelia Huck wrote:
>>> On Mon, 1 Feb 2021 11:42:30 -0700
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>
>>>> On Mon, 1 Feb 2021 12:49:12 -0500
>>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>>
>>>>> On 2/1/21 12:14 PM, Cornelia Huck wrote:
>>>>>> On Mon, 1 Feb 2021 16:28:27 +0000
>>>>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>>>> This patch doesn't change any logic but only align to the 
>>>>>>> concept of
>>>>>>> vfio_pci_core extensions. Extensions that are related to a platform
>>>>>>> and not to a specific vendor of PCI devices should be part of 
>>>>>>> the core
>>>>>>> driver. Extensions that are specific for PCI device vendor 
>>>>>>> should go
>>>>>>> to a dedicated vendor vfio-pci driver.
>>>>>> My understanding is that igd means support for Intel graphics, 
>>>>>> i.e. a
>>>>>> strict subset of x86. If there are other future extensions that e.g.
>>>>>> only make sense for some devices found only on AMD systems, I don't
>>>>>> think they should all be included under the same x86 umbrella.
>>>>>>
>>>>>> Similar reasoning for nvlink, that only seems to cover support 
>>>>>> for some
>>>>>> GPUs under Power, and is not a general platform-specific 
>>>>>> extension IIUC.
>>>>>>
>>>>>> We can arguably do the zdev -> s390 rename (as zpci appears only on
>>>>>> s390, and all PCI devices will be zpci on that platform), 
>>>>>> although I'm
>>>>>> not sure about the benefit.
>>>>> As far as I can tell, there isn't any benefit for s390 it's just
>>>>> "re-branding" to match the platform name rather than the zdev 
>>>>> moniker,
>>>>> which admittedly perhaps makes it more clear to someone outside of 
>>>>> s390
>>>>> that any PCI device on s390 is a zdev/zpci type, and thus will use 
>>>>> this
>>>>> extension to vfio_pci(_core).  This would still be true even if we 
>>>>> added
>>>>> something later that builds atop it (e.g. a platform-specific device
>>>>> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on 
>>>>> s390x uses
>>>>> these zdev extensions today and would need to continue using them 
>>>>> in a
>>>>> world where mlx5-vfio-pci.ko exists.
>>>>>
>>>>> I guess all that to say: if such a rename matches the 'grand 
>>>>> scheme' of
>>>>> this design where we treat arch-level extensions to 
>>>>> vfio_pci(_core) as
>>>>> "vfio_pci_(arch)" then I'm not particularly opposed to the 
>>>>> rename.  But
>>>>> by itself it's not very exciting :)
>>>> This all seems like the wrong direction to me.  The goal here is to
>>>> modularize vfio-pci into a core library and derived vendor modules 
>>>> that
>>>> make use of that core library.  If existing device specific extensions
>>>> within vfio-pci cannot be turned into vendor modules through this
>>>> support and are instead redefined as platform specific features of the
>>>> new core library, that feels like we're already admitting failure of
>>>> this core library to support known devices, let alone future devices.
>>>>
>>>> IGD is a specific set of devices.  They happen to rely on some 
>>>> platform
>>>> specific support, whose availability should be determined via the
>>>> vendor module probe callback.  Packing that support into an "x86"
>>>> component as part of the core feels not only short sighted, but also
>>>> avoids addressing the issues around how userspace determines an 
>>>> optimal
>>>> module to use for a device.
>>> Hm, it seems that not all current extensions to the vfio-pci code are
>>> created equal.
>>>
>>> IIUC, we have igd and nvlink, which are sets of devices that only show
>>> up on x86 or ppc, respectively, and may rely on some special features
>>> of those architectures/platforms. The important point is that you have
>>> a device identifier that you can match a driver against.
>>
>> maybe you can supply the ids ?
>>
>> Alexey K, I saw you've been working on the NVLINK2 for P9. Can you 
>> supply the exact ids that should be bound to this driver?
>>
>> I'll add it to V3.
>
> Sorry no, I never really had the exact ids, they keep popping up after 
> years.
>
> The nvlink2 support can only work if the platform supports it as there 
> is nothing to do to the GPUs themselves, it is about setting up those 
> nvlink2 links. If the platform advertises certain features in the 
> device tree - then any GPU in the SXM2 form factor (not sure about the 
> exact abbrev, not a usual PCIe device) should just work.
>
> If the nvlink2 "subdriver" fails to find nvlinks (the platform does 
> not advertise them or some parameters are new to this subdriver, such 
> as link-speed), we still fall back to generic vfio-pci and try passing 
> through this GPU as a plain PCI device with no extra links. 
> Semi-optimal but if the user is mining cryptocoins, then highspeed 
> links are not really that important :) And the nvidia driver is 
> capable of running such GPUs without nvlinks. Thanks,

 From the above I can conclude that nvlink2_vfio_pci will need to find 
nvlinks during the probe and fail in case it doesn't find them.

Which platform driver is responsible for advertising it? Can we use 
aux_device to represent these nvlinks?

In case of a failure, the user can fall back to vfio-pci.ko and run 
without the nvlinks as you said.

This is deterministic behavior and the user will know exactly what he's 
getting from the vfio subsystem.

>
>
>>
>>>
>>> On the other side, we have the zdev support, which both requires s390
>>> and applies to any pci device on s390. If we added special handling for
>>> ISM on s390, ISM would be in a category similar to igd/nvlink.
>>>
>>> Now, if somebody plugs a mlx5 device into an s390, we would want both
>>> the zdev support and the specialized mlx5 driver. Maybe zdev should be
>>> some kind of library that can be linked into normal vfio-pci, into
>>> vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on
>>> s390 (if configured into the kernel).
>>>
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
       [not found]             ` <806c138e-685c-0955-7c15-93cb1d4fe0d9@ozlabs.ru>
  2021-02-03 16:07               ` Max Gurtovoy
@ 2021-02-04  9:12               ` Max Gurtovoy
  2021-02-11  8:50                 ` Christoph Hellwig
  1 sibling, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-04  9:12 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Cornelia Huck, Alex Williamson
  Cc: Matthew Rosato, jgg, kvm, linux-kernel, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, yishaih


On 2/3/2021 5:24 AM, Alexey Kardashevskiy wrote:
>
>
> On 03/02/2021 04:41, Max Gurtovoy wrote:
>>
>> On 2/2/2021 6:06 PM, Cornelia Huck wrote:
>>> On Mon, 1 Feb 2021 11:42:30 -0700
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>
>>>> On Mon, 1 Feb 2021 12:49:12 -0500
>>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>>
>>>>> On 2/1/21 12:14 PM, Cornelia Huck wrote:
>>>>>> On Mon, 1 Feb 2021 16:28:27 +0000
>>>>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>>>> This patch doesn't change any logic but only align to the 
>>>>>>> concept of
>>>>>>> vfio_pci_core extensions. Extensions that are related to a platform
>>>>>>> and not to a specific vendor of PCI devices should be part of 
>>>>>>> the core
>>>>>>> driver. Extensions that are specific for PCI device vendor 
>>>>>>> should go
>>>>>>> to a dedicated vendor vfio-pci driver.
>>>>>> My understanding is that igd means support for Intel graphics, 
>>>>>> i.e. a
>>>>>> strict subset of x86. If there are other future extensions that e.g.
>>>>>> only make sense for some devices found only on AMD systems, I don't
>>>>>> think they should all be included under the same x86 umbrella.
>>>>>>
>>>>>> Similar reasoning for nvlink, that only seems to cover support 
>>>>>> for some
>>>>>> GPUs under Power, and is not a general platform-specific 
>>>>>> extension IIUC.
>>>>>>
>>>>>> We can arguably do the zdev -> s390 rename (as zpci appears only on
>>>>>> s390, and all PCI devices will be zpci on that platform), 
>>>>>> although I'm
>>>>>> not sure about the benefit.
>>>>> As far as I can tell, there isn't any benefit for s390 it's just
>>>>> "re-branding" to match the platform name rather than the zdev 
>>>>> moniker,
>>>>> which admittedly perhaps makes it more clear to someone outside of 
>>>>> s390
>>>>> that any PCI device on s390 is a zdev/zpci type, and thus will use 
>>>>> this
>>>>> extension to vfio_pci(_core).  This would still be true even if we 
>>>>> added
>>>>> something later that builds atop it (e.g. a platform-specific device
>>>>> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on 
>>>>> s390x uses
>>>>> these zdev extensions today and would need to continue using them 
>>>>> in a
>>>>> world where mlx5-vfio-pci.ko exists.
>>>>>
>>>>> I guess all that to say: if such a rename matches the 'grand 
>>>>> scheme' of
>>>>> this design where we treat arch-level extensions to 
>>>>> vfio_pci(_core) as
>>>>> "vfio_pci_(arch)" then I'm not particularly opposed to the 
>>>>> rename.  But
>>>>> by itself it's not very exciting :)
>>>> This all seems like the wrong direction to me.  The goal here is to
>>>> modularize vfio-pci into a core library and derived vendor modules 
>>>> that
>>>> make use of that core library.  If existing device specific extensions
>>>> within vfio-pci cannot be turned into vendor modules through this
>>>> support and are instead redefined as platform specific features of the
>>>> new core library, that feels like we're already admitting failure of
>>>> this core library to support known devices, let alone future devices.
>>>>
>>>> IGD is a specific set of devices.  They happen to rely on some 
>>>> platform
>>>> specific support, whose availability should be determined via the
>>>> vendor module probe callback.  Packing that support into an "x86"
>>>> component as part of the core feels not only short sighted, but also
>>>> avoids addressing the issues around how userspace determines an 
>>>> optimal
>>>> module to use for a device.
>>> Hm, it seems that not all current extensions to the vfio-pci code are
>>> created equal.
>>>
>>> IIUC, we have igd and nvlink, which are sets of devices that only show
>>> up on x86 or ppc, respectively, and may rely on some special features
>>> of those architectures/platforms. The important point is that you have
>>> a device identifier that you can match a driver against.
>>
>> maybe you can supply the ids ?
>>
>> Alexey K, I saw you've been working on the NVLINK2 for P9. Can you 
>> supply the exact ids that should be bound to this driver?
>>
>> I'll add it to V3.
>
> Sorry no, I never really had the exact ids, they keep popping up after 
> years.
>
> The nvlink2 support can only work if the platform supports it as there 
> is nothing to do to the GPUs themselves, it is about setting up those 
> nvlink2 links. If the platform advertises certain features in the 
> device tree - then any GPU in the SXM2 form factor (not sure about the 
> exact abbrev, not a usual PCIe device) should just work.
>
> If the nvlink2 "subdriver" fails to find nvlinks (the platform does 
> not advertise them or some parameters are new to this subdriver, such 
> as link-speed), we still fall back to generic vfio-pci and try passing 
> through this GPU as a plain PCI device with no extra links. 
> Semi-optimal but if the user is mining cryptocoins, then highspeed 
> links are not really that important :) And the nvidia driver is 
> capable of running such GPUs without nvlinks. Thanks,
>
I see.

But is the PCI function (the bound BDF) a GPU function or an NVLINK function?

If it's an NVLINK function then we should fail probing in the host vfio-pci 
driver.

If it's a GPU function then it shouldn't be called the nvlink2 vfio-pci 
driver. It's just an extension in the GPU vfio-pci driver.

>
>>
>>>
>>> On the other side, we have the zdev support, which both requires s390
>>> and applies to any pci device on s390. If we added special handling for
>>> ISM on s390, ISM would be in a category similar to igd/nvlink.
>>>
>>> Now, if somebody plugs a mlx5 device into an s390, we would want both
>>> the zdev support and the specialized mlx5 driver. Maybe zdev should be
>>> some kind of library that can be linked into normal vfio-pci, into
>>> vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on
>>> s390 (if configured into the kernel).
>>>
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
       [not found]                 ` <83ef0164-6291-c3d1-0ce5-2c9d6c97469e@ozlabs.ru>
@ 2021-02-04 12:51                   ` Jason Gunthorpe
  2021-02-05  0:42                     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-04 12:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Max Gurtovoy, Cornelia Huck, Alex Williamson, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih

On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:

> It is system firmware (==bios) which puts stuff in the device tree. The
> stuff is:
> 1. emulated pci devices (custom pci bridges), one per nvlink, emulated by
> the firmware, the driver is "ibmnpu" and it is a part on the nvidia driver;
> these are basically config space proxies to the cpu's side of nvlink.
> 2. interconnect information - which of 6 gpus nvlinks connected to which
> nvlink on the cpu side, and memory ranges.

So what is this vfio_nvlink driver supposed to be bound to? 

The "emulated pci devices"?

A real GPU function?

A real nvswitch function?

Something else?

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-04 12:51                   ` Jason Gunthorpe
@ 2021-02-05  0:42                     ` Alexey Kardashevskiy
  2021-02-08 12:44                       ` Max Gurtovoy
  2021-02-08 18:13                       ` Jason Gunthorpe
  0 siblings, 2 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2021-02-05  0:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Max Gurtovoy, Cornelia Huck, Alex Williamson, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih



On 04/02/2021 23:51, Jason Gunthorpe wrote:
> On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:
> 
>> It is system firmware (==bios) which puts stuff in the device tree. The
>> stuff is:
>> 1. emulated pci devices (custom pci bridges), one per nvlink, emulated by
>> the firmware, the driver is "ibmnpu" and it is a part on the nvidia driver;
>> these are basically config space proxies to the cpu's side of nvlink.
>> 2. interconnect information - which of 6 gpus nvlinks connected to which
>> nvlink on the cpu side, and memory ranges.
> 
> So what is this vfio_nvlink driver supposed to be bound to?
> 
> The "emulated pci devices"?

Yes.

> A real GPU function?

Yes.

> A real nvswitch function?

What do you mean by this exactly? The cpu side of nvlink is "emulated 
pci devices", the gpu side is not in pci space at all, the nvidia driver 
manages it via the gpu's mmio or/and cfg space.

> Something else?

Nope :)
In this new scheme which you are proposing it should be 2 drivers, I guess.

> 
> Jason
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-05  0:42                     ` Alexey Kardashevskiy
@ 2021-02-08 12:44                       ` Max Gurtovoy
  2021-02-09  1:55                         ` Alexey Kardashevskiy
  2021-02-08 18:13                       ` Jason Gunthorpe
  1 sibling, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-08 12:44 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Jason Gunthorpe
  Cc: Cornelia Huck, Alex Williamson, Matthew Rosato, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih


On 2/5/2021 2:42 AM, Alexey Kardashevskiy wrote:
>
>
> On 04/02/2021 23:51, Jason Gunthorpe wrote:
>> On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:
>>
>>> It is system firmware (==bios) which puts stuff in the device tree. The
>>> stuff is:
>>> 1. emulated pci devices (custom pci bridges), one per nvlink, 
>>> emulated by
>>> the firmware, the driver is "ibmnpu" and it is a part on the nvidia 
>>> driver;
>>> these are basically config space proxies to the cpu's side of nvlink.
>>> 2. interconnect information - which of 6 gpus nvlinks connected to 
>>> which
>>> nvlink on the cpu side, and memory ranges.
>>
>> So what is this vfio_nvlink driver supposed to be bound to?
>>
>> The "emulated pci devices"?
>
> Yes.
>
>> A real GPU function?
>
> Yes.
>
>> A real nvswitch function?
>
> What do you mean by this exactly? The cpu side of nvlink is "emulated 
> pci devices", the gpu side is not in pci space at all, the nvidia 
> driver manages it via the gpu's mmio or/and cfg space.
>
>> Something else?
>
> Nope :)
> In this new scheme which you are proposing it should be 2 drivers, I 
> guess.

I see.

So should it be nvidia_vfio_pci.ko? And it will do the NVLINK stuff in 
case the class code matches, and otherwise just work as a simple vfio_pci GPU?

What about the second driver? Should it be called ibmnpu_vfio_pci.ko?

>
>>
>> Jason
>>
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-05  0:42                     ` Alexey Kardashevskiy
  2021-02-08 12:44                       ` Max Gurtovoy
@ 2021-02-08 18:13                       ` Jason Gunthorpe
  2021-02-09  1:51                         ` Alexey Kardashevskiy
  1 sibling, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-08 18:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Max Gurtovoy, Cornelia Huck, Alex Williamson, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih

On Fri, Feb 05, 2021 at 11:42:11AM +1100, Alexey Kardashevskiy wrote:
> > A real nvswitch function?
> 
> What do you mean by this exactly? The cpu side of nvlink is "emulated pci
> devices", the gpu side is not in pci space at all, the nvidia driver manages
> it via the gpu's mmio or/and cfg space.

Some versions of the nvswitch chip have a PCI-E link too; that is what
I thought this was all about when I first saw it.

So, it is really a special set of functions for NVIDIA GPU device
assignment only applicable to P9 systems, much like IGD is for Intel
on x86.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-08 18:13                       ` Jason Gunthorpe
@ 2021-02-09  1:51                         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2021-02-09  1:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Max Gurtovoy, Cornelia Huck, Alex Williamson, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih



On 09/02/2021 05:13, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 11:42:11AM +1100, Alexey Kardashevskiy wrote:
>>> A real nvswitch function?
>>
>> What do you mean by this exactly? The cpu side of nvlink is "emulated pci
>> devices", the gpu side is not in pci space at all, the nvidia driver manages
>> it via the gpu's mmio or/and cfg space.
> 
> Some versions of the nvswitch chip have a PCI-E link too; that is what
> I thought this was all about when I first saw it.

> So, it is really a special set of functions for NVIDIA GPU device
> assignment only applicable to P9 systems, much like IGD is for Intel
> on x86.



These GPUs are not P9 specific and they all have both PCIe and NVLink2 
links. The special part is that some nvlinks are between the P9 and a GPU and 
the rest are between GPUs; everywhere else (x86, maybe ARM) the nvlinks 
are used between GPUs, but even there I do not think the nvlink logic is 
presented to the host in the PCI space.



-- 
Alexey

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-08 12:44                       ` Max Gurtovoy
@ 2021-02-09  1:55                         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 57+ messages in thread
From: Alexey Kardashevskiy @ 2021-02-09  1:55 UTC (permalink / raw)
  To: Max Gurtovoy, Jason Gunthorpe
  Cc: Cornelia Huck, Alex Williamson, Matthew Rosato, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih



On 08/02/2021 23:44, Max Gurtovoy wrote:
> 
> On 2/5/2021 2:42 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 04/02/2021 23:51, Jason Gunthorpe wrote:
>>> On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:
>>>
>>>> It is system firmware (==bios) which puts stuff in the device tree. The
>>>> stuff is:
>>>> 1. emulated pci devices (custom pci bridges), one per nvlink, 
>>>> emulated by
>>>> the firmware, the driver is "ibmnpu" and it is a part on the nvidia 
>>>> driver;
>>>> these are basically config space proxies to the cpu's side of nvlink.
>>>> 2. interconnect information - which of 6 gpus nvlinks connected to 
>>>> which
>>>> nvlink on the cpu side, and memory ranges.
>>>
>>> So what is this vfio_nvlink driver supposed to be bound to?
>>>
>>> The "emulated pci devices"?
>>
>> Yes.
>>
>>> A real GPU function?
>>
>> Yes.
>>
>>> A real nvswitch function?
>>
>> What do you mean by this exactly? The cpu side of nvlink is "emulated 
>> pci devices", the gpu side is not in pci space at all, the nvidia 
>> driver manages it via the gpu's mmio or/and cfg space.
>>
>>> Something else?
>>
>> Nope :)
>> In this new scheme which you are proposing it should be 2 drivers, I 
>> guess.
> 
> I see.
> 
> So should it be nvidia_vfio_pci.ko? And it will do the NVLINK stuff in 
> case the class code matches, and otherwise just work as a simple vfio_pci 
> GPU?

"nvidia_vfio_pci" would be too generic, sounds like it is for every 
nvidia on every platform. powernv_nvidia_vfio_pci.ko may be.

> What about the second driver? Should it be called ibmnpu_vfio_pci.ko?

This will do.


> 
>>
>>>
>>> Jason
>>>
>>

-- 
Alexey

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
  2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (8 preceding siblings ...)
  2021-02-01 16:28 ` [PATCH 9/9] vfio/pci: use powernv naming instead of nvlink2 Max Gurtovoy
@ 2021-02-10  7:52 ` Tian, Kevin
  2021-02-10 13:34   ` Jason Gunthorpe
  9 siblings, 1 reply; 57+ messages in thread
From: Tian, Kevin @ 2021-02-10  7:52 UTC (permalink / raw)
  To: Max Gurtovoy, jgg, cohuck, kvm, linux-kernel, alex.williamson
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik, Zhao,
	Yan Y

Hi, Max,

> From: Max Gurtovoy <mgurtovoy@nvidia.com>
> Sent: Tuesday, February 2, 2021 12:28 AM
> 
> Hi Alex and Cornelia,
> 
> This series split the vfio_pci driver into 2 parts: pci driver and a
> subsystem driver that will also be library of code. The pci driver,
> vfio_pci.ko will be used as before and it will bind to the subsystem
> driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
> if fully backward compatible. This is a typical Linux subsystem
> framework behaviour. This framework can be also adopted by vfio_mdev
> devices as we'll see in the below sketch.
> 
> This series is coming to solve the issues that were raised in the
> previous attempt for extending vfio-pci for vendor specific
> functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.
> 
> This solution is also deterministic in a sense that when a user will
> bind to a vendor specific vfio-pci driver, it will get all the special
> goodies of the HW.
> 
> This subsystem framework will also ease on adding vendor specific
> functionality to VFIO devices in the future by allowing another module
> to provide the pci_driver that can setup number of details before
> registering to VFIO subsystem (such as inject its own operations).

I'm a bit confused about the change from v1 to v2, especially about
how to inject module-specific operations. From a live migration p.o.v.
it may require at least two hook points for some devices (e.g. i40e 
in Yan's original example): register a migration region and intercept 
guest writes to specific registers. [PATCH 4/9] demonstrates the 
former but not the latter (which is allowed in v1). I saw some concerns 
about reporting struct vfio_pci_device outside of vfio-pci-core in v1
which should be the main reason driving this change. But I'm still 
curious to know how we plan to address this requirement (allowing 
vendor driver to tweak specific vfio_device_ops) moving forward.
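
For reference, the first of those hook points maps fairly naturally onto the region registration helper that patch 3/9 exports from vfio-pci-core; a rough sketch, with the mlx5-side names invented for illustration and the signature taken from today's vfio-pci code:

static int mlx5_vfio_pci_setup_migration(struct vfio_pci_device *vdev)
{
	/* "mig" regops and region size are hypothetical mlx5-side definitions */
	return vfio_pci_register_dev_region(vdev,
			VFIO_REGION_TYPE_MIGRATION,
			VFIO_REGION_SUBTYPE_MIGRATION,
			&mlx5_vfio_pci_mig_regops,	/* struct vfio_pci_regops */
			MLX5_VFIO_PCI_MIG_REGION_SIZE,
			VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
			NULL);
}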

Then another question. Once we have this framework in place, do we 
mandate this approach for any vendor specific tweak or still allow
doing it as vfio_pci_core extensions (such as igd and zdev in this series)? 
If the latter, what is the criteria to judge which way is desired? Also what 
about the scenarios where we just want one-time vendor information, 
e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
the offset in VF PCI config space to put PASID/ATS/PRI capabilities [2]?
Do we expect to create a module for each device to provide such info?
Having those questions answered is helpful for better understanding of
this proposal IMO. 😊

[1] https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fdd00@huawei.com/T/
[2] https://lore.kernel.org/kvm/20200407095801.648b1371@w520.home/

> 
> Below we can see the proposed changes (this patchset only deals with
> VFIO_PCI subsystem but it can easily be extended to VFIO_MDEV subsystem
> as well):
> 
> +------------------------------------------------------------------+
> |                                                                  |
> |                               VFIO                               |
> |                                                                  |
> +------------------------------------------------------------------+
> 
> +------------------------------+    +------------------------------+
> |                              |    |                              |
> |        VFIO_PCI_CORE         |    |         VFIO_MDEV_CORE       |
> |                              |    |                              |
> +------------------------------+    +------------------------------+
> 
> +--------------+ +-------------+    +-------------+ +--------------+
> |              | |             |    |             | |              |
> |              | |             |    |             | |              |
> |   VFIO_PCI   | |MLX5_VFIO_PCI|    |  VFIO_MDEV  | |MLX5_VFIO_MDEV|
> |              | |             |    |             | |              |
> |              | |             |    |             | |              |
> +--------------+ +-------------+    +-------------+ +--------------+
> 

The VFIO_PCI part is clear but I didn't get how this concept is applied
to VFIO_MDEV. the mdev sub-system looks like below in my mind:

+------------------------------------------------------------------+
|                                                                  |
|                               VFIO                               |
|                                                                  |
+------------------------------------------------------------------+

+------------------------------+    +------------------------------+
|                              |    |                              |
|        VFIO_PCI_CORE         |    |         VFIO_MDEV            |
|                              |    |                              |
+------------------------------+    +------------------------------+

+--------------+ +-------------+    +------------------------------+
|              | |             |    |                              |
|              | |             |    |                              |
|   VFIO_PCI   | |MLX5_VFIO_PCI|    |          MDEV CORE           |
|              | |             |    |                              |
|              | |             |    |                              |
+--------------+ +-------------+    +------------------------------+

                                    +------------------------------+
                                    |                              |
                                    |                              |
                                    |         MLX5-MDEV            |
                                    |                              |
                                    |                              |
                                    +------------------------------+
MDEV core is already a well defined subsystem to connect mdev
bus driver (vfio-mdev) and mdev device driver (mlx5-mdev).
vfio-mdev is just the channel to bring VFIO APIs through mdev core
to underlying vendor specific mdev device driver, which is already
granted flexibility to tweak whatever needs through mdev_parent_ops.
Then what exact extension is talked here by creating another subsystem
module? or are we talking about some general library which can be
shared by underlying mdev device drivers to reduce duplicated
emulation code?

> First 3 patches introduce the above changes for vfio_pci and
> vfio_pci_core.
> 
> Patch (4/9) introduces a new mlx5 vfio-pci module that registers to VFIO
> subsystem using vfio_pci_core. It also registers to Auxiliary bus for
> binding to mlx5_core that is the parent of mlx5-vfio-pci devices. This
> will allow extending mlx5-vfio-pci devices with HW specific features
> such as Live Migration (mlx5_core patches are not part of this series
> that comes for proposing the changes need for the vfio pci subsystem).
> 
> For our testing and development we used 2 VirtIO-BLK physical functions
> based on NVIDIAs Bluefield-2 SNAP technology. These 2 PCI functions were
> enumerated as 08:00.0 and 09:00.0 by the Hypervisor.
> 
> The Bluefield-2 device driver, mlx5_core, will create auxiliary devices
> for each VirtIO-BLK SNAP PF (the "parent"/"manager" of each VirtIO-BLK PF
> will actually issue auxiliary device creation).
> 
> These auxiliary devices will be seen on the Auxiliary bus as:
> mlx5_core.vfio_pci.2048 -
> > ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:0
> 7:00.0/mlx5_core.vfio_pci.2048
> mlx5_core.vfio_pci.2304 -
> > ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:0
> 7:00.1/mlx5_core.vfio_pci.2304
> 
> 2048 represents BDF 08:00.0 (parent is 0000:07:00.0 Bluefield-2 p0) and
> 2304 represents BDF 09:00.0 (parent is 0000:07:00.1 Bluefield-2 p1) in
> decimal view. In this manner, the administrator will be able to locate the
> correct vfio-pci module it should bind the desired BDF to (by finding
> the pointer to the module according to the Auxiliary driver of that BDF).
> 
> Note: The discovery mechanism we used for the RFC might need some
>       improvements as mentioned in the TODO list.
> 
> In this way, we'll use the HW vendor driver core to manage the lifecycle
> of these devices. This is reasonable since only the vendor driver knows
> exactly about the status of its internal state and the capabilities of
> its accelerators, for example.
> 
> changes from v1:
>  - Create a private and public vfio-pci structures (From Alex)
>  - register to vfio core directly from vfio-pci-core (for now, expose
>    minimal public functionality to vfio pci drivers. This will remove the
>    implicit behaviour from v1. More power to the drivers can be added in
>    the future)
>  - Add patch 3/9 to emphasize the needed extension for LM feature (From
>    Cornelia)
>  - take/release refcount for the pci module during core open/release
>  - update nvlink, igd and zdev to PowerNV, X86 and s390 extensions for
>    vfio-pci core
>  - fix segfault bugs in current vfio-pci zdev code
> 
> TODOs:
> 1. Create subsystem module for VFIO_MDEV. This can be used for vendor
>    specific scalable functions for example (SFs).
> 2. Add Live migration functionality for mlx5 SNAP devices
>    (NVMe/Virtio-BLK).
> 3. Add Live migration functionality for mlx5 VFs
> 4. Add the needed functionality for mlx5_core
> 5. Improve userspace and discovery
> 6. move VGA stuff to x86 extension
> 
> I would like to thank the great team that was involved in this
> development, design and internal review:
> Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy
> and others.
> 
> This series applies cleanly on top of kernel 5.11-rc5+ commit 13391c60da33:
> "Merge branch 'linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6"
> from Linus.
> 
> Note: Live migration for MLX5 SNAP devices is WIP and can be the first
>       example for adding vendor extension to vfio-pci devices. As the
>       changes to the subsystem must be defined as a pre-condition for
>       this work, we've decided to split the submission for now.
> 
> Max Gurtovoy (9):
>   vfio-pci: rename vfio_pci.c to vfio_pci_core.c
>   vfio-pci: introduce vfio_pci_core subsystem driver
>   vfio-pci-core: export vfio_pci_register_dev_region function
>   mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
>   vfio-pci/zdev: remove unused vdev argument
>   vfio-pci/zdev: fix possible segmentation fault issue
>   vfio/pci: use s390 naming instead of zdev
>   vfio/pci: use x86 naming instead of igd
>   vfio/pci: use powernv naming instead of nvlink2
> 
>  drivers/vfio/pci/Kconfig                      |   57 +-
>  drivers/vfio/pci/Makefile                     |   16 +-
>  drivers/vfio/pci/mlx5_vfio_pci.c              |  253 ++
>  drivers/vfio/pci/vfio_pci.c                   | 2384 +----------------
>  drivers/vfio/pci/vfio_pci_config.c            |   56 +-
>  drivers/vfio/pci/vfio_pci_core.c              | 2326 ++++++++++++++++
>  drivers/vfio/pci/vfio_pci_core.h              |   73 +
>  drivers/vfio/pci/vfio_pci_intrs.c             |   22 +-
>  ...{vfio_pci_nvlink2.c => vfio_pci_powernv.c} |   47 +-
>  drivers/vfio/pci/vfio_pci_private.h           |   44 +-
>  drivers/vfio/pci/vfio_pci_rdwr.c              |   14 +-
>  .../pci/{vfio_pci_zdev.c => vfio_pci_s390.c}  |   28 +-
>  .../pci/{vfio_pci_igd.c => vfio_pci_x86.c}    |   18 +-
>  include/linux/mlx5/vfio_pci.h                 |   36 +
>  14 files changed, 2916 insertions(+), 2458 deletions(-)
>  create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
>  create mode 100644 drivers/vfio/pci/vfio_pci_core.c
>  create mode 100644 drivers/vfio/pci/vfio_pci_core.h
>  rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} (89%)
>  rename drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} (80%)
>  rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (89%)
>  create mode 100644 include/linux/mlx5/vfio_pci.h
> 
> --
> 2.25.4

Thanks
Kevin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
  2021-02-10  7:52 ` [PATCH v2 0/9] Introduce vfio-pci-core subsystem Tian, Kevin
@ 2021-02-10 13:34   ` Jason Gunthorpe
  2021-02-10 16:37     ` Alex Williamson
  2021-02-11  8:36     ` Christoph Hellwig
  0 siblings, 2 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-10 13:34 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Max Gurtovoy, cohuck, kvm, linux-kernel, alex.williamson, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik, Zhao,
	Yan Y

On Wed, Feb 10, 2021 at 07:52:08AM +0000, Tian, Kevin wrote:
> > This subsystem framework will also ease on adding vendor specific
> > functionality to VFIO devices in the future by allowing another module
> > to provide the pci_driver that can setup number of details before
> > registering to VFIO subsystem (such as inject its own operations).
> 
> I'm a bit confused about the change from v1 to v2, especially about
> how to inject module specific operations. From live migration p.o.v
> it may require at least two hook points for some devices (e.g. i40e
> in Yan's original example):

IMHO, it was too soon to give up on putting the vfio_device_ops in the
final driver- we should try to define a reasonable public/private
split of vfio_pci_device as is the norm in the kernel. No reason we
can't achieve that.

>  register a migration region and intercept guest writes to specific
> registers. [PATCH 4/9] demonstrates the former but not the latter
> (which is allowed in v1).

And this is why the ROI of wrapping every vfio op in a PCI op just to
keep vfio_pci_device completely private is poor :(
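
For illustration only, a minimal sketch (not taken from the patchset) of how
a vendor driver built on vfio-pci-core might register such a migration region
through the vfio_pci_register_dev_region() helper exported in patch 3/9. The
mlx5_vfio_pci_mig_regops table and MLX5_MIG_REGION_SIZE are hypothetical
placeholders with their handlers omitted; only the helper and the VFIO
migration region uapi constants come from the series and the existing uapi:

static int mlx5_vfio_pci_add_mig_region(struct vfio_pci_device *vdev)
{
	/* hypothetical: expose a vendor-specific migration region */
	return vfio_pci_register_dev_region(vdev,
			VFIO_REGION_TYPE_MIGRATION,
			VFIO_REGION_SUBTYPE_MIGRATION,
			&mlx5_vfio_pci_mig_regops,	/* hypothetical vfio_pci_regops */
			MLX5_MIG_REGION_SIZE,		/* hypothetical region size */
			VFIO_REGION_INFO_FLAG_READ |
			VFIO_REGION_INFO_FLAG_WRITE,
			NULL);
}

The second hook point Kevin mentions, intercepting guest writes to specific
registers, would still need an additional hook into the config/BAR write
path, which the v2 split does not expose yet.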

> Then another question. Once we have this framework in place, do we 
> mandate this approach for any vendor specific tweak or still allow
> doing it as vfio_pci_core extensions (such as igd and zdev in this
> series)?

I would say no to any further vfio_pci_core extensions that are tied
to specific PCI devices. Things like zdev are platform features; they
are not tied to specific PCI devices.

> If the latter, what are the criteria to judge which way is desired? Also what 
> about the scenarios where we just want one-time vendor information, 
> e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
> the offset in VF PCI config space to put PASID/ATS/PRI capabilities [2]?
> Do we expect to create a module for each device to provide such info?
> Having those questions answered is helpful for better understanding of
> this proposal IMO. 😊
> 
> [1] https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fdd00@huawei.com/T/

SVA is a platform feature, so no problem. Don't see a vfio-pci change
in here?

> [2] https://lore.kernel.org/kvm/20200407095801.648b1371@w520.home/

This one could have been done as a broadcom_vfio_pci driver. Not sure
exposing the entire config space unprotected is safe, hard to know
what the device has put in there, and if it is secure to share with a
guest..

> MDEV core is already a well defined subsystem to connect mdev
> bus driver (vfio-mdev) and mdev device driver (mlx5-mdev).

mdev is two things

 - a driver core bus layer and sysfs that makes a lifetime model
 - a vfio bus driver that doesn't do anything but forward ops to the
   main ops

> vfio-mdev is just the channel to bring VFIO APIs through mdev core
> to the underlying vendor-specific mdev device driver, which is already
> granted flexibility to tweak whatever it needs through mdev_parent_ops.

This is the second thing, and it could just be deleted. The actual
final mdev driver can just use vfio_device_ops directly. The
redirection shim in vfio_mdev.c doesn't add value.

> Then what exact extension is talked here by creating another subsystem
> module? or are we talking about some general library which can be
> shared by underlying mdev device drivers to reduce duplicated
> emulation code?

IMHO it is more a design philosophy that the end driver should
implement the vfio_device_ops directly vs having a stack of ops
structs.
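
As a minimal sketch of that philosophy (assuming the 5.11-era vfio_device_ops
with void *device_data callbacks; every my_mdev_* name here is hypothetical),
the final driver would fill in vfio_device_ops itself rather than routing
each callback through mdev_parent_ops:

static int my_mdev_open(void *device_data)
{
	/* hypothetical: take module/device references, init state */
	return 0;
}

static void my_mdev_release(void *device_data)
{
	/* hypothetical: tear down whatever open() set up */
}

static long my_mdev_ioctl(void *device_data, unsigned int cmd,
			  unsigned long arg)
{
	/* hypothetical: handle VFIO_DEVICE_* ioctls directly */
	return -ENOTTY;
}

static const struct vfio_device_ops my_mdev_ops = {
	.name		= "my-mdev",
	.open		= my_mdev_open,
	.release	= my_mdev_release,
	.ioctl		= my_mdev_ioctl,
	/* .read/.write/.mmap filled in the same direct way */
};

The driver would then hand this table to vfio_add_group_dev() for its device,
with no forwarding shim in between.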

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
  2021-02-10 13:34   ` Jason Gunthorpe
@ 2021-02-10 16:37     ` Alex Williamson
  2021-02-10 17:08       ` Jason Gunthorpe
  2021-02-11  8:36     ` Christoph Hellwig
  1 sibling, 1 reply; 57+ messages in thread
From: Alex Williamson @ 2021-02-10 16:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Max Gurtovoy, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik, Zhao,
	Yan Y

On Wed, 10 Feb 2021 09:34:52 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Feb 10, 2021 at 07:52:08AM +0000, Tian, Kevin wrote:
> > > This subsystem framework will also ease on adding vendor specific
> > > functionality to VFIO devices in the future by allowing another module
> > > to provide the pci_driver that can setup number of details before
> > > registering to VFIO subsystem (such as inject its own operations).  
> > 
> > I'm a bit confused about the change from v1 to v2, especially about
> > how to inject module specific operations. From live migration p.o.v
> > it may require at least two hook points for some devices (e.g. i40e
> > in Yan's original example):  
> 
> IMHO, it was too soon to give up on putting the vfio_device_ops in the
> final driver- we should try to define a reasonable public/private
> split of vfio_pci_device as is the norm in the kernel. No reason we
> can't achieve that.
> 
> >  register a migration region and intercept guest writes to specific
> > registers. [PATCH 4/9] demonstrates the former but not the latter
> > (which is allowed in v1).  
> 
> And this is why the ROI of wrapping every vfio op in a PCI op just to
> keep vfio_pci_device completely private is poor :(

Says someone who doesn't need to maintain the core, fixing bugs and
adding features, while not breaking vendor driver touching private data
in unexpected ways ;)

> > Then another question. Once we have this framework in place, do we 
> > mandate this approach for any vendor specific tweak or still allow
> > doing it as vfio_pci_core extensions (such as igd and zdev in this
> > series)?  
> 
> I would say no to any further vfio_pci_core extensions that are tied
> to specific PCI devices. Things like zdev are platform features; they
> are not tied to specific PCI devices.
> 
> > If the latter, what are the criteria to judge which way is desired? Also what 
> > about the scenarios where we just want one-time vendor information, 
> > e.g. to tell whether a device can tolerate arbitrary I/O page faults [1] or
> > the offset in VF PCI config space to put PASID/ATS/PRI capabilities [2]?
> > Do we expect to create a module for each device to provide such info?
> > Having those questions answered is helpful for better understanding of
> > this proposal IMO. 😊
> > 
> > [1] https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fdd00@huawei.com/T/  
> 
> SVA is a platform feature, so no problem. Don't see a vfio-pci change
> in here?
> 
> > [2] https://lore.kernel.org/kvm/20200407095801.648b1371@w520.home/  
> 
> This one could have been done as a broadcom_vfio_pci driver. Not sure
> exposing the entire config space unprotected is safe, hard to know
> what the device has put in there, and if it is secure to share with a
> guest..

The same argument could be made about the whole device, not just config
space.  In fact we know that devices like GPUs provide access to config
space through other address spaces, I/O port and MMIO BARs.  Config
space is only the architected means to manipulate the link interface
and device features, we can't determine if it's the only means.  Any
emulation or access restrictions we put in config space are meant to
make the device work better for the user (ex. re-using host kernel
device quirks) or prevent casual misconfiguration (ex. incompatible mps
settings, BAR registers), the safety of assigning a device to a user
is only as good as the isolation and error handling that the platform
allows.

 
> > MDEV core is already a well defined subsystem to connect mdev
> > bus driver (vfio-mdev) and mdev device driver (mlx5-mdev).  
> 
> mdev is two things
> 
>  - a driver core bus layer and sysfs that makes a lifetime model
>  - a vfio bus driver that doesn't do anything but forward ops to the
>    main ops
> 
> > vfio-mdev is just the channel to bring VFIO APIs through mdev core
> > to the underlying vendor-specific mdev device driver, which is already
> > granted flexibility to tweak whatever it needs through mdev_parent_ops.  
> 
> This is the second thing, and it could just be deleted. The actual
> final mdev driver can just use vfio_device_ops directly. The
> redirection shim in vfio_mdev.c doesn't add value.
> 
> > Then what exact extension is talked here by creating another subsystem
> > module? or are we talking about some general library which can be
> > shared by underlying mdev device drivers to reduce duplicated
> > emulation code?  
> 
> IMHO it is more a design philosophy that the end driver should
> implement the vfio_device_ops directly vs having a stack of ops
> structs.

And that's where turning vfio-pci into a vfio-pci-core library comes
in, lowering the bar for a vendor driver to re-use what vfio-pci has
already done.  This doesn't change the basic vfio architecture that
already allows any vendor driver to register with its own
vfio_device_ops, it's just code-reuse, which only makes sense if we're
not just shifting the support burden from the vendor driver to the new
library-ized core.  Like Kevin though, I don't really understand the
hand-wave application to mdev.  Sure, vfio-mdev could be collapsed now
that we've rejected that there could be other drivers binding to mdev
devices, but mdev-core provides a common lifecycle management within a
class of devices, so if you subscribe to mdev, your vendor driver is
providing essentially vfio_device_ops plus the mdev extra callbacks.
If your vendor driver doesn't subscribe to that lifecycle, then just
implement a vfio bus driver with your own vfio_device_ops.  The nature
of that mdev lifecycle doesn't seem to mesh well with a library that
expects to manage a whole physical PCI device though.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
  2021-02-10 16:37     ` Alex Williamson
@ 2021-02-10 17:08       ` Jason Gunthorpe
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-10 17:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Max Gurtovoy, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, gmataev, cjia, mjrosato, yishaih, aik, Zhao,
	Yan Y

On Wed, Feb 10, 2021 at 09:37:46AM -0700, Alex Williamson wrote:

> > >  register a migration region and intercept guest writes to specific
> > > registers. [PATCH 4/9] demonstrates the former but not the latter
> > > (which is allowed in v1).  
> > 
> > And this is why the ROI of wrapping every vfio op in a PCI op just to
> > keep vfio_pci_device completely private is poor :(
> 
> Says someone who doesn't need to maintain the core, fixing bugs and
> adding features, while not breaking vendor driver touching private data
> in unexpected ways ;)

Said as someone who maintains a driver subsystem 25x larger than VFIO
and is really experienced in "crazy things drivers do". :)

Private/public is rarely at the top of my worries, and I'm confident
in saying this is the general kernel philosophy. Again, look anywhere, we
rarely have private data split out of major structs like struct
device, struct netdev, struct pci_device, etc. This data has to be
public because we are using C and we expect inline functions,
container_of() and so on to work. It is rarely done with hidden
structs.
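
The pattern being referred to, roughly (a sketch with illustrative names;
struct vfio_pci_device is the existing struct under discussion, everything
else here is hypothetical): the public core struct lives in a shared header
and the vendor driver embeds it in its own private struct, recovering it
with container_of():

/* hypothetical vendor driver private state wrapping the public core struct */
struct my_vendor_vfio_pci_device {
	struct vfio_pci_device	core;			/* public core state */
	void			*migration_state;	/* driver-private */
};

static inline struct my_vendor_vfio_pci_device *
to_my_vendor_dev(struct vfio_pci_device *core)
{
	return container_of(core, struct my_vendor_vfio_pci_device, core);
}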

If we can get private data in some places it is a nice win, but not
worth making a mess to achieve. eg I would not give up the normal
container_of pattern just to obscure some struct, the overall ROI is
bad.

Drivers always do unexpected and crazy things, I wouldn't get fixated
on touching private data as the worst sin a driver can do :(

So, no, I don't agree that exposing a struct vfio_pci_device is the
end of the world - it is normal in the kernel to do this kind of
thing, and yes drivers can do crazy things with that if crazy slips
past the review process.

Honestly I expect people to test their drivers and fix things if a
core change broke them. It happens, QA finds it, it gets fixed, normal
stuff for Linux, IMHO.

> > > Then what exact extension is talked here by creating another subsystem
> > > module? or are we talking about some general library which can be
> > > shared by underlying mdev device drivers to reduce duplicated
> > > emulation code?  
> > 
> > IMHO it is more a design philosophy that the end driver should
> > implement the vfio_device_ops directly vs having a stack of ops
> > structs.

> Like Kevin though, I don't really understand the hand-wave
> application to mdev.  Sure, vfio-mdev could be collapsed now that
> we've rejected that there could be other drivers binding to mdev
> devices,

Again, I think the point Max was trying to make is only that vfio_mdev
can follow the same design as proposed here and replace the
mdev_parent_ops with the vfio_device_ops.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem
  2021-02-10 13:34   ` Jason Gunthorpe
  2021-02-10 16:37     ` Alex Williamson
@ 2021-02-11  8:36     ` Christoph Hellwig
  1 sibling, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2021-02-11  8:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Max Gurtovoy, cohuck, kvm, linux-kernel,
	alex.williamson, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, mjrosato,
	yishaih, aik, Zhao, Yan Y

On Wed, Feb 10, 2021 at 09:34:52AM -0400, Jason Gunthorpe wrote:
> > I'm a bit confused about the change from v1 to v2, especially about
> > how to inject module specific operations. From live migration p.o.v
> > it may require at least two hook points for some devices (e.g. i40e
> > in Yan's original example):
> 
> IMHO, it was too soon to give up on putting the vfio_device_ops in the
> final driver- we should try to define a reasonable public/private
> split of vfio_pci_device as is the norm in the kernel. No reason we
> can't achieve that.
> 
> >  register a migration region and intercept guest writes to specific
> > registers. [PATCH 4/9] demonstrates the former but not the latter
> > (which is allowed in v1).
> 
> And this is why the ROI of wrapping every vfio op in a PCI op just to
> keep vfio_pci_device completely private is poor :(

Yes.  If Alex has a strong preference to keep some values private
a split between vfio_pci_device vfio_pci_device_priv might be doable,
but it is somewhat silly.

> > Then another question. Once we have this framework in place, do we 
> > mandate this approach for any vendor specific tweak or still allow
> > doing it as vfio_pci_core extensions (such as igd and zdev in this
> > series)?
> 
> I would say no to any further vfio_pci_core extensions that are tied
> to specific PCI devices. Things like zdev are platform features; they
> are not tied to specific PCI devices.

Yes, ZDEV is just a special case of exposing extra information for any
PCI device on s390.  It does not fit any split-up vfio_pci framework.
In fact I wonder why it even has its own config option.

> > vfio-mdev is just the channel to bring VFIO APIs through mdev core
> > to the underlying vendor-specific mdev device driver, which is already
> > granted flexibility to tweak whatever it needs through mdev_parent_ops.
> 
> This is the second thing, and it could just be deleted. The actual
> final mdev driver can just use vfio_device_ops directly. The
> redirection shim in vfio_mdev.c doesn't add value.

Yes, that would simplify a lot of things.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 23:59                           ` Alex Williamson
  2021-02-03 13:54                             ` Jason Gunthorpe
@ 2021-02-11  8:44                             ` Christoph Hellwig
  2021-02-11 19:43                               ` Alex Williamson
  1 sibling, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2021-02-11  8:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Max Gurtovoy, Cornelia Huck, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Tue, Feb 02, 2021 at 04:59:23PM -0700, Alex Williamson wrote:
> vfio-pci-igd support knows very little about the device, we're
> effectively just exposing a firmware table and some of the host bridge
> config space (read-only).  So the idea that the host kernel needs to
> have updated i915 support in order to expose the device to userspace
> with these extra regions is a bit silly.

On the other hand assuming the IGD scheme works for every device
with an Intel Vendor ID and a VGA classcode that hangs off an Intel
host bridge seems highly dangerous.  Is this actually going to work
for the new discrete Intel graphics?  For the old i740?  And if not
what is the failure scenario?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-03 13:54                             ` Jason Gunthorpe
@ 2021-02-11  8:47                               ` Christoph Hellwig
  2021-02-11 14:30                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2021-02-11  8:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Max Gurtovoy, Cornelia Huck, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Wed, Feb 03, 2021 at 09:54:48AM -0400, Jason Gunthorpe wrote:
> If people are accepting that these device-specific drivers are
> required then we need to come to a community consensus to decide what
> direction to pursue:
> 
> * Do we embrace the driver core and use it to load VFIO modules like a
>   normal subsytem (this RFC)
> 
> OR 
> 
> * Do we make a driver-core like thing inside the VFIO bus drivers and
>   have them run their own special driver matching, binding, and loading
>   scheme. (May RFC)
> 
> Haven't heard a 3rd option yet..

The third option would be to use the driver core to bind the VFIO
submodules.  Define a new bus for it, which also uses the normal PCI IDs
for binding, and walk through those VFIO specific drivers when vfio_pci
is bound to a device.  That would provide a pretty clean abstraction
and could even keep the existing behavior of say bind to all VGA devices
with an Intel vendor ID (even if I think that is a bad idea).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-04  9:12               ` Max Gurtovoy
@ 2021-02-11  8:50                 ` Christoph Hellwig
  2021-02-11 14:49                   ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2021-02-11  8:50 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Alexey Kardashevskiy, Cornelia Huck, Alex Williamson,
	Matthew Rosato, jgg, kvm, linux-kernel, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	gmataev, cjia, yishaih

On Thu, Feb 04, 2021 at 11:12:49AM +0200, Max Gurtovoy wrote:
> But is the PCI function (the bound BDF) a GPU function or an NVLINK function?
> 
> If it's an NVLINK function, then we should fail probing in the host vfio-pci
> driver.
> 
> If it's a GPU function, then it shouldn't be called an nvlink2 vfio-pci driver;
> it's just an extension in the GPU vfio-pci driver.

I suspect the trivial and correct answer is that we should just drop the
driver entirely.  It is for obsolete hardware that never had upstream
support for even using it in the guest.  It also is the reason for
keeping cruft in the always built-in powernv platform code alive that
is otherwise dead wood.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-11  8:47                               ` Christoph Hellwig
@ 2021-02-11 14:30                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-11 14:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Max Gurtovoy, Cornelia Huck, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Thu, Feb 11, 2021 at 08:47:03AM +0000, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 09:54:48AM -0400, Jason Gunthorpe wrote:
> > If people are accepting that these device-specific drivers are
> > required then we need to come to a community consensus to decide what
> > direction to pursue:
> > 
> > * Do we embrace the driver core and use it to load VFIO modules like a
> >   normal subsytem (this RFC)
> > 
> > OR 
> > 
> > * Do we make a driver-core like thing inside the VFIO bus drivers and
> >   have them run their own special driver matching, binding, and loading
> >   scheme. (May RFC)
> > 
> > Haven't heard a 3rd option yet..
> 
> The third option would be to use the driver core to bind the VFIO
> submodules.  Define a new bus for it, which also uses the normal PCI IDs
> for binding, and walk through those VFIO specific drivers when vfio_pci
> is bound to a device.  That would provide a pretty clean abstraction
> and could even keep the existing behavior of say bind to all VGA devices
> with an Intel vendor ID (even if I think that is a bad idea).

I think of this as some variant to the second option above.

Maximally using the driver core to make subdrivers still means the
VFIO side needs to reimplement all the existing matcher logic for PCI
(and it means we don't generalize, so future CXL/etc. will need more
VFIO special code). It has to put this stuff on a new special bus and
somehow make names and match tables for it.

It also means we'd have to somehow fix vfio-pci to allow hot
plug/unplug of the subdriver. The driver core doesn't really run
synchronously for binding, so late binding would have to be
accommodated somehow. It feels like adding a lot of complexity for
very little gain to me.

Personally I dislike the subdriver direction of the May RFC quite a
bit; it has a lot of unfixable negatives for the admin side. The first
option does present some challenges for userspace, but I believe we can
work through them.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-11  8:50                 ` Christoph Hellwig
@ 2021-02-11 14:49                   ` Jason Gunthorpe
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2021-02-11 14:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Max Gurtovoy, Alexey Kardashevskiy, Cornelia Huck,
	Alex Williamson, Matthew Rosato, kvm, linux-kernel, liranl, oren,
	tzahio, leonro, yarong, aviadye, shahafs, artemp, kwankhede,
	ACurrid, gmataev, cjia, yishaih

On Thu, Feb 11, 2021 at 08:50:21AM +0000, Christoph Hellwig wrote:
> On Thu, Feb 04, 2021 at 11:12:49AM +0200, Max Gurtovoy wrote:
> > But is the PCI function (the bound BDF) a GPU function or an NVLINK function?
> > 
> > If it's an NVLINK function, then we should fail probing in the host vfio-pci
> > driver.
> > 
> > If it's a GPU function, then it shouldn't be called an nvlink2 vfio-pci driver;
> > it's just an extension in the GPU vfio-pci driver.
> 
> I suspect the trivial and correct answer is that we should just drop
> the driver entirely.  It is for obsolete hardware that never had
> upstream

The HW is still in active deployment and use. The big system, Summit,
only went to production sometime in 2019, so it is barely started on
its lifecycle. Something around a 5-10 year operational lifetime would
be pretty typical in this area.

> support for even using it in the guest.  It also is the reason for
> keeping cruft in the always built-in powernv platform code alive that
> is otherwise dead wood.

Or stated another way, once vfio-pci supports loadable extensions the
non-upstream hardware could provide the extension it needs out of
tree.

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-02 17:10           ` Jason Gunthorpe
@ 2021-02-11 15:47             ` Max Gurtovoy
  2021-02-11 16:29               ` Matthew Rosato
  0 siblings, 1 reply; 57+ messages in thread
From: Max Gurtovoy @ 2021-02-11 15:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Cornelia Huck
  Cc: Alex Williamson, Matthew Rosato, kvm, linux-kernel, liranl, oren,
	tzahio, leonro, yarong, aviadye, shahafs, artemp, kwankhede,
	ACurrid, gmataev, cjia, yishaih, aik


On 2/2/2021 7:10 PM, Jason Gunthorpe wrote:
> On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:
>
>> On the other side, we have the zdev support, which both requires s390
>> and applies to any pci device on s390.
> Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
> return the s390 specific data in VFIO_DEVICE_GET_INFO if running on
> s390?
>
> It would be like returning data from ACPI on other platforms.

Agree.

Do we all agree that I should remove it?

We already have a check in the code:

if (ret && ret != -ENODEV) {
	pci_warn(vdev->vpdev.pdev,
		 "Failed to setup zPCI info capabilities\n");
	return ret;
}

so in case it's not a zdev we should get -ENODEV and continue in the normal flow.

>
> It really seems like part of vfio-pci-core
>
> Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-11 15:47             ` Max Gurtovoy
@ 2021-02-11 16:29               ` Matthew Rosato
  2021-02-11 17:39                 ` Cornelia Huck
  0 siblings, 1 reply; 57+ messages in thread
From: Matthew Rosato @ 2021-02-11 16:29 UTC (permalink / raw)
  To: Max Gurtovoy, Jason Gunthorpe, Cornelia Huck
  Cc: Alex Williamson, kvm, linux-kernel, liranl, oren, tzahio, leonro,
	yarong, aviadye, shahafs, artemp, kwankhede, ACurrid, gmataev,
	cjia, yishaih, aik

On 2/11/21 10:47 AM, Max Gurtovoy wrote:
> 
> On 2/2/2021 7:10 PM, Jason Gunthorpe wrote:
>> On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:
>>
>>> On the other side, we have the zdev support, which both requires s390
>>> and applies to any pci device on s390.
>> Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
>> return the s390 specific data in VFIO_DEVICE_GET_INFO if running on
>> s390?
>>
>> It would be like returning data from ACPI on other platforms.
> 
> Agree.
> 
> Do we all agree that I should remove it?

I did some digging in the archives on the discussions around
CONFIG_VFIO_PCI_ZDEV and whether we should/should not have a Kconfig
switch around this; it was something that was carried over various
attempts to get the zdev support upstream, but I can't really find (or
think of) a compelling reason that a Kconfig switch must be kept for it.
The bottom line is that if you're on s390, you really want zdev support.

So: I don't have an objection so long as the net result is that
vfio_pci_zdev.o is always built into vfio-pci(-core) for s390.

> 
> We already have a check in the code:
> 
> if (ret && ret != -ENODEV) {
> 	pci_warn(vdev->vpdev.pdev,
> 		 "Failed to setup zPCI info capabilities\n");
> 	return ret;
> }
> 
> so in case it's not a zdev we should get -ENODEV and continue in the
> normal flow.
> 
>>
>> It really seems like part of vfio-pci-core
>>
>> Jason


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-11 16:29               ` Matthew Rosato
@ 2021-02-11 17:39                 ` Cornelia Huck
  0 siblings, 0 replies; 57+ messages in thread
From: Cornelia Huck @ 2021-02-11 17:39 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Max Gurtovoy, Jason Gunthorpe, Alex Williamson, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Thu, 11 Feb 2021 11:29:37 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 2/11/21 10:47 AM, Max Gurtovoy wrote:
> > 
> > On 2/2/2021 7:10 PM, Jason Gunthorpe wrote:  
> >> On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:
> >>  
> >>> On the other side, we have the zdev support, which both requires s390
> >>> and applies to any pci device on s390.  
> >> Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
> >> return the s390 specific data in VFIO_DEVICE_GET_INFO if running on
> >> s390?
> >>
> >> It would be like returning data from ACPI on other platforms.  
> > 
> > Agree.
> > 
> > Do we all agree that I should remove it?
> 
> I did some archives digging on the discussions around 
> CONFIG_VFIO_PCI_ZDEV and whether we should/should not have a Kconfig 
> switch around this; it was something that was carried over various 
> attempts to get the zdev support upstream, but I can't really find (or 
> think of) a compelling reason that a Kconfig switch must be kept for it. 
>   The bottom line is if you're on s390, you really want zdev support.
> 
> So: I don't have an objection so long as the net result is that 
> vfio_pci_zdev.o is always built in to vfio-pci(-core) for s390.

Yes, I also don't expect the presence of the zdev stuff to confuse any
older userspace.

So, let's just drop CONFIG_VFIO_PCI_ZDEV and use CONFIG_S390 in lieu of
it (not changing the file name).
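
A sketch of what that could look like (assuming the existing
vfio_pci_info_zdev_add_caps() entry point; the stub mirrors the -ENODEV
handling quoted earlier in this thread):

/* vfio_pci_private.h, hypothetical form after dropping CONFIG_VFIO_PCI_ZDEV */
#ifdef CONFIG_S390
extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
				       struct vfio_info_cap *caps);
#else
static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
					      struct vfio_info_cap *caps)
{
	return -ENODEV;
}
#endif

with vfio_pci_zdev.o compiled only when CONFIG_S390 is set in the Makefile.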


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
  2021-02-11  8:44                             ` Christoph Hellwig
@ 2021-02-11 19:43                               ` Alex Williamson
  0 siblings, 0 replies; 57+ messages in thread
From: Alex Williamson @ 2021-02-11 19:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Max Gurtovoy, Cornelia Huck, Matthew Rosato,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, gmataev, cjia, yishaih, aik

On Thu, 11 Feb 2021 08:44:26 +0000
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Feb 02, 2021 at 04:59:23PM -0700, Alex Williamson wrote:
> > vfio-pci-igd support knows very little about the device, we're
> > effectively just exposing a firmware table and some of the host bridge
> > config space (read-only).  So the idea that the host kernel needs to
> > have updated i915 support in order to expose the device to userspace
> > with these extra regions is a bit silly.  
> 
> On the other hand assuming the IGD scheme works for every device
> with an Intel Vendor ID and a VGA classcode that hangs off an Intel
> host bridge seems highly dangerous.  Is this actually going to work
> for the new discrete Intel graphics?  For the old i740?  And if not
> what is the failure scenario?

The failure scenario is that we expose read-only copies of the OpRegion
firmware table and host and lpc bridge config space to userspace.  Not
exactly dangerous.  For discrete graphics we'd simply fail the device
probe if the target device isn't on the root bus.  This would cover the
old i740 as well, assuming you're seriously concerned about someone
plugging in a predominantly AGP graphics card from 20+ years ago into a
modern system and trying to assign it to a guest.  Thanks,

Alex
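
A sketch of that probe-time guard (the function name and its placement are
hypothetical; pci_is_root_bus() is the existing PCI core helper):

static int vfio_pci_igd_check(struct pci_dev *pdev)
{
	/* IGD-style regions only apply to the integrated GPU on the root bus */
	if (!pci_is_root_bus(pdev->bus))
		return -ENODEV;	/* discrete/i740-style device: no extra regions */
	return 0;
}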


^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2021-02-11 19:45 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-01 16:28 [PATCH v2 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
2021-02-01 16:28 ` [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c Max Gurtovoy
2021-02-01 16:28 ` [PATCH 2/9] vfio-pci: introduce vfio_pci_core subsystem driver Max Gurtovoy
2021-02-01 16:28 ` [PATCH 3/9] vfio-pci-core: export vfio_pci_register_dev_region function Max Gurtovoy
2021-02-01 16:28 ` [PATCH 4/9] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices Max Gurtovoy
2021-02-01 16:28 ` [PATCH 5/9] vfio-pci/zdev: remove unused vdev argument Max Gurtovoy
2021-02-01 17:27   ` Matthew Rosato
2021-02-02  7:57   ` Cornelia Huck
2021-02-02 17:21     ` Alex Williamson
2021-02-01 16:28 ` [PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue Max Gurtovoy
2021-02-01 16:52   ` Cornelia Huck
2021-02-01 17:08     ` Matthew Rosato
2021-02-01 20:47       ` Alex Williamson
2021-02-02  7:58         ` Cornelia Huck
2021-02-01 16:28 ` [PATCH 7/9] vfio/pci: use s390 naming instead of zdev Max Gurtovoy
2021-02-01 16:28 ` [PATCH 8/9] vfio/pci: use x86 naming instead of igd Max Gurtovoy
2021-02-01 17:14   ` Cornelia Huck
2021-02-01 17:49     ` Matthew Rosato
2021-02-01 18:42       ` Alex Williamson
2021-02-02 16:06         ` Cornelia Huck
2021-02-02 17:10           ` Jason Gunthorpe
2021-02-11 15:47             ` Max Gurtovoy
2021-02-11 16:29               ` Matthew Rosato
2021-02-11 17:39                 ` Cornelia Huck
2021-02-02 17:41           ` Max Gurtovoy
2021-02-02 17:54             ` Alex Williamson
2021-02-02 18:50               ` Jason Gunthorpe
2021-02-02 18:55                 ` Christoph Hellwig
2021-02-02 19:05                   ` Jason Gunthorpe
2021-02-02 19:37                 ` Alex Williamson
2021-02-02 20:44                   ` Jason Gunthorpe
2021-02-02 20:59                     ` Max Gurtovoy
2021-02-02 21:30                       ` Alex Williamson
2021-02-02 23:06                         ` Jason Gunthorpe
2021-02-02 23:59                           ` Alex Williamson
2021-02-03 13:54                             ` Jason Gunthorpe
2021-02-11  8:47                               ` Christoph Hellwig
2021-02-11 14:30                                 ` Jason Gunthorpe
2021-02-11  8:44                             ` Christoph Hellwig
2021-02-11 19:43                               ` Alex Williamson
     [not found]             ` <806c138e-685c-0955-7c15-93cb1d4fe0d9@ozlabs.ru>
2021-02-03 16:07               ` Max Gurtovoy
     [not found]                 ` <83ef0164-6291-c3d1-0ce5-2c9d6c97469e@ozlabs.ru>
2021-02-04 12:51                   ` Jason Gunthorpe
2021-02-05  0:42                     ` Alexey Kardashevskiy
2021-02-08 12:44                       ` Max Gurtovoy
2021-02-09  1:55                         ` Alexey Kardashevskiy
2021-02-08 18:13                       ` Jason Gunthorpe
2021-02-09  1:51                         ` Alexey Kardashevskiy
2021-02-04  9:12               ` Max Gurtovoy
2021-02-11  8:50                 ` Christoph Hellwig
2021-02-11 14:49                   ` Jason Gunthorpe
2021-02-01 16:28 ` [PATCH 9/9] vfio/pci: use powernv naming instead of nvlink2 Max Gurtovoy
2021-02-01 18:35   ` Jason Gunthorpe
2021-02-10  7:52 ` [PATCH v2 0/9] Introduce vfio-pci-core subsystem Tian, Kevin
2021-02-10 13:34   ` Jason Gunthorpe
2021-02-10 16:37     ` Alex Williamson
2021-02-10 17:08       ` Jason Gunthorpe
2021-02-11  8:36     ` Christoph Hellwig
