* [PATCH v3 0/9] Introduce vfio-pci-core subsystem
@ 2021-03-09  8:33 Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c Max Gurtovoy
                   ` (8 more replies)
  0 siblings, 9 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

Hi Alex and Cornelia,

This series splits the vfio_pci driver into two parts: PCI drivers and a
subsystem driver that will also be a library of code. The main PCI driver,
vfio_pci.ko, will be used as before and will bind to the subsystem
driver vfio_pci_core.ko to register with the VFIO subsystem.
The new vendor vfio-pci drivers introduced in this series are:
- igd_vfio_pci.ko
- nvlink2gpu_vfio_pci.ko
- npu2_vfio_pci.ko

These drivers will also bind to the subsystem driver vfio_pci_core.ko to
register with the VFIO subsystem, and each will add vendor-specific
extensions that are relevant only to the devices it drives. This is
typical Linux subsystem framework behaviour.

This series addresses some of the issues that were raised in the
previous attempt to extend vfio-pci with vendor-specific
functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.

This solution is also deterministic in the sense that when a user binds a
device to a vendor-specific vfio-pci driver, it gets all the special
capabilities of the HW. Non-common code is pushed only into the specific
vfio_pci driver.
 
This subsystem framework will also ease adding new vendor-specific
functionality to VFIO devices in the future by allowing another module
to provide the pci_driver, which can set up a number of details before
registering with the VFIO subsystem (such as injecting its own operations).
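
For illustration, a vendor module layered this way could look roughly like
the sketch below. This is not taken from the patches themselves: the helpers
vfio_pci_core_register_device()/vfio_pci_core_unregister_device() are
placeholder names for whatever interface vfio_pci_core.ko ends up exporting,
and the PCI IDs are made up.

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/slab.h>

#include "vfio_pci_core.h"

struct my_vfio_pci_device {
	struct vfio_pci_core_device	core;	/* shared state managed by vfio_pci_core.ko */
	/* vendor-specific state would follow here */
};

static int my_vfio_pci_probe(struct pci_dev *pdev,
			     const struct pci_device_id *id)
{
	struct my_vfio_pci_device *vdev;
	int ret;

	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
	if (!vdev)
		return -ENOMEM;

	/* placeholder: hand the embedded core device over to vfio_pci_core */
	ret = vfio_pci_core_register_device(&vdev->core, pdev);
	if (ret)
		kfree(vdev);
	return ret;
}

static void my_vfio_pci_remove(struct pci_dev *pdev)
{
	/* placeholder: assumes the core stored itself in drvdata at probe time */
	struct vfio_pci_core_device *core = pci_get_drvdata(pdev);
	struct my_vfio_pci_device *vdev =
		container_of(core, struct my_vfio_pci_device, core);

	vfio_pci_core_unregister_device(core);
	kfree(vdev);
}

static const struct pci_device_id my_vfio_pci_table[] = {
	{ PCI_DEVICE(0x1234, 0x5678) },	/* made-up vendor/device IDs */
	{ }
};

static struct pci_driver my_vfio_pci_driver = {
	.name		= "my-vfio-pci",
	.id_table	= my_vfio_pci_table,
	.probe		= my_vfio_pci_probe,
	.remove		= my_vfio_pci_remove,
};
module_pci_driver(my_vfio_pci_driver);
MODULE_LICENSE("GPL");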

Below we can see the proposed changes:

+-------------------------------------------------------------------+
|                                                                   |
|                               VFIO                                |
|                                                                   |
+-------------------------------------------------------------------+
                                                             
+-------------------------------------------------------------------+
|                                                                   |
|                           VFIO_PCI_CORE                           |
|                                                                   |
+-------------------------------------------------------------------+

+--------------+ +------------+ +-------------+ +-------------------+
|              | |            | |             | |                   |
|              | |            | |             | |                   |
|   VFIO_PCI   | |IGD_VFIO_PCI| |NPU2_VFIO_PCI| |NVLINK2GPU_VFIO_PCI|
|              | |            | |             | |                   |
|              | |            | |             | |                   |
+--------------+ +------------+ +-------------+ +-------------------+

Patches (1/9) - (4/9) introduce the above changes for vfio_pci and
vfio_pci_core.

Patches (6/9) and (7/9) are preparation for adding the nvlink2-related
vfio-pci drivers.

Patch (8/9) introduces the new npu2_vfio_pci and nvlink2gpu_vfio_pci drivers
that will drive the NVLINK2-capable devices that exist today (IBM's emulated
PCI management device and NVIDIA's NVLINK2-capable GPUs). These drivers add
vendor-specific functionality related to the IBM NPU2 unit, which is an
NVLink2 host bus adapter, and to NVLINK2-capable NVIDIA GPUs.

Patch (9/9) introduces the new igd_vfio_pci driver, which adds special
extensions for Intel Graphics cards (GVT-d).

All 3 new vendor-specific vfio_pci drivers can be extended easily, and new
vendor-specific functionality can be added to the relevant vendor
driver. If a vendor-specific vfio_pci driver exists for the device that a
user would like to use, that driver should be bound to the device.
Otherwise, vfio_pci.ko should be bound to it (backward compatibility is
also supported).

This framework will allow adding more HW-specific features, such as Live
Migration, in the future by extending existing vendor vfio_pci drivers or
creating new ones (e.g. mlx5_vfio_pci.ko to drive live migration for mlx5
PCI devices, and mlx5_snap_vfio_pci.ko to drive live migration for
mlx5_snap devices such as mlx5 NVMe and Virtio-BLK SNAP devices).

Testing:
1. vfio_pci.ko was tested by using 2 VirtIO-BLK physical functions based
   on NVIDIA's BlueField-2 SNAP technology. These 2 PCI functions were
   passed to a QEMU-based VM after binding to vfio_pci.ko, and basic FIO
   read/write was issued in the VM to each exposed block device (vda, vdb).
2. igd_vfio_pci.ko was only compiled and loaded/unloaded successfully on an x86 server.
3. npu2_vfio_pci.ko and nvlink2gpu_vfio_pci.ko were successfully
   compiled and loaded/unloaded on a P9 server, plus vfio probe/remove of devices (without
   QEMU/VM).

Note: After this series is approved, a new discovery/matching mechanism
      should be introduced to help users decide which driver should
      be bound to which device.

changes from v2:
 - Use the container_of approach between the core and vfio_pci drivers.
 - Comment on private/public attributes of the core structure to ease
   code maintenance.
 - Rename the core structure to vfio_pci_core_device.
 - Create 3 new vendor drivers (igd_vfio_pci, npu2_vfio_pci,
   nvlink2gpu_vfio_pci) and preserve backward compatibility.
 - Rebase on top of the Linux 5.11 tag.
 - Remove patches that were accepted as stand-alone.
 - Remove the mlx5_vfio_pci driver from this series (3 vendor drivers are enough
   to emphasize the approach).

changes from v1:
 - Create private and public vfio-pci structures (from Alex)
 - Register with the vfio core directly from vfio-pci-core (for now, expose
   minimal public functionality to vfio pci drivers. This removes the
   implicit behaviour from v1. More power can be given to the drivers in
   the future)
 - Add patch 3/9 to emphasize the needed extension for the LM feature (from
   Cornelia)
 - Take/release a refcount on the pci module during core open/release
 - Update nvlink, igd and zdev to be the PowerNV, X86 and s390 extensions of
   the vfio-pci core
 - Fix segfault bugs in the current vfio-pci zdev code

Max Gurtovoy (9):
  vfio-pci: rename vfio_pci.c to vfio_pci_core.c
  vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h
  vfio-pci: rename vfio_pci_device to vfio_pci_core_device
  vfio-pci: introduce vfio_pci_core subsystem driver
  vfio/pci: introduce vfio_pci_device structure
  vfio-pci-core: export vfio_pci_register_dev_region function
  vfio/pci_core: split nvlink2 to nvlink2gpu and npu2
  vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  vfio/pci: export igd support into vendor vfio_pci driver

 drivers/vfio/pci/Kconfig                      |   53 +-
 drivers/vfio/pci/Makefile                     |   20 +-
 .../pci/{vfio_pci_igd.c => igd_vfio_pci.c}    |  159 +-
 drivers/vfio/pci/igd_vfio_pci.h               |   24 +
 drivers/vfio/pci/npu2_trace.h                 |   50 +
 drivers/vfio/pci/npu2_vfio_pci.c              |  364 +++
 drivers/vfio/pci/npu2_vfio_pci.h              |   24 +
 .../vfio/pci/{trace.h => nvlink2gpu_trace.h}  |   27 +-
 ...io_pci_nvlink2.c => nvlink2gpu_vfio_pci.c} |  296 +-
 drivers/vfio/pci/nvlink2gpu_vfio_pci.h        |   24 +
 drivers/vfio/pci/vfio_pci.c                   | 2433 ++---------------
 drivers/vfio/pci/vfio_pci_config.c            |   70 +-
 drivers/vfio/pci/vfio_pci_core.c              | 2245 +++++++++++++++
 drivers/vfio/pci/vfio_pci_core.h              |  242 ++
 drivers/vfio/pci/vfio_pci_intrs.c             |   42 +-
 drivers/vfio/pci/vfio_pci_private.h           |  228 --
 drivers/vfio/pci/vfio_pci_rdwr.c              |   18 +-
 drivers/vfio/pci/vfio_pci_zdev.c              |   12 +-
 18 files changed, 3528 insertions(+), 2803 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_igd.c => igd_vfio_pci.c} (58%)
 create mode 100644 drivers/vfio/pci/igd_vfio_pci.h
 create mode 100644 drivers/vfio/pci/npu2_trace.h
 create mode 100644 drivers/vfio/pci/npu2_vfio_pci.c
 create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
 rename drivers/vfio/pci/{trace.h => nvlink2gpu_trace.h} (72%)
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => nvlink2gpu_vfio_pci.c} (57%)
 create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h
 create mode 100644 drivers/vfio/pci/vfio_pci_core.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.h
 delete mode 100644 drivers/vfio/pci/vfio_pci_private.h

-- 
2.25.4



* [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c
  2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
@ 2021-03-09  8:33 ` Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 2/9] vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h Max Gurtovoy
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

This is a preparation patch for separating the vfio_pci driver into a
subsystem driver and a generic PCI driver. This patch doesn't change any
logic.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Makefile                        | 2 +-
 drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} (100%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index eff97a7cd9f1..bbf8d7c8fc45 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-$(CONFIG_S390) += vfio_pci_zdev.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci_core.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci.c
rename to drivers/vfio/pci/vfio_pci_core.c
-- 
2.25.4



* [PATCH 2/9] vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h
  2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c Max Gurtovoy
@ 2021-03-09  8:33 ` Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 3/9] vfio-pci: rename vfio_pci_device to vfio_pci_core_device Max Gurtovoy
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

This is a preparation patch for separating the vfio_pci driver into a
subsystem driver and a generic PCI driver. This patch doesn't change
any logic.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c                       | 2 +-
 drivers/vfio/pci/vfio_pci_core.c                         | 2 +-
 drivers/vfio/pci/{vfio_pci_private.h => vfio_pci_core.h} | 6 +++---
 drivers/vfio/pci/vfio_pci_igd.c                          | 2 +-
 drivers/vfio/pci/vfio_pci_intrs.c                        | 2 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c                      | 2 +-
 drivers/vfio/pci/vfio_pci_rdwr.c                         | 2 +-
 drivers/vfio/pci/vfio_pci_zdev.c                         | 2 +-
 8 files changed, 10 insertions(+), 10 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_private.h => vfio_pci_core.h} (98%)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a402adee8a21..5e9d24992207 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -26,7 +26,7 @@
 #include <linux/vfio.h>
 #include <linux/slab.h>
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 /* Fake capability ID for standard config space */
 #define PCI_CAP_ID_BASIC	0
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 65e7e6b44578..bd587db04625 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -28,7 +28,7 @@
 #include <linux/nospec.h>
 #include <linux/sched/mm.h>
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_core.h
similarity index 98%
rename from drivers/vfio/pci/vfio_pci_private.h
rename to drivers/vfio/pci/vfio_pci_core.h
index 9cd1882a05af..b9ac7132b84a 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -15,8 +15,8 @@
 #include <linux/uuid.h>
 #include <linux/notifier.h>
 
-#ifndef VFIO_PCI_PRIVATE_H
-#define VFIO_PCI_PRIVATE_H
+#ifndef VFIO_PCI_CORE_H
+#define VFIO_PCI_CORE_H
 
 #define VFIO_PCI_OFFSET_SHIFT   40
 
@@ -225,4 +225,4 @@ static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 }
 #endif
 
-#endif /* VFIO_PCI_PRIVATE_H */
+#endif /* VFIO_PCI_CORE_H */
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 53d97f459252..0c599cd33d01 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -15,7 +15,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 #define OPREGION_SIGNATURE	"IntelGraphicsMem"
 #define OPREGION_SIZE		(8 * 1024)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 869dce5f134d..df1e8c8c274c 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -20,7 +20,7 @@
 #include <linux/wait.h>
 #include <linux/slab.h>
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 /*
  * INTx
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 9adcf6a8f888..326a704c4527 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -19,7 +19,7 @@
 #include <linux/sched/mm.h>
 #include <linux/mmu_context.h>
 #include <asm/kvm_ppc.h>
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index a0b5fc8e46f4..667e82726e75 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -17,7 +17,7 @@
 #include <linux/vfio.h>
 #include <linux/vgaarb.h>
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 #ifdef __LITTLE_ENDIAN
 #define vfio_ioread64	ioread64
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 229685634031..3e91d49fa3f0 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -19,7 +19,7 @@
 #include <asm/pci_clp.h>
 #include <asm/pci_io.h>
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 /*
  * Add the Base PCI Function information to the device info region.
-- 
2.25.4



* [PATCH 3/9] vfio-pci: rename vfio_pci_device to vfio_pci_core_device
  2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 2/9] vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h Max Gurtovoy
@ 2021-03-09  8:33 ` Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 4/9] vfio-pci: introduce vfio_pci_core subsystem driver Max Gurtovoy
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

This is a preparation patch for separating the vfio_pci driver into a
subsystem driver and a generic PCI driver. This patch doesn't change
any logic. The new vfio_pci_core_device structure will be the main
structure of the core driver, and later on the vfio_pci_device structure
will be the main structure of the generic vfio_pci driver.
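
As an illustration of the intended relationship (a sketch only, not part of
this patch; the wrapper layout and field name below are assumptions until
the later patch in this series adds the real structure): the core-managed
state lives in vfio_pci_core_device, and each PCI driver embeds it in its
own structure and recovers it with container_of().

struct vfio_pci_device {
	struct vfio_pci_core_device	vdev;	/* core-managed state */
	/* fields private to the generic vfio_pci driver would follow */
};

static inline struct vfio_pci_device *
to_vfio_pci_device(struct vfio_pci_core_device *core)
{
	/* recover the wrapping driver structure from the embedded core device */
	return container_of(core, struct vfio_pci_device, vdev);
}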

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c  | 68 +++++++++++-----------
 drivers/vfio/pci/vfio_pci_core.c    | 90 ++++++++++++++---------------
 drivers/vfio/pci/vfio_pci_core.h    | 76 ++++++++++++------------
 drivers/vfio/pci/vfio_pci_igd.c     | 14 ++---
 drivers/vfio/pci/vfio_pci_intrs.c   | 40 ++++++-------
 drivers/vfio/pci/vfio_pci_nvlink2.c | 20 +++----
 drivers/vfio/pci/vfio_pci_rdwr.c    | 16 ++---
 drivers/vfio/pci/vfio_pci_zdev.c    | 10 ++--
 8 files changed, 167 insertions(+), 167 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 5e9d24992207..122918d0f733 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -108,9 +108,9 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
 struct perm_bits {
 	u8	*virt;		/* read/write virtual data, not hw */
 	u8	*write;		/* writeable bits */
-	int	(*readfn)(struct vfio_pci_device *vdev, int pos, int count,
+	int	(*readfn)(struct vfio_pci_core_device *vdev, int pos, int count,
 			  struct perm_bits *perm, int offset, __le32 *val);
-	int	(*writefn)(struct vfio_pci_device *vdev, int pos, int count,
+	int	(*writefn)(struct vfio_pci_core_device *vdev, int pos, int count,
 			   struct perm_bits *perm, int offset, __le32 val);
 };
 
@@ -171,7 +171,7 @@ static int vfio_user_config_write(struct pci_dev *pdev, int offset,
 	return ret;
 }
 
-static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_default_config_read(struct vfio_pci_core_device *vdev, int pos,
 				    int count, struct perm_bits *perm,
 				    int offset, __le32 *val)
 {
@@ -197,7 +197,7 @@ static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos,
 	return count;
 }
 
-static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_default_config_write(struct vfio_pci_core_device *vdev, int pos,
 				     int count, struct perm_bits *perm,
 				     int offset, __le32 val)
 {
@@ -244,7 +244,7 @@ static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos,
 }
 
 /* Allow direct read from hardware, except for capability next pointer */
-static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos,
 				   int count, struct perm_bits *perm,
 				   int offset, __le32 *val)
 {
@@ -269,7 +269,7 @@ static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos,
 }
 
 /* Raw access skips any kind of virtualization */
-static int vfio_raw_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
 				 int count, struct perm_bits *perm,
 				 int offset, __le32 val)
 {
@@ -282,7 +282,7 @@ static int vfio_raw_config_write(struct vfio_pci_device *vdev, int pos,
 	return count;
 }
 
-static int vfio_raw_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
 				int count, struct perm_bits *perm,
 				int offset, __le32 *val)
 {
@@ -296,7 +296,7 @@ static int vfio_raw_config_read(struct vfio_pci_device *vdev, int pos,
 }
 
 /* Virt access uses only virtualization */
-static int vfio_virt_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_virt_config_write(struct vfio_pci_core_device *vdev, int pos,
 				  int count, struct perm_bits *perm,
 				  int offset, __le32 val)
 {
@@ -304,7 +304,7 @@ static int vfio_virt_config_write(struct vfio_pci_device *vdev, int pos,
 	return count;
 }
 
-static int vfio_virt_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
 				 int count, struct perm_bits *perm,
 				 int offset, __le32 *val)
 {
@@ -396,7 +396,7 @@ static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
 }
 
 /* Caller should hold memory_lock semaphore */
-bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
+bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
@@ -413,7 +413,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
  * Restore the *real* BARs after we detect a FLR or backdoor reset.
  * (backdoor = some device specific technique that we didn't catch)
  */
-static void vfio_bar_restore(struct vfio_pci_device *vdev)
+static void vfio_bar_restore(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u32 *rbar = vdev->rbar;
@@ -460,7 +460,7 @@ static __le32 vfio_generate_bar_flags(struct pci_dev *pdev, int bar)
  * Pretend we're hardware and tweak the values of the *virtual* PCI BARs
  * to reflect the hardware capabilities.  This implements BAR sizing.
  */
-static void vfio_bar_fixup(struct vfio_pci_device *vdev)
+static void vfio_bar_fixup(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int i;
@@ -514,7 +514,7 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
 	vdev->bardirty = false;
 }
 
-static int vfio_basic_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_basic_config_read(struct vfio_pci_core_device *vdev, int pos,
 				  int count, struct perm_bits *perm,
 				  int offset, __le32 *val)
 {
@@ -536,7 +536,7 @@ static int vfio_basic_config_read(struct vfio_pci_device *vdev, int pos,
 }
 
 /* Test whether BARs match the value we think they should contain */
-static bool vfio_need_bar_restore(struct vfio_pci_device *vdev)
+static bool vfio_need_bar_restore(struct vfio_pci_core_device *vdev)
 {
 	int i = 0, pos = PCI_BASE_ADDRESS_0, ret;
 	u32 bar;
@@ -552,7 +552,7 @@ static bool vfio_need_bar_restore(struct vfio_pci_device *vdev)
 	return false;
 }
 
-static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 				   int count, struct perm_bits *perm,
 				   int offset, __le32 val)
 {
@@ -692,7 +692,7 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
 	return 0;
 }
 
-static int vfio_pm_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 				int count, struct perm_bits *perm,
 				int offset, __le32 val)
 {
@@ -747,7 +747,7 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 	return 0;
 }
 
-static int vfio_vpd_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_vpd_config_write(struct vfio_pci_core_device *vdev, int pos,
 				 int count, struct perm_bits *perm,
 				 int offset, __le32 val)
 {
@@ -829,7 +829,7 @@ static int __init init_pci_cap_pcix_perm(struct perm_bits *perm)
 	return 0;
 }
 
-static int vfio_exp_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
 				 int count, struct perm_bits *perm,
 				 int offset, __le32 val)
 {
@@ -913,7 +913,7 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
 	return 0;
 }
 
-static int vfio_af_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
 				int count, struct perm_bits *perm,
 				int offset, __le32 val)
 {
@@ -1072,7 +1072,7 @@ int __init vfio_pci_init_perm_bits(void)
 	return ret;
 }
 
-static int vfio_find_cap_start(struct vfio_pci_device *vdev, int pos)
+static int vfio_find_cap_start(struct vfio_pci_core_device *vdev, int pos)
 {
 	u8 cap;
 	int base = (pos >= PCI_CFG_SPACE_SIZE) ? PCI_CFG_SPACE_SIZE :
@@ -1089,7 +1089,7 @@ static int vfio_find_cap_start(struct vfio_pci_device *vdev, int pos)
 	return pos;
 }
 
-static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_msi_config_read(struct vfio_pci_core_device *vdev, int pos,
 				int count, struct perm_bits *perm,
 				int offset, __le32 *val)
 {
@@ -1109,7 +1109,7 @@ static int vfio_msi_config_read(struct vfio_pci_device *vdev, int pos,
 	return vfio_default_config_read(vdev, pos, count, perm, offset, val);
 }
 
-static int vfio_msi_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_msi_config_write(struct vfio_pci_core_device *vdev, int pos,
 				 int count, struct perm_bits *perm,
 				 int offset, __le32 val)
 {
@@ -1189,7 +1189,7 @@ static int init_pci_cap_msi_perm(struct perm_bits *perm, int len, u16 flags)
 }
 
 /* Determine MSI CAP field length; initialize msi_perms on 1st call per vdev */
-static int vfio_msi_cap_len(struct vfio_pci_device *vdev, u8 pos)
+static int vfio_msi_cap_len(struct vfio_pci_core_device *vdev, u8 pos)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int len, ret;
@@ -1222,7 +1222,7 @@ static int vfio_msi_cap_len(struct vfio_pci_device *vdev, u8 pos)
 }
 
 /* Determine extended capability length for VC (2 & 9) and MFVC */
-static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos)
+static int vfio_vc_cap_len(struct vfio_pci_core_device *vdev, u16 pos)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u32 tmp;
@@ -1263,7 +1263,7 @@ static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos)
 	return len;
 }
 
-static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
+static int vfio_cap_len(struct vfio_pci_core_device *vdev, u8 cap, u8 pos)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u32 dword;
@@ -1338,7 +1338,7 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
 	return 0;
 }
 
-static int vfio_ext_cap_len(struct vfio_pci_device *vdev, u16 ecap, u16 epos)
+static int vfio_ext_cap_len(struct vfio_pci_core_device *vdev, u16 ecap, u16 epos)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u8 byte;
@@ -1412,7 +1412,7 @@ static int vfio_ext_cap_len(struct vfio_pci_device *vdev, u16 ecap, u16 epos)
 	return 0;
 }
 
-static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev,
+static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev,
 				   int offset, int size)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -1459,7 +1459,7 @@ static int vfio_fill_vconfig_bytes(struct vfio_pci_device *vdev,
 	return ret;
 }
 
-static int vfio_cap_init(struct vfio_pci_device *vdev)
+static int vfio_cap_init(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u8 *map = vdev->pci_config_map;
@@ -1549,7 +1549,7 @@ static int vfio_cap_init(struct vfio_pci_device *vdev)
 	return 0;
 }
 
-static int vfio_ecap_init(struct vfio_pci_device *vdev)
+static int vfio_ecap_init(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u8 *map = vdev->pci_config_map;
@@ -1669,7 +1669,7 @@ static const struct pci_device_id known_bogus_vf_intx_pin[] = {
  * for each area requiring emulated bits, but the array of pointers
  * would be comparable in size (at least for standard config space).
  */
-int vfio_config_init(struct vfio_pci_device *vdev)
+int vfio_config_init(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u8 *map, *vconfig;
@@ -1773,7 +1773,7 @@ int vfio_config_init(struct vfio_pci_device *vdev)
 	return pcibios_err_to_errno(ret);
 }
 
-void vfio_config_free(struct vfio_pci_device *vdev)
+void vfio_config_free(struct vfio_pci_core_device *vdev)
 {
 	kfree(vdev->vconfig);
 	vdev->vconfig = NULL;
@@ -1790,7 +1790,7 @@ void vfio_config_free(struct vfio_pci_device *vdev)
  * Find the remaining number of bytes in a dword that match the given
  * position.  Stop at either the end of the capability or the dword boundary.
  */
-static size_t vfio_pci_cap_remaining_dword(struct vfio_pci_device *vdev,
+static size_t vfio_pci_cap_remaining_dword(struct vfio_pci_core_device *vdev,
 					   loff_t pos)
 {
 	u8 cap = vdev->pci_config_map[pos];
@@ -1802,7 +1802,7 @@ static size_t vfio_pci_cap_remaining_dword(struct vfio_pci_device *vdev,
 	return i;
 }
 
-static ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev, char __user *buf,
+static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 				 size_t count, loff_t *ppos, bool iswrite)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -1885,7 +1885,7 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev, char __user *buf,
 	return ret;
 }
 
-ssize_t vfio_pci_config_rw(struct vfio_pci_device *vdev, char __user *buf,
+ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
 {
 	size_t done = 0;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index bd587db04625..557a03528dcd 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -121,7 +121,7 @@ static bool vfio_pci_is_denylisted(struct pci_dev *pdev)
  */
 static unsigned int vfio_pci_set_vga_decode(void *opaque, bool single_vga)
 {
-	struct vfio_pci_device *vdev = opaque;
+	struct vfio_pci_core_device *vdev = opaque;
 	struct pci_dev *tmp = NULL, *pdev = vdev->pdev;
 	unsigned char max_busnr;
 	unsigned int decodes;
@@ -155,7 +155,7 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
-static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
+static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 {
 	struct resource *res;
 	int i;
@@ -223,8 +223,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 	}
 }
 
-static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev);
-static void vfio_pci_disable(struct vfio_pci_device *vdev);
+static void vfio_pci_try_bus_reset(struct vfio_pci_core_device *vdev);
+static void vfio_pci_disable(struct vfio_pci_core_device *vdev);
 static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data);
 
 /*
@@ -258,7 +258,7 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
-static void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
+static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u16 pmcsr;
@@ -278,7 +278,7 @@ static void vfio_pci_probe_power_state(struct vfio_pci_device *vdev)
  * by PM capability emulation and separately from pci_dev internal saved state
  * to avoid it being overwritten and consumed around other resets.
  */
-int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
+int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	bool needs_restore = false, needs_save = false;
@@ -309,7 +309,7 @@ int vfio_pci_set_power_state(struct vfio_pci_device *vdev, pci_power_t state)
 	return ret;
 }
 
-static int vfio_pci_enable(struct vfio_pci_device *vdev)
+static int vfio_pci_enable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int ret;
@@ -416,7 +416,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	return ret;
 }
 
-static void vfio_pci_disable(struct vfio_pci_device *vdev)
+static void vfio_pci_disable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_dummy_resource *dummy_res, *tmp;
@@ -517,7 +517,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 
 static struct pci_driver vfio_pci_driver;
 
-static struct vfio_pci_device *get_pf_vdev(struct vfio_pci_device *vdev,
+static struct vfio_pci_core_device *get_pf_vdev(struct vfio_pci_core_device *vdev,
 					   struct vfio_device **pf_dev)
 {
 	struct pci_dev *physfn = pci_physfn(vdev->pdev);
@@ -537,10 +537,10 @@ static struct vfio_pci_device *get_pf_vdev(struct vfio_pci_device *vdev,
 	return vfio_device_data(*pf_dev);
 }
 
-static void vfio_pci_vf_token_user_add(struct vfio_pci_device *vdev, int val)
+static void vfio_pci_vf_token_user_add(struct vfio_pci_core_device *vdev, int val)
 {
 	struct vfio_device *pf_dev;
-	struct vfio_pci_device *pf_vdev = get_pf_vdev(vdev, &pf_dev);
+	struct vfio_pci_core_device *pf_vdev = get_pf_vdev(vdev, &pf_dev);
 
 	if (!pf_vdev)
 		return;
@@ -555,7 +555,7 @@ static void vfio_pci_vf_token_user_add(struct vfio_pci_device *vdev, int val)
 
 static void vfio_pci_release(void *device_data)
 {
-	struct vfio_pci_device *vdev = device_data;
+	struct vfio_pci_core_device *vdev = device_data;
 
 	mutex_lock(&vdev->reflck->lock);
 
@@ -583,7 +583,7 @@ static void vfio_pci_release(void *device_data)
 
 static int vfio_pci_open(void *device_data)
 {
-	struct vfio_pci_device *vdev = device_data;
+	struct vfio_pci_core_device *vdev = device_data;
 	int ret = 0;
 
 	if (!try_module_get(THIS_MODULE))
@@ -607,7 +607,7 @@ static int vfio_pci_open(void *device_data)
 	return ret;
 }
 
-static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
+static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_type)
 {
 	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
 		u8 pin;
@@ -754,7 +754,7 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
 	return walk.ret;
 }
 
-static int msix_mmappable_cap(struct vfio_pci_device *vdev,
+static int msix_mmappable_cap(struct vfio_pci_core_device *vdev,
 			      struct vfio_info_cap *caps)
 {
 	struct vfio_info_cap_header header = {
@@ -765,7 +765,7 @@ static int msix_mmappable_cap(struct vfio_pci_device *vdev,
 	return vfio_info_add_capability(caps, &header, sizeof(header));
 }
 
-int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
+int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
 				 unsigned int type, unsigned int subtype,
 				 const struct vfio_pci_regops *ops,
 				 size_t size, u32 flags, void *data)
@@ -800,7 +800,7 @@ struct vfio_devices {
 static long vfio_pci_ioctl(void *device_data,
 			   unsigned int cmd, unsigned long arg)
 {
-	struct vfio_pci_device *vdev = device_data;
+	struct vfio_pci_core_device *vdev = device_data;
 	unsigned long minsz;
 
 	if (cmd == VFIO_DEVICE_GET_INFO) {
@@ -1280,7 +1280,7 @@ static long vfio_pci_ioctl(void *device_data,
 			goto hot_reset_release;
 
 		for (; mem_idx < devs.cur_index; mem_idx++) {
-			struct vfio_pci_device *tmp;
+			struct vfio_pci_core_device *tmp;
 
 			tmp = vfio_device_data(devs.devices[mem_idx]);
 
@@ -1298,7 +1298,7 @@ static long vfio_pci_ioctl(void *device_data,
 hot_reset_release:
 		for (i = 0; i < devs.cur_index; i++) {
 			struct vfio_device *device;
-			struct vfio_pci_device *tmp;
+			struct vfio_pci_core_device *tmp;
 
 			device = devs.devices[i];
 			tmp = vfio_device_data(device);
@@ -1406,7 +1406,7 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
-	struct vfio_pci_device *vdev = device_data;
+	struct vfio_pci_core_device *vdev = device_data;
 
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
 		return -EINVAL;
@@ -1453,7 +1453,7 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 }
 
 /* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
-static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
+static int vfio_pci_zap_and_vma_lock(struct vfio_pci_core_device *vdev, bool try)
 {
 	struct vfio_pci_mmap_vma *mmap_vma, *tmp;
 
@@ -1541,14 +1541,14 @@ static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
 	}
 }
 
-void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device *vdev)
+void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev)
 {
 	vfio_pci_zap_and_vma_lock(vdev, false);
 	down_write(&vdev->memory_lock);
 	mutex_unlock(&vdev->vma_lock);
 }
 
-u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev)
+u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
 {
 	u16 cmd;
 
@@ -1561,14 +1561,14 @@ u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev)
 	return cmd;
 }
 
-void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev, u16 cmd)
+void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 cmd)
 {
 	pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
 	up_write(&vdev->memory_lock);
 }
 
 /* Caller holds vma_lock */
-static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
+static int __vfio_pci_add_vma(struct vfio_pci_core_device *vdev,
 			      struct vm_area_struct *vma)
 {
 	struct vfio_pci_mmap_vma *mmap_vma;
@@ -1594,7 +1594,7 @@ static void vfio_pci_mmap_open(struct vm_area_struct *vma)
 
 static void vfio_pci_mmap_close(struct vm_area_struct *vma)
 {
-	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_pci_core_device *vdev = vma->vm_private_data;
 	struct vfio_pci_mmap_vma *mmap_vma;
 
 	mutex_lock(&vdev->vma_lock);
@@ -1611,7 +1611,7 @@ static void vfio_pci_mmap_close(struct vm_area_struct *vma)
 static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_pci_core_device *vdev = vma->vm_private_data;
 	vm_fault_t ret = VM_FAULT_NOPAGE;
 
 	mutex_lock(&vdev->vma_lock);
@@ -1648,7 +1648,7 @@ static const struct vm_operations_struct vfio_pci_mmap_ops = {
 
 static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
-	struct vfio_pci_device *vdev = device_data;
+	struct vfio_pci_core_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned int index;
 	u64 phys_len, req_len, pgoff, req_start;
@@ -1716,7 +1716,7 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 
 static void vfio_pci_request(void *device_data, unsigned int count)
 {
-	struct vfio_pci_device *vdev = device_data;
+	struct vfio_pci_core_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
 
 	mutex_lock(&vdev->igate);
@@ -1735,7 +1735,7 @@ static void vfio_pci_request(void *device_data, unsigned int count)
 	mutex_unlock(&vdev->igate);
 }
 
-static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
+static int vfio_pci_validate_vf_token(struct vfio_pci_core_device *vdev,
 				      bool vf_token, uuid_t *uuid)
 {
 	/*
@@ -1768,7 +1768,7 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 
 	if (vdev->pdev->is_virtfn) {
 		struct vfio_device *pf_dev;
-		struct vfio_pci_device *pf_vdev = get_pf_vdev(vdev, &pf_dev);
+		struct vfio_pci_core_device *pf_vdev = get_pf_vdev(vdev, &pf_dev);
 		bool match;
 
 		if (!pf_vdev) {
@@ -1832,7 +1832,7 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_device *vdev,
 
 static int vfio_pci_match(void *device_data, char *buf)
 {
-	struct vfio_pci_device *vdev = device_data;
+	struct vfio_pci_core_device *vdev = device_data;
 	bool vf_token = false;
 	uuid_t uuid;
 	int ret;
@@ -1891,14 +1891,14 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.match		= vfio_pci_match,
 };
 
-static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
+static int vfio_pci_reflck_attach(struct vfio_pci_core_device *vdev);
 static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
 
 static int vfio_pci_bus_notifier(struct notifier_block *nb,
 				 unsigned long action, void *data)
 {
-	struct vfio_pci_device *vdev = container_of(nb,
-						    struct vfio_pci_device, nb);
+	struct vfio_pci_core_device *vdev = container_of(nb,
+						    struct vfio_pci_core_device, nb);
 	struct device *dev = data;
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct pci_dev *physfn = pci_physfn(pdev);
@@ -1924,7 +1924,7 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_core_device *vdev;
 	struct iommu_group *group;
 	int ret;
 
@@ -2031,7 +2031,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 static void vfio_pci_remove(struct pci_dev *pdev)
 {
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_core_device *vdev;
 
 	pci_disable_sriov(pdev);
 
@@ -2071,7 +2071,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 						  pci_channel_state_t state)
 {
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_core_device *vdev;
 	struct vfio_device *device;
 
 	device = vfio_device_get_from_dev(&pdev->dev);
@@ -2098,7 +2098,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 
 static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 {
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_core_device *vdev;
 	struct vfio_device *device;
 	int ret = 0;
 
@@ -2165,7 +2165,7 @@ static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 {
 	struct vfio_pci_reflck **preflck = data;
 	struct vfio_device *device;
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_core_device *vdev;
 
 	device = vfio_device_get_from_dev(&pdev->dev);
 	if (!device)
@@ -2189,7 +2189,7 @@ static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 	return 0;
 }
 
-static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
+static int vfio_pci_reflck_attach(struct vfio_pci_core_device *vdev)
 {
 	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
 
@@ -2224,7 +2224,7 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_devices *devs = data;
 	struct vfio_device *device;
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_core_device *vdev;
 
 	if (devs->cur_index == devs->max_index)
 		return -ENOSPC;
@@ -2254,7 +2254,7 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
 {
 	struct vfio_devices *devs = data;
 	struct vfio_device *device;
-	struct vfio_pci_device *vdev;
+	struct vfio_pci_core_device *vdev;
 
 	if (devs->cur_index == devs->max_index)
 		return -ENOSPC;
@@ -2299,12 +2299,12 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
  * to be bound to vfio_pci since that's the only way we can be sure they
  * stay put.
  */
-static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
+static void vfio_pci_try_bus_reset(struct vfio_pci_core_device *vdev)
 {
 	struct vfio_devices devs = { .cur_index = 0 };
 	int i = 0, ret = -EINVAL;
 	bool slot = false;
-	struct vfio_pci_device *tmp;
+	struct vfio_pci_core_device *tmp;
 
 	if (!pci_probe_reset_slot(vdev->pdev->slot))
 		slot = true;
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index b9ac7132b84a..3964ca898984 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -32,15 +32,15 @@
 #define VFIO_PCI_IOEVENTFD_MAX		1000
 
 struct vfio_pci_ioeventfd {
-	struct list_head	next;
-	struct vfio_pci_device	*vdev;
-	struct virqfd		*virqfd;
-	void __iomem		*addr;
-	uint64_t		data;
-	loff_t			pos;
-	int			bar;
-	int			count;
-	bool			test_mem;
+	struct list_head		next;
+	struct vfio_pci_core_device	*vdev;
+	struct virqfd			*virqfd;
+	void __iomem			*addr;
+	uint64_t			data;
+	loff_t				pos;
+	int				bar;
+	int				count;
+	bool				test_mem;
 };
 
 struct vfio_pci_irq_ctx {
@@ -52,18 +52,18 @@ struct vfio_pci_irq_ctx {
 	struct irq_bypass_producer	producer;
 };
 
-struct vfio_pci_device;
+struct vfio_pci_core_device;
 struct vfio_pci_region;
 
 struct vfio_pci_regops {
-	size_t	(*rw)(struct vfio_pci_device *vdev, char __user *buf,
+	size_t	(*rw)(struct vfio_pci_core_device *vdev, char __user *buf,
 		      size_t count, loff_t *ppos, bool iswrite);
-	void	(*release)(struct vfio_pci_device *vdev,
+	void	(*release)(struct vfio_pci_core_device *vdev,
 			   struct vfio_pci_region *region);
-	int	(*mmap)(struct vfio_pci_device *vdev,
+	int	(*mmap)(struct vfio_pci_core_device *vdev,
 			struct vfio_pci_region *region,
 			struct vm_area_struct *vma);
-	int	(*add_capability)(struct vfio_pci_device *vdev,
+	int	(*add_capability)(struct vfio_pci_core_device *vdev,
 				  struct vfio_pci_region *region,
 				  struct vfio_info_cap *caps);
 };
@@ -99,7 +99,7 @@ struct vfio_pci_mmap_vma {
 	struct list_head	vma_next;
 };
 
-struct vfio_pci_device {
+struct vfio_pci_core_device {
 	struct pci_dev		*pdev;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
 	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
@@ -150,75 +150,75 @@ struct vfio_pci_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
-extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
-extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
+extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
+extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
 
-extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
+extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,
 				   uint32_t flags, unsigned index,
 				   unsigned start, unsigned count, void *data);
 
-extern ssize_t vfio_pci_config_rw(struct vfio_pci_device *vdev,
+extern ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev,
 				  char __user *buf, size_t count,
 				  loff_t *ppos, bool iswrite);
 
-extern ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
+extern ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			       size_t count, loff_t *ppos, bool iswrite);
 
-extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
+extern ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			       size_t count, loff_t *ppos, bool iswrite);
 
-extern long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
+extern long vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset,
 			       uint64_t data, int count, int fd);
 
 extern int vfio_pci_init_perm_bits(void);
 extern void vfio_pci_uninit_perm_bits(void);
 
-extern int vfio_config_init(struct vfio_pci_device *vdev);
-extern void vfio_config_free(struct vfio_pci_device *vdev);
+extern int vfio_config_init(struct vfio_pci_core_device *vdev);
+extern void vfio_config_free(struct vfio_pci_core_device *vdev);
 
-extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
+extern int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
 					unsigned int type, unsigned int subtype,
 					const struct vfio_pci_regops *ops,
 					size_t size, u32 flags, void *data);
 
-extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
+extern int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
 				    pci_power_t state);
 
-extern bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev);
-extern void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device
+extern bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev);
+extern void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device
 						    *vdev);
-extern u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev);
-extern void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev,
+extern u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev);
+extern void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev,
 					       u16 cmd);
 
 #ifdef CONFIG_VFIO_PCI_IGD
-extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
+extern int vfio_pci_igd_init(struct vfio_pci_core_device *vdev);
 #else
-static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
+static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
 {
 	return -ENODEV;
 }
 #endif
 #ifdef CONFIG_VFIO_PCI_NVLINK2
-extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
-extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
+extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_core_device *vdev);
+extern int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev);
 #else
-static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
+static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
 {
 	return -ENODEV;
 }
 
-static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
+static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
 {
 	return -ENODEV;
 }
 #endif
 
 #ifdef CONFIG_S390
-extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
+extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 				       struct vfio_info_cap *caps);
 #else
-static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
+static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 					      struct vfio_info_cap *caps)
 {
 	return -ENODEV;
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 0c599cd33d01..2388c9722ed8 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -21,7 +21,7 @@
 #define OPREGION_SIZE		(8 * 1024)
 #define OPREGION_PCI_ADDR	0xfc
 
-static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
+static size_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			      size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
@@ -41,7 +41,7 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 	return count;
 }
 
-static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
+static void vfio_pci_igd_release(struct vfio_pci_core_device *vdev,
 				 struct vfio_pci_region *region)
 {
 	memunmap(region->data);
@@ -52,7 +52,7 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
 	.release	= vfio_pci_igd_release,
 };
 
-static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
+static int vfio_pci_igd_opregion_init(struct vfio_pci_core_device *vdev)
 {
 	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
 	u32 addr, size;
@@ -107,7 +107,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 	return ret;
 }
 
-static size_t vfio_pci_igd_cfg_rw(struct vfio_pci_device *vdev,
+static size_t vfio_pci_igd_cfg_rw(struct vfio_pci_core_device *vdev,
 				  char __user *buf, size_t count, loff_t *ppos,
 				  bool iswrite)
 {
@@ -200,7 +200,7 @@ static size_t vfio_pci_igd_cfg_rw(struct vfio_pci_device *vdev,
 	return count;
 }
 
-static void vfio_pci_igd_cfg_release(struct vfio_pci_device *vdev,
+static void vfio_pci_igd_cfg_release(struct vfio_pci_core_device *vdev,
 				     struct vfio_pci_region *region)
 {
 	struct pci_dev *pdev = region->data;
@@ -213,7 +213,7 @@ static const struct vfio_pci_regops vfio_pci_igd_cfg_regops = {
 	.release	= vfio_pci_igd_cfg_release,
 };
 
-static int vfio_pci_igd_cfg_init(struct vfio_pci_device *vdev)
+static int vfio_pci_igd_cfg_init(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *host_bridge, *lpc_bridge;
 	int ret;
@@ -261,7 +261,7 @@ static int vfio_pci_igd_cfg_init(struct vfio_pci_device *vdev)
 	return 0;
 }
 
-int vfio_pci_igd_init(struct vfio_pci_device *vdev)
+int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
 {
 	int ret;
 
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index df1e8c8c274c..945ddbdf4d11 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -27,13 +27,13 @@
  */
 static void vfio_send_intx_eventfd(void *opaque, void *unused)
 {
-	struct vfio_pci_device *vdev = opaque;
+	struct vfio_pci_core_device *vdev = opaque;
 
 	if (likely(is_intx(vdev) && !vdev->virq_disabled))
 		eventfd_signal(vdev->ctx[0].trigger, 1);
 }
 
-void vfio_pci_intx_mask(struct vfio_pci_device *vdev)
+void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned long flags;
@@ -73,7 +73,7 @@ void vfio_pci_intx_mask(struct vfio_pci_device *vdev)
  */
 static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 {
-	struct vfio_pci_device *vdev = opaque;
+	struct vfio_pci_core_device *vdev = opaque;
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned long flags;
 	int ret = 0;
@@ -107,7 +107,7 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 	return ret;
 }
 
-void vfio_pci_intx_unmask(struct vfio_pci_device *vdev)
+void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev)
 {
 	if (vfio_pci_intx_unmask_handler(vdev, NULL) > 0)
 		vfio_send_intx_eventfd(vdev, NULL);
@@ -115,7 +115,7 @@ void vfio_pci_intx_unmask(struct vfio_pci_device *vdev)
 
 static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
 {
-	struct vfio_pci_device *vdev = dev_id;
+	struct vfio_pci_core_device *vdev = dev_id;
 	unsigned long flags;
 	int ret = IRQ_NONE;
 
@@ -139,7 +139,7 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
 	return ret;
 }
 
-static int vfio_intx_enable(struct vfio_pci_device *vdev)
+static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
 {
 	if (!is_irq_none(vdev))
 		return -EINVAL;
@@ -168,7 +168,7 @@ static int vfio_intx_enable(struct vfio_pci_device *vdev)
 	return 0;
 }
 
-static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
+static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned long irqflags = IRQF_SHARED;
@@ -223,7 +223,7 @@ static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd)
 	return 0;
 }
 
-static void vfio_intx_disable(struct vfio_pci_device *vdev)
+static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
 {
 	vfio_virqfd_disable(&vdev->ctx[0].unmask);
 	vfio_virqfd_disable(&vdev->ctx[0].mask);
@@ -244,7 +244,7 @@ static irqreturn_t vfio_msihandler(int irq, void *arg)
 	return IRQ_HANDLED;
 }
 
-static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
+static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned int flag = msix ? PCI_IRQ_MSIX : PCI_IRQ_MSI;
@@ -285,7 +285,7 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 	return 0;
 }
 
-static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
+static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 				      int vector, int fd, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -364,7 +364,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 	return 0;
 }
 
-static int vfio_msi_set_block(struct vfio_pci_device *vdev, unsigned start,
+static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 			      unsigned count, int32_t *fds, bool msix)
 {
 	int i, j, ret = 0;
@@ -385,7 +385,7 @@ static int vfio_msi_set_block(struct vfio_pci_device *vdev, unsigned start,
 	return ret;
 }
 
-static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
+static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int i;
@@ -417,7 +417,7 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
 /*
  * IOCTL support
  */
-static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev,
+static int vfio_pci_set_intx_unmask(struct vfio_pci_core_device *vdev,
 				    unsigned index, unsigned start,
 				    unsigned count, uint32_t flags, void *data)
 {
@@ -444,7 +444,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev,
 	return 0;
 }
 
-static int vfio_pci_set_intx_mask(struct vfio_pci_device *vdev,
+static int vfio_pci_set_intx_mask(struct vfio_pci_core_device *vdev,
 				  unsigned index, unsigned start,
 				  unsigned count, uint32_t flags, void *data)
 {
@@ -464,7 +464,7 @@ static int vfio_pci_set_intx_mask(struct vfio_pci_device *vdev,
 	return 0;
 }
 
-static int vfio_pci_set_intx_trigger(struct vfio_pci_device *vdev,
+static int vfio_pci_set_intx_trigger(struct vfio_pci_core_device *vdev,
 				     unsigned index, unsigned start,
 				     unsigned count, uint32_t flags, void *data)
 {
@@ -507,7 +507,7 @@ static int vfio_pci_set_intx_trigger(struct vfio_pci_device *vdev,
 	return 0;
 }
 
-static int vfio_pci_set_msi_trigger(struct vfio_pci_device *vdev,
+static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 				    unsigned index, unsigned start,
 				    unsigned count, uint32_t flags, void *data)
 {
@@ -613,7 +613,7 @@ static int vfio_pci_set_ctx_trigger_single(struct eventfd_ctx **ctx,
 	return -EINVAL;
 }
 
-static int vfio_pci_set_err_trigger(struct vfio_pci_device *vdev,
+static int vfio_pci_set_err_trigger(struct vfio_pci_core_device *vdev,
 				    unsigned index, unsigned start,
 				    unsigned count, uint32_t flags, void *data)
 {
@@ -624,7 +624,7 @@ static int vfio_pci_set_err_trigger(struct vfio_pci_device *vdev,
 					       count, flags, data);
 }
 
-static int vfio_pci_set_req_trigger(struct vfio_pci_device *vdev,
+static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
 				    unsigned index, unsigned start,
 				    unsigned count, uint32_t flags, void *data)
 {
@@ -635,11 +635,11 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_device *vdev,
 					       count, flags, data);
 }
 
-int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
+int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 			    unsigned index, unsigned start, unsigned count,
 			    void *data)
 {
-	int (*func)(struct vfio_pci_device *vdev, unsigned index,
+	int (*func)(struct vfio_pci_core_device *vdev, unsigned index,
 		    unsigned start, unsigned count, uint32_t flags,
 		    void *data) = NULL;
 
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 326a704c4527..8ef2c62a9d27 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -39,7 +39,7 @@ struct vfio_pci_nvgpu_data {
 	struct notifier_block group_notifier;
 };
 
-static size_t vfio_pci_nvgpu_rw(struct vfio_pci_device *vdev,
+static size_t vfio_pci_nvgpu_rw(struct vfio_pci_core_device *vdev,
 		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
@@ -89,7 +89,7 @@ static size_t vfio_pci_nvgpu_rw(struct vfio_pci_device *vdev,
 	return count;
 }
 
-static void vfio_pci_nvgpu_release(struct vfio_pci_device *vdev,
+static void vfio_pci_nvgpu_release(struct vfio_pci_core_device *vdev,
 		struct vfio_pci_region *region)
 {
 	struct vfio_pci_nvgpu_data *data = region->data;
@@ -136,7 +136,7 @@ static const struct vm_operations_struct vfio_pci_nvgpu_mmap_vmops = {
 	.fault = vfio_pci_nvgpu_mmap_fault,
 };
 
-static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev,
+static int vfio_pci_nvgpu_mmap(struct vfio_pci_core_device *vdev,
 		struct vfio_pci_region *region, struct vm_area_struct *vma)
 {
 	int ret;
@@ -171,7 +171,7 @@ static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev,
 	return ret;
 }
 
-static int vfio_pci_nvgpu_add_capability(struct vfio_pci_device *vdev,
+static int vfio_pci_nvgpu_add_capability(struct vfio_pci_core_device *vdev,
 		struct vfio_pci_region *region, struct vfio_info_cap *caps)
 {
 	struct vfio_pci_nvgpu_data *data = region->data;
@@ -207,7 +207,7 @@ static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
+int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
 {
 	int ret;
 	u64 reg[2];
@@ -304,7 +304,7 @@ struct vfio_pci_npu2_data {
 	unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */
 };
 
-static size_t vfio_pci_npu2_rw(struct vfio_pci_device *vdev,
+static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev,
 		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
@@ -328,7 +328,7 @@ static size_t vfio_pci_npu2_rw(struct vfio_pci_device *vdev,
 	return count;
 }
 
-static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev,
+static int vfio_pci_npu2_mmap(struct vfio_pci_core_device *vdev,
 		struct vfio_pci_region *region, struct vm_area_struct *vma)
 {
 	int ret;
@@ -349,7 +349,7 @@ static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev,
 	return ret;
 }
 
-static void vfio_pci_npu2_release(struct vfio_pci_device *vdev,
+static void vfio_pci_npu2_release(struct vfio_pci_core_device *vdev,
 		struct vfio_pci_region *region)
 {
 	struct vfio_pci_npu2_data *data = region->data;
@@ -358,7 +358,7 @@ static void vfio_pci_npu2_release(struct vfio_pci_device *vdev,
 	kfree(data);
 }
 
-static int vfio_pci_npu2_add_capability(struct vfio_pci_device *vdev,
+static int vfio_pci_npu2_add_capability(struct vfio_pci_core_device *vdev,
 		struct vfio_pci_region *region, struct vfio_info_cap *caps)
 {
 	struct vfio_pci_npu2_data *data = region->data;
@@ -388,7 +388,7 @@ static const struct vfio_pci_regops vfio_pci_npu2_regops = {
 	.add_capability = vfio_pci_npu2_add_capability,
 };
 
-int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
+int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
 {
 	int ret;
 	struct vfio_pci_npu2_data *data;
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 667e82726e75..8fff4689dd44 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -38,7 +38,7 @@
 #define vfio_iowrite8	iowrite8
 
 #define VFIO_IOWRITE(size) \
-static int vfio_pci_iowrite##size(struct vfio_pci_device *vdev,		\
+static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev,		\
 			bool test_mem, u##size val, void __iomem *io)	\
 {									\
 	if (test_mem) {							\
@@ -65,7 +65,7 @@ VFIO_IOWRITE(64)
 #endif
 
 #define VFIO_IOREAD(size) \
-static int vfio_pci_ioread##size(struct vfio_pci_device *vdev,		\
+static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev,		\
 			bool test_mem, u##size *val, void __iomem *io)	\
 {									\
 	if (test_mem) {							\
@@ -94,7 +94,7 @@ VFIO_IOREAD(32)
  * reads with -1.  This is intended for handling MSI-X vector tables and
  * leftover space for ROM BARs.
  */
-static ssize_t do_io_rw(struct vfio_pci_device *vdev, bool test_mem,
+static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
 			void __iomem *io, char __user *buf,
 			loff_t off, size_t count, size_t x_start,
 			size_t x_end, bool iswrite)
@@ -200,7 +200,7 @@ static ssize_t do_io_rw(struct vfio_pci_device *vdev, bool test_mem,
 	return done;
 }
 
-static int vfio_pci_setup_barmap(struct vfio_pci_device *vdev, int bar)
+static int vfio_pci_setup_barmap(struct vfio_pci_core_device *vdev, int bar)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int ret;
@@ -224,7 +224,7 @@ static int vfio_pci_setup_barmap(struct vfio_pci_device *vdev, int bar)
 	return 0;
 }
 
-ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
+ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			size_t count, loff_t *ppos, bool iswrite)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -288,7 +288,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	return done;
 }
 
-ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
+ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			       size_t count, loff_t *ppos, bool iswrite)
 {
 	int ret;
@@ -384,7 +384,7 @@ static void vfio_pci_ioeventfd_do_write(struct vfio_pci_ioeventfd *ioeventfd,
 static int vfio_pci_ioeventfd_handler(void *opaque, void *unused)
 {
 	struct vfio_pci_ioeventfd *ioeventfd = opaque;
-	struct vfio_pci_device *vdev = ioeventfd->vdev;
+	struct vfio_pci_core_device *vdev = ioeventfd->vdev;
 
 	if (ioeventfd->test_mem) {
 		if (!down_read_trylock(&vdev->memory_lock))
@@ -410,7 +410,7 @@ static void vfio_pci_ioeventfd_thread(void *opaque, void *unused)
 	vfio_pci_ioeventfd_do_write(ioeventfd, ioeventfd->test_mem);
 }
 
-long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
+long vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset,
 			uint64_t data, int count, int fd)
 {
 	struct pci_dev *pdev = vdev->pdev;
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 3e91d49fa3f0..3216925c4d41 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -24,7 +24,7 @@
 /*
  * Add the Base PCI Function information to the device info region.
  */
-static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
+static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_core_device *vdev,
 			 struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_base cap = {
@@ -45,7 +45,7 @@ static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add the Base PCI Function Group information to the device info region.
  */
-static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
+static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_core_device *vdev,
 			  struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_group cap = {
@@ -66,7 +66,7 @@ static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add the device utility string to the device info region.
  */
-static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
+static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_core_device *vdev,
 			 struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_util *cap;
@@ -90,7 +90,7 @@ static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add the function path string to the device info region.
  */
-static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
+static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_pci_core_device *vdev,
 			 struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_pfip *cap;
@@ -114,7 +114,7 @@ static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add all supported capabilities to the VFIO_DEVICE_GET_INFO capability chain.
  */
-int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
+int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 				struct vfio_info_cap *caps)
 {
 	struct zpci_dev *zdev = to_zpci(vdev->pdev);
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 4/9] vfio-pci: introduce vfio_pci_core subsystem driver
  2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (2 preceding siblings ...)
  2021-03-09  8:33 ` [PATCH 3/9] vfio-pci: rename vfio_pci_device to vfio_pci_core_device Max Gurtovoy
@ 2021-03-09  8:33 ` Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 5/9] vfio/pci: introduce vfio_pci_device structure Max Gurtovoy
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

Split the vfio_pci driver into two parts, the 'struct pci_driver'
(vfio_pci) and a library of code (vfio_pci_core) that helps create a
VFIO device on top of a PCI device.

As before, vfio_pci.ko continues to present the same interface under
sysfs, and this change should have no functional impact.

vfio_pci_core exposes an interface that is similar to a typical
Linux subsystem, in that a pci_driver doing probe() can set up a number
of details and then create the VFIO char device.

Allowing another module to provide the pci_driver lets that module
customize how VFIO is set up, inject its own operations, and easily
extend vendor specific functionality.

This is complementary to how VFIO's mediated devices work. Instead of
custom device lifecycle management and a special bus, drivers using
this approach rely on the normal driver core lifecycle (e.g.
bind/unbind) management. This is optimized to effectively support
customizations that only make small modifications to what vfio_pci
would do normally.

This approach is also a pluggable alternative for the hard-wired
CONFIG_VFIO_PCI_IGD and CONFIG_VFIO_PCI_NVLINK2 "drivers" that are
built into vfio-pci. Using this work, all of that code can be moved to
dedicated device-specific modules and cleanly split out of the
generic vfio_pci driver.

Below is an example of adding a new driver to the vfio-pci subsystem:
	+-------------------------------------------------+
	|                                                 |
	|                     VFIO                        |
	|                                                 |
	+-------------------------------------------------+

	+-------------------------------------------------+
	|                                                 |
	|                  VFIO_PCI_CORE                  |
	|                                                 |
	+-------------------------------------------------+

	+--------------+ +---------------+ +--------------+
	|              | |               | |              |
	|  VFIO_PCI    | | MLX5_VFIO_PCI | | IGD_VFIO_PCI |
	|              | |               | |              |
	+--------------+ +---------------+ +--------------+

In this way, mlx5_vfio_pci will use vfio_pci_core to register to the
VFIO subsystem and also use the generic PCI functionality exported
from it. Additionally, it will add the vendor specific logic needed
for HW specific features such as live migration. The same applies to
igd_vfio_pci, which will add special extensions for Intel graphics
cards (GVT-d).
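
To make this concrete, below is a minimal sketch of what such a vendor
driver could look like, using only the interfaces added by this patch
(vfio_create_pci_device(), vfio_destroy_pci_device() and the
vfio_pci_core_* helpers). The mlx5 naming and the PCI device ID are
illustrative placeholders only, not part of this series:

#include <linux/err.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/pci.h>

#include "vfio_pci_core.h"

static int mlx5_vfio_pci_open(void *device_data)
{
	struct vfio_pci_core_device *vdev = device_data;
	int ret = 0;

	if (!try_module_get(THIS_MODULE))
		return -ENODEV;

	mutex_lock(&vdev->reflck->lock);
	if (!vdev->refcnt) {
		ret = vfio_pci_core_enable(vdev);
		if (ret)
			goto out;
		vfio_pci_probe_mmaps(vdev);
		/* vendor specific setup (e.g. migration support) would go here */
	}
	vdev->refcnt++;
out:
	mutex_unlock(&vdev->reflck->lock);
	if (ret)
		module_put(THIS_MODULE);
	return ret;
}

static void mlx5_vfio_pci_release(void *device_data)
{
	struct vfio_pci_core_device *vdev = device_data;

	mutex_lock(&vdev->reflck->lock);
	if (!(--vdev->refcnt))
		vfio_pci_core_disable(vdev);
	mutex_unlock(&vdev->reflck->lock);

	module_put(THIS_MODULE);
}

static const struct vfio_device_ops mlx5_vfio_pci_ops = {
	.name		= "mlx5-vfio-pci",
	.open		= mlx5_vfio_pci_open,
	.release	= mlx5_vfio_pci_release,
	.ioctl		= vfio_pci_core_ioctl,
	.read		= vfio_pci_core_read,
	.write		= vfio_pci_core_write,
	.mmap		= vfio_pci_core_mmap,
	.request	= vfio_pci_core_request,
	.match		= vfio_pci_core_match,
};

static int mlx5_vfio_pci_probe(struct pci_dev *pdev,
			       const struct pci_device_id *id)
{
	struct vfio_pci_core_device *vdev;

	vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops);
	if (IS_ERR(vdev))
		return PTR_ERR(vdev);
	return 0;
}

static void mlx5_vfio_pci_remove(struct pci_dev *pdev)
{
	vfio_destroy_pci_device(pdev);
}

/* the device ID below is a placeholder for illustration only */
static const struct pci_device_id mlx5_vfio_pci_table[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, 0x101e) },
	{ }
};
MODULE_DEVICE_TABLE(pci, mlx5_vfio_pci_table);

static struct pci_driver mlx5_vfio_pci_driver = {
	.name			= "mlx5-vfio-pci",
	.id_table		= mlx5_vfio_pci_table,
	.probe			= mlx5_vfio_pci_probe,
	.remove			= mlx5_vfio_pci_remove,
	.err_handler		= &vfio_pci_core_err_handlers,
};
module_pci_driver(mlx5_vfio_pci_driver);

MODULE_LICENSE("GPL v2");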

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Kconfig         |  22 ++-
 drivers/vfio/pci/Makefile        |  13 +-
 drivers/vfio/pci/vfio_pci.c      | 247 ++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c | 318 ++++++++-----------------------
 drivers/vfio/pci/vfio_pci_core.h | 113 +++++++----
 5 files changed, 423 insertions(+), 290 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index ac3c1dd3edef..829e90a2e5a3 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-config VFIO_PCI
-	tristate "VFIO support for PCI devices"
+config VFIO_PCI_CORE
+	tristate "VFIO core support for PCI devices"
 	depends on VFIO && PCI && EVENTFD
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
@@ -10,9 +10,17 @@ config VFIO_PCI
 
 	  If you don't know what to do here, say N.
 
+config VFIO_PCI
+	tristate "VFIO support for PCI devices"
+	depends on VFIO_PCI_CORE
+	help
+	  This provides a generic PCI support using the VFIO framework.
+
+	  If you don't know what to do here, say N.
+
 config VFIO_PCI_VGA
 	bool "VFIO PCI support for VGA devices"
-	depends on VFIO_PCI && X86 && VGA_ARB
+	depends on VFIO_PCI_CORE && X86 && VGA_ARB
 	help
 	  Support for VGA extension to VFIO PCI.  This exposes an additional
 	  region on VGA devices for accessing legacy VGA addresses used by
@@ -21,16 +29,16 @@ config VFIO_PCI_VGA
 	  If you don't know what to do here, say N.
 
 config VFIO_PCI_MMAP
-	depends on VFIO_PCI
+	depends on VFIO_PCI_CORE
 	def_bool y if !S390
 
 config VFIO_PCI_INTX
-	depends on VFIO_PCI
+	depends on VFIO_PCI_CORE
 	def_bool y if !S390
 
 config VFIO_PCI_IGD
 	bool "VFIO PCI extensions for Intel graphics (GVT-d)"
-	depends on VFIO_PCI && X86
+	depends on VFIO_PCI_CORE && X86
 	default y
 	help
 	  Support for Intel IGD specific extensions to enable direct
@@ -42,6 +50,6 @@ config VFIO_PCI_IGD
 
 config VFIO_PCI_NVLINK2
 	def_bool y
-	depends on VFIO_PCI && PPC_POWERNV
+	depends on VFIO_PCI_CORE && PPC_POWERNV
 	help
 	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index bbf8d7c8fc45..16e7d77d63ce 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,8 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
-vfio-pci-$(CONFIG_S390) += vfio_pci_zdev.o
-
+obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+
+vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
+
+vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
new file mode 100644
index 000000000000..447c31f4e64e
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -0,0 +1,247 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights reserved.
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+
+#include "vfio_pci_core.h"
+
+#define DRIVER_VERSION  "0.2"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
+
+static char ids[1024] __initdata;
+module_param_string(ids, ids, sizeof(ids), 0);
+MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+
+static bool enable_sriov;
+#ifdef CONFIG_PCI_IOV
+module_param(enable_sriov, bool, 0644);
+MODULE_PARM_DESC(enable_sriov, "Enable support for SR-IOV configuration.  Enabling SR-IOV on a PF typically requires support of the userspace PF driver, enabling VFs without such support may result in non-functional VFs or PF.");
+#endif
+
+static bool disable_denylist;
+module_param(disable_denylist, bool, 0444);
+MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
+
+static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
+{
+	switch (pdev->vendor) {
+	case PCI_VENDOR_ID_INTEL:
+		switch (pdev->device) {
+		case PCI_DEVICE_ID_INTEL_QAT_C3XXX:
+		case PCI_DEVICE_ID_INTEL_QAT_C3XXX_VF:
+		case PCI_DEVICE_ID_INTEL_QAT_C62X:
+		case PCI_DEVICE_ID_INTEL_QAT_C62X_VF:
+		case PCI_DEVICE_ID_INTEL_QAT_DH895XCC:
+		case PCI_DEVICE_ID_INTEL_QAT_DH895XCC_VF:
+			return true;
+		default:
+			return false;
+		}
+	}
+
+	return false;
+}
+
+static bool vfio_pci_is_denylisted(struct pci_dev *pdev)
+{
+	if (!vfio_pci_dev_in_denylist(pdev))
+		return false;
+
+	if (disable_denylist) {
+		pci_warn(pdev,
+			 "device denylist disabled - allowing device %04x:%04x.\n",
+			 pdev->vendor, pdev->device);
+		return false;
+	}
+
+	pci_warn(pdev, "%04x:%04x exists in vfio-pci device denylist, driver probing disallowed.\n",
+		 pdev->vendor, pdev->device);
+
+	return true;
+}
+
+static void vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_core_device *vdev = device_data;
+
+	mutex_lock(&vdev->reflck->lock);
+	if (!(--vdev->refcnt)) {
+		vfio_pci_vf_token_user_add(vdev, -1);
+		vfio_pci_core_spapr_eeh_release(vdev);
+		vfio_pci_core_disable(vdev);
+	}
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_core_device *vdev = device_data;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_core_enable(vdev);
+		if (ret)
+			goto error;
+
+		vfio_pci_probe_mmaps(vdev);
+		vfio_pci_core_spapr_eeh_open(vdev);
+		vfio_pci_vf_token_user_add(vdev, 1);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (ret)
+		module_put(THIS_MODULE);
+	return ret;
+}
+
+static const struct vfio_device_ops vfio_pci_ops = {
+	.name		= "vfio-pci",
+	.open		= vfio_pci_open,
+	.release	= vfio_pci_release,
+	.ioctl		= vfio_pci_core_ioctl,
+	.read		= vfio_pci_core_read,
+	.write		= vfio_pci_core_write,
+	.mmap		= vfio_pci_core_mmap,
+	.request	= vfio_pci_core_request,
+	.match		= vfio_pci_core_match,
+};
+
+static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct vfio_pci_core_device *vdev;
+
+	if (vfio_pci_is_denylisted(pdev))
+		return -EINVAL;
+
+	vdev = vfio_create_pci_device(pdev, &vfio_pci_ops);
+	if (IS_ERR(vdev))
+		return PTR_ERR(vdev);
+
+	return 0;
+}
+
+static void vfio_pci_remove(struct pci_dev *pdev)
+{
+	vfio_destroy_pci_device(pdev);
+}
+
+static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
+{
+	might_sleep();
+
+	if (!enable_sriov)
+		return -ENOENT;
+
+	return vfio_pci_core_sriov_configure(pdev, nr_virtfn);
+}
+
+static struct pci_driver vfio_pci_driver = {
+	.name			= "vfio-pci",
+	.id_table		= NULL, /* only dynamic ids */
+	.probe			= vfio_pci_probe,
+	.remove			= vfio_pci_remove,
+	.sriov_configure	= vfio_pci_sriov_configure,
+	.err_handler		= &vfio_pci_core_err_handlers,
+};
+
+static void __exit vfio_pci_cleanup(void)
+{
+	pci_unregister_driver(&vfio_pci_driver);
+}
+
+static void __init vfio_pci_fill_ids(void)
+{
+	char *p, *id;
+	int rc;
+
+	/* no ids passed actually */
+	if (ids[0] == '\0')
+		return;
+
+	/* add ids specified in the module parameter */
+	p = ids;
+	while ((id = strsep(&p, ","))) {
+		unsigned int vendor, device, subvendor = PCI_ANY_ID,
+			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
+		int fields;
+
+		if (!strlen(id))
+			continue;
+
+		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
+				&vendor, &device, &subvendor, &subdevice,
+				&class, &class_mask);
+
+		if (fields < 2) {
+			pr_warn("invalid id string \"%s\"\n", id);
+			continue;
+		}
+
+		rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
+				   subvendor, subdevice, class, class_mask, 0);
+		if (rc)
+			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask, rc);
+		else
+			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
+				vendor, device, subvendor, subdevice,
+				class, class_mask);
+	}
+}
+
+static int __init vfio_pci_init(void)
+{
+	int ret;
+
+	/* Register and scan for devices */
+	ret = pci_register_driver(&vfio_pci_driver);
+	if (ret)
+		return ret;
+
+	vfio_pci_fill_ids();
+
+	if (disable_denylist)
+		pr_warn("device denylist disabled.\n");
+
+	return 0;
+}
+
+module_init(vfio_pci_init);
+module_exit(vfio_pci_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 557a03528dcd..878a3609b916 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -23,7 +23,6 @@
 #include <linux/slab.h>
 #include <linux/types.h>
 #include <linux/uaccess.h>
-#include <linux/vfio.h>
 #include <linux/vgaarb.h>
 #include <linux/nospec.h>
 #include <linux/sched/mm.h>
@@ -32,11 +31,7 @@
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
-#define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
-
-static char ids[1024] __initdata;
-module_param_string(ids, ids, sizeof(ids), 0);
-MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
+#define DRIVER_DESC "core driver for VFIO based PCI devices"
 
 static bool nointxmask;
 module_param_named(nointxmask, nointxmask, bool, S_IRUGO | S_IWUSR);
@@ -54,16 +49,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-static bool enable_sriov;
-#ifdef CONFIG_PCI_IOV
-module_param(enable_sriov, bool, 0644);
-MODULE_PARM_DESC(enable_sriov, "Enable support for SR-IOV configuration.  Enabling SR-IOV on a PF typically requires support of the userspace PF driver, enabling VFs without such support may result in non-functional VFs or PF.");
-#endif
-
-static bool disable_denylist;
-module_param(disable_denylist, bool, 0444);
-MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
-
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -73,44 +58,6 @@ static inline bool vfio_vga_disabled(void)
 #endif
 }
 
-static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
-{
-	switch (pdev->vendor) {
-	case PCI_VENDOR_ID_INTEL:
-		switch (pdev->device) {
-		case PCI_DEVICE_ID_INTEL_QAT_C3XXX:
-		case PCI_DEVICE_ID_INTEL_QAT_C3XXX_VF:
-		case PCI_DEVICE_ID_INTEL_QAT_C62X:
-		case PCI_DEVICE_ID_INTEL_QAT_C62X_VF:
-		case PCI_DEVICE_ID_INTEL_QAT_DH895XCC:
-		case PCI_DEVICE_ID_INTEL_QAT_DH895XCC_VF:
-			return true;
-		default:
-			return false;
-		}
-	}
-
-	return false;
-}
-
-static bool vfio_pci_is_denylisted(struct pci_dev *pdev)
-{
-	if (!vfio_pci_dev_in_denylist(pdev))
-		return false;
-
-	if (disable_denylist) {
-		pci_warn(pdev,
-			 "device denylist disabled - allowing device %04x:%04x.\n",
-			 pdev->vendor, pdev->device);
-		return false;
-	}
-
-	pci_warn(pdev, "%04x:%04x exists in vfio-pci device denylist, driver probing disallowed.\n",
-		 pdev->vendor, pdev->device);
-
-	return true;
-}
-
 /*
  * Our VGA arbiter participation is limited since we don't know anything
  * about the device itself.  However, if the device is the only VGA device
@@ -155,7 +102,7 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
-static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
+void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 {
 	struct resource *res;
 	int i;
@@ -222,6 +169,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 		vdev->bar_mmap_supported[bar] = false;
 	}
 }
+EXPORT_SYMBOL_GPL(vfio_pci_probe_mmaps);
 
 static void vfio_pci_try_bus_reset(struct vfio_pci_core_device *vdev);
 static void vfio_pci_disable(struct vfio_pci_core_device *vdev);
@@ -309,7 +257,24 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
-static int vfio_pci_enable(struct vfio_pci_core_device *vdev)
+void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
+{
+	vfio_pci_disable(vdev);
+
+	mutex_lock(&vdev->igate);
+	if (vdev->err_trigger) {
+		eventfd_ctx_put(vdev->err_trigger);
+		vdev->err_trigger = NULL;
+	}
+	if (vdev->req_trigger) {
+		eventfd_ctx_put(vdev->req_trigger);
+		vdev->req_trigger = NULL;
+	}
+	mutex_unlock(&vdev->igate);
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
+
+int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int ret;
@@ -407,14 +372,13 @@ static int vfio_pci_enable(struct vfio_pci_core_device *vdev)
 		}
 	}
 
-	vfio_pci_probe_mmaps(vdev);
-
 	return 0;
 
 disable_exit:
 	vfio_pci_disable(vdev);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_enable);
 
 static void vfio_pci_disable(struct vfio_pci_core_device *vdev)
 {
@@ -515,8 +479,6 @@ static void vfio_pci_disable(struct vfio_pci_core_device *vdev)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 }
 
-static struct pci_driver vfio_pci_driver;
-
 static struct vfio_pci_core_device *get_pf_vdev(struct vfio_pci_core_device *vdev,
 					   struct vfio_device **pf_dev)
 {
@@ -529,7 +491,7 @@ static struct vfio_pci_core_device *get_pf_vdev(struct vfio_pci_core_device *vde
 	if (!*pf_dev)
 		return NULL;
 
-	if (pci_dev_driver(physfn) != &vfio_pci_driver) {
+	if (pci_dev_driver(physfn) != pci_dev_driver(vdev->pdev)) {
 		vfio_device_put(*pf_dev);
 		return NULL;
 	}
@@ -537,7 +499,7 @@ static struct vfio_pci_core_device *get_pf_vdev(struct vfio_pci_core_device *vde
 	return vfio_device_data(*pf_dev);
 }
 
-static void vfio_pci_vf_token_user_add(struct vfio_pci_core_device *vdev, int val)
+void vfio_pci_vf_token_user_add(struct vfio_pci_core_device *vdev, int val)
 {
 	struct vfio_device *pf_dev;
 	struct vfio_pci_core_device *pf_vdev = get_pf_vdev(vdev, &pf_dev);
@@ -552,60 +514,19 @@ static void vfio_pci_vf_token_user_add(struct vfio_pci_core_device *vdev, int va
 
 	vfio_device_put(pf_dev);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_vf_token_user_add);
 
-static void vfio_pci_release(void *device_data)
+void vfio_pci_core_spapr_eeh_open(struct vfio_pci_core_device *vdev)
 {
-	struct vfio_pci_core_device *vdev = device_data;
-
-	mutex_lock(&vdev->reflck->lock);
-
-	if (!(--vdev->refcnt)) {
-		vfio_pci_vf_token_user_add(vdev, -1);
-		vfio_spapr_pci_eeh_release(vdev->pdev);
-		vfio_pci_disable(vdev);
-
-		mutex_lock(&vdev->igate);
-		if (vdev->err_trigger) {
-			eventfd_ctx_put(vdev->err_trigger);
-			vdev->err_trigger = NULL;
-		}
-		if (vdev->req_trigger) {
-			eventfd_ctx_put(vdev->req_trigger);
-			vdev->req_trigger = NULL;
-		}
-		mutex_unlock(&vdev->igate);
-	}
-
-	mutex_unlock(&vdev->reflck->lock);
-
-	module_put(THIS_MODULE);
+	vfio_spapr_pci_eeh_open(vdev->pdev);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_spapr_eeh_open);
 
-static int vfio_pci_open(void *device_data)
+void vfio_pci_core_spapr_eeh_release(struct vfio_pci_core_device *vpdev)
 {
-	struct vfio_pci_core_device *vdev = device_data;
-	int ret = 0;
-
-	if (!try_module_get(THIS_MODULE))
-		return -ENODEV;
-
-	mutex_lock(&vdev->reflck->lock);
-
-	if (!vdev->refcnt) {
-		ret = vfio_pci_enable(vdev);
-		if (ret)
-			goto error;
-
-		vfio_spapr_pci_eeh_open(vdev->pdev);
-		vfio_pci_vf_token_user_add(vdev, 1);
-	}
-	vdev->refcnt++;
-error:
-	mutex_unlock(&vdev->reflck->lock);
-	if (ret)
-		module_put(THIS_MODULE);
-	return ret;
+	vfio_spapr_pci_eeh_release(vpdev->pdev);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_spapr_eeh_release);
 
 static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_type)
 {
@@ -797,8 +718,8 @@ struct vfio_devices {
 	int max_index;
 };
 
-static long vfio_pci_ioctl(void *device_data,
-			   unsigned int cmd, unsigned long arg)
+long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
+		unsigned long arg)
 {
 	struct vfio_pci_core_device *vdev = device_data;
 	unsigned long minsz;
@@ -1401,6 +1322,7 @@ static long vfio_pci_ioctl(void *device_data,
 
 	return -ENOTTY;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
 static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
@@ -1434,23 +1356,25 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
 	return -EINVAL;
 }
 
-static ssize_t vfio_pci_read(void *device_data, char __user *buf,
-			     size_t count, loff_t *ppos)
+ssize_t vfio_pci_core_read(void *device_data, char __user *buf, size_t count,
+		loff_t *ppos)
 {
 	if (!count)
 		return 0;
 
 	return vfio_pci_rw(device_data, buf, count, ppos, false);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_read);
 
-static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
-			      size_t count, loff_t *ppos)
+ssize_t vfio_pci_core_write(void *device_data, const char __user *buf,
+		size_t count, loff_t *ppos)
 {
 	if (!count)
 		return 0;
 
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_write);
 
 /* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
 static int vfio_pci_zap_and_vma_lock(struct vfio_pci_core_device *vdev, bool try)
@@ -1646,7 +1570,7 @@ static const struct vm_operations_struct vfio_pci_mmap_ops = {
 	.fault = vfio_pci_mmap_fault,
 };
 
-static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
+int vfio_pci_core_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_pci_core_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1713,8 +1637,9 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_mmap);
 
-static void vfio_pci_request(void *device_data, unsigned int count)
+void vfio_pci_core_request(void *device_data, unsigned int count)
 {
 	struct vfio_pci_core_device *vdev = device_data;
 	struct pci_dev *pdev = vdev->pdev;
@@ -1734,6 +1659,7 @@ static void vfio_pci_request(void *device_data, unsigned int count)
 
 	mutex_unlock(&vdev->igate);
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_request);
 
 static int vfio_pci_validate_vf_token(struct vfio_pci_core_device *vdev,
 				      bool vf_token, uuid_t *uuid)
@@ -1830,7 +1756,7 @@ static int vfio_pci_validate_vf_token(struct vfio_pci_core_device *vdev,
 
 #define VF_TOKEN_ARG "vf_token="
 
-static int vfio_pci_match(void *device_data, char *buf)
+int vfio_pci_core_match(void *device_data, char *buf)
 {
 	struct vfio_pci_core_device *vdev = device_data;
 	bool vf_token = false;
@@ -1878,18 +1804,7 @@ static int vfio_pci_match(void *device_data, char *buf)
 
 	return 1; /* Match */
 }
-
-static const struct vfio_device_ops vfio_pci_ops = {
-	.name		= "vfio-pci",
-	.open		= vfio_pci_open,
-	.release	= vfio_pci_release,
-	.ioctl		= vfio_pci_ioctl,
-	.read		= vfio_pci_read,
-	.write		= vfio_pci_write,
-	.mmap		= vfio_pci_mmap,
-	.request	= vfio_pci_request,
-	.match		= vfio_pci_match,
-};
+EXPORT_SYMBOL_GPL(vfio_pci_core_match);
 
 static int vfio_pci_reflck_attach(struct vfio_pci_core_device *vdev);
 static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
@@ -1908,12 +1823,12 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
 		pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
 			 pci_name(pdev));
 		pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
-						  vfio_pci_ops.name);
+						  vdev->vfio_pci_ops->name);
 	} else if (action == BUS_NOTIFY_BOUND_DRIVER &&
 		   pdev->is_virtfn && physfn == vdev->pdev) {
 		struct pci_driver *drv = pci_dev_driver(pdev);
 
-		if (drv && drv != &vfio_pci_driver)
+		if (drv && drv != pci_dev_driver(vdev->pdev))
 			pci_warn(vdev->pdev,
 				 "VF %s bound to driver %s while PF bound to vfio-pci\n",
 				 pci_name(pdev), drv->name);
@@ -1922,17 +1837,15 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
 	return 0;
 }
 
-static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
+		const struct vfio_device_ops *vfio_pci_ops)
 {
 	struct vfio_pci_core_device *vdev;
 	struct iommu_group *group;
 	int ret;
 
-	if (vfio_pci_is_denylisted(pdev))
-		return -EINVAL;
-
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 
 	/*
 	 * Prevent binding to PFs with VFs enabled, the VFs might be in use
@@ -1944,12 +1857,12 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	 */
 	if (pci_num_vf(pdev)) {
 		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
-		return -EBUSY;
+		return ERR_PTR(-EBUSY);
 	}
 
 	group = vfio_iommu_group_get(&pdev->dev);
 	if (!group)
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 
 	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
 	if (!vdev) {
@@ -1958,6 +1871,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	vdev->pdev = pdev;
+	vdev->vfio_pci_ops = vfio_pci_ops;
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	mutex_init(&vdev->igate);
 	spin_lock_init(&vdev->irqlock);
@@ -1968,7 +1882,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	INIT_LIST_HEAD(&vdev->vma_list);
 	init_rwsem(&vdev->memory_lock);
 
-	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
+	ret = vfio_add_group_dev(&pdev->dev, vfio_pci_ops, vdev);
 	if (ret)
 		goto out_free;
 
@@ -2014,7 +1928,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 	}
 
-	return ret;
+	return vdev;
 
 out_vf_token:
 	kfree(vdev->vf_token);
@@ -2026,10 +1940,11 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	kfree(vdev);
 out_group_put:
 	vfio_iommu_group_put(group, &pdev->dev);
-	return ret;
+	return ERR_PTR(ret);
 }
+EXPORT_SYMBOL_GPL(vfio_create_pci_device);
 
-static void vfio_pci_remove(struct pci_dev *pdev)
+void vfio_destroy_pci_device(struct pci_dev *pdev)
 {
 	struct vfio_pci_core_device *vdev;
 
@@ -2067,9 +1982,10 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
 	}
 }
+EXPORT_SYMBOL_GPL(vfio_destroy_pci_device);
 
-static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
-						  pci_channel_state_t state)
+static pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
+		pci_channel_state_t state)
 {
 	struct vfio_pci_core_device *vdev;
 	struct vfio_device *device;
@@ -2096,7 +2012,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
-static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
+int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 {
 	struct vfio_pci_core_device *vdev;
 	struct vfio_device *device;
@@ -2104,9 +2020,6 @@ static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 
 	might_sleep();
 
-	if (!enable_sriov)
-		return -ENOENT;
-
 	device = vfio_device_get_from_dev(&pdev->dev);
 	if (!device)
 		return -ENODEV;
@@ -2126,19 +2039,12 @@ static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 
 	return ret < 0 ? ret : nr_virtfn;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure);
 
-static const struct pci_error_handlers vfio_err_handlers = {
-	.error_detected = vfio_pci_aer_err_detected,
-};
-
-static struct pci_driver vfio_pci_driver = {
-	.name			= "vfio-pci",
-	.id_table		= NULL, /* only dynamic ids */
-	.probe			= vfio_pci_probe,
-	.remove			= vfio_pci_remove,
-	.sriov_configure	= vfio_pci_sriov_configure,
-	.err_handler		= &vfio_err_handlers,
+const struct pci_error_handlers vfio_pci_core_err_handlers = {
+	.error_detected = vfio_pci_core_aer_err_detected,
 };
+EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
 
 static DEFINE_MUTEX(reflck_lock);
 
@@ -2171,13 +2077,13 @@ static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
 	if (!device)
 		return 0;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	vdev = vfio_device_data(device);
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
 		vfio_device_put(device);
 		return 0;
 	}
 
-	vdev = vfio_device_data(device);
-
 	if (vdev->reflck) {
 		vfio_pci_reflck_get(vdev->reflck);
 		*preflck = vdev->reflck;
@@ -2233,13 +2139,13 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 	if (!device)
 		return -EINVAL;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	vdev = vfio_device_data(device);
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
 
-	vdev = vfio_device_data(device);
-
 	/* Fault if the device is not unused */
 	if (vdev->refcnt) {
 		vfio_device_put(device);
@@ -2263,13 +2169,13 @@ static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
 	if (!device)
 		return -EINVAL;
 
-	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+	vdev = vfio_device_data(device);
+
+	if (pci_dev_driver(pdev) != pci_dev_driver(vdev->pdev)) {
 		vfio_device_put(device);
 		return -EBUSY;
 	}
 
-	vdev = vfio_device_data(device);
-
 	/*
 	 * Locking multiple devices is prone to deadlock, runaway and
 	 * unwind if we hit contention.
@@ -2358,81 +2264,19 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_core_device *vdev)
 	kfree(devs.devices);
 }
 
-static void __exit vfio_pci_cleanup(void)
+static void __exit vfio_pci_core_cleanup(void)
 {
-	pci_unregister_driver(&vfio_pci_driver);
 	vfio_pci_uninit_perm_bits();
 }
 
-static void __init vfio_pci_fill_ids(void)
+static int __init vfio_pci_core_init(void)
 {
-	char *p, *id;
-	int rc;
-
-	/* no ids passed actually */
-	if (ids[0] == '\0')
-		return;
-
-	/* add ids specified in the module parameter */
-	p = ids;
-	while ((id = strsep(&p, ","))) {
-		unsigned int vendor, device, subvendor = PCI_ANY_ID,
-			subdevice = PCI_ANY_ID, class = 0, class_mask = 0;
-		int fields;
-
-		if (!strlen(id))
-			continue;
-
-		fields = sscanf(id, "%x:%x:%x:%x:%x:%x",
-				&vendor, &device, &subvendor, &subdevice,
-				&class, &class_mask);
-
-		if (fields < 2) {
-			pr_warn("invalid id string \"%s\"\n", id);
-			continue;
-		}
-
-		rc = pci_add_dynid(&vfio_pci_driver, vendor, device,
-				   subvendor, subdevice, class, class_mask, 0);
-		if (rc)
-			pr_warn("failed to add dynamic id [%04x:%04x[%04x:%04x]] class %#08x/%08x (%d)\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask, rc);
-		else
-			pr_info("add [%04x:%04x[%04x:%04x]] class %#08x/%08x\n",
-				vendor, device, subvendor, subdevice,
-				class, class_mask);
-	}
-}
-
-static int __init vfio_pci_init(void)
-{
-	int ret;
-
 	/* Allocate shared config space permision data used by all devices */
-	ret = vfio_pci_init_perm_bits();
-	if (ret)
-		return ret;
-
-	/* Register and scan for devices */
-	ret = pci_register_driver(&vfio_pci_driver);
-	if (ret)
-		goto out_driver;
-
-	vfio_pci_fill_ids();
-
-	if (disable_denylist)
-		pr_warn("device denylist disabled.\n");
-
-	return 0;
-
-out_driver:
-	vfio_pci_uninit_perm_bits();
-	return ret;
+	return vfio_pci_init_perm_bits();
 }
 
-module_init(vfio_pci_init);
-module_exit(vfio_pci_cleanup);
+module_init(vfio_pci_core_init);
+module_exit(vfio_pci_core_cleanup);
 
 MODULE_VERSION(DRIVER_VERSION);
 MODULE_LICENSE("GPL v2");
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index 3964ca898984..a3517a9472bd 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -10,6 +10,7 @@
 
 #include <linux/mutex.h>
 #include <linux/pci.h>
+#include <linux/vfio.h>
 #include <linux/irqbypass.h>
 #include <linux/types.h>
 #include <linux/uuid.h>
@@ -100,48 +101,52 @@ struct vfio_pci_mmap_vma {
 };
 
 struct vfio_pci_core_device {
-	struct pci_dev		*pdev;
-	void __iomem		*barmap[PCI_STD_NUM_BARS];
-	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
-	u8			*pci_config_map;
-	u8			*vconfig;
-	struct perm_bits	*msi_perm;
-	spinlock_t		irqlock;
-	struct mutex		igate;
-	struct vfio_pci_irq_ctx	*ctx;
-	int			num_ctx;
-	int			irq_type;
-	int			num_regions;
-	struct vfio_pci_region	*region;
-	u8			msi_qmax;
-	u8			msix_bar;
-	u16			msix_size;
-	u32			msix_offset;
-	u32			rbar[7];
-	bool			pci_2_3;
-	bool			virq_disabled;
-	bool			reset_works;
-	bool			extended_caps;
-	bool			bardirty;
-	bool			has_vga;
-	bool			needs_reset;
-	bool			nointx;
-	bool			needs_pm_restore;
-	struct pci_saved_state	*pci_saved_state;
-	struct pci_saved_state	*pm_save;
-	struct vfio_pci_reflck	*reflck;
-	int			refcnt;
-	int			ioeventfds_nr;
-	struct eventfd_ctx	*err_trigger;
-	struct eventfd_ctx	*req_trigger;
-	struct list_head	dummy_resources_list;
-	struct mutex		ioeventfds_lock;
-	struct list_head	ioeventfds_list;
+	/* below are the public fields used by vfio_pci drivers */
+	struct pci_dev			*pdev;
+	const struct vfio_device_ops	*vfio_pci_ops;
+	struct vfio_pci_reflck		*reflck;
+	int				refcnt;
+	struct vfio_pci_region		*region;
+	u8				*pci_config_map;
+	u8				*vconfig;
+
+	/* below are the private internal fields used by vfio_pci_core */
+	void __iomem			*barmap[PCI_STD_NUM_BARS];
+	bool				bar_mmap_supported[PCI_STD_NUM_BARS];
+	struct perm_bits		*msi_perm;
+	spinlock_t			irqlock;
+	struct mutex			igate;
+	struct vfio_pci_irq_ctx		*ctx;
+	int				num_ctx;
+	int				irq_type;
+	int				num_regions;
+	u8				msi_qmax;
+	u8				msix_bar;
+	u16				msix_size;
+	u32				msix_offset;
+	u32				rbar[7];
+	bool				pci_2_3;
+	bool				virq_disabled;
+	bool				reset_works;
+	bool				extended_caps;
+	bool				bardirty;
+	bool				has_vga;
+	bool				needs_reset;
+	bool				nointx;
+	bool				needs_pm_restore;
+	struct pci_saved_state		*pci_saved_state;
+	struct pci_saved_state		*pm_save;
+	int				ioeventfds_nr;
+	struct eventfd_ctx		*err_trigger;
+	struct eventfd_ctx		*req_trigger;
+	struct list_head		dummy_resources_list;
+	struct mutex			ioeventfds_lock;
+	struct list_head		ioeventfds_list;
 	struct vfio_pci_vf_token	*vf_token;
-	struct notifier_block	nb;
-	struct mutex		vma_lock;
-	struct list_head	vma_list;
-	struct rw_semaphore	memory_lock;
+	struct notifier_block		nb;
+	struct mutex			vma_lock;
+	struct list_head		vma_list;
+	struct rw_semaphore		memory_lock;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
@@ -225,4 +230,30 @@ static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 }
 #endif
 
+/* Exported functions */
+struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
+		const struct vfio_device_ops *vfio_pci_ops);
+void vfio_destroy_pci_device(struct pci_dev *pdev);
+
+long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
+		unsigned long arg);
+ssize_t vfio_pci_core_read(void *device_data, char __user *buf, size_t count,
+		loff_t *ppos);
+ssize_t vfio_pci_core_write(void *device_data, const char __user *buf,
+		size_t count, loff_t *ppos);
+int vfio_pci_core_mmap(void *device_data, struct vm_area_struct *vma);
+void vfio_pci_core_request(void *device_data, unsigned int count);
+int vfio_pci_core_match(void *device_data, char *buf);
+
+void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
+int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
+void vfio_pci_core_spapr_eeh_open(struct vfio_pci_core_device *vdev);
+void vfio_pci_core_spapr_eeh_release(struct vfio_pci_core_device *vdev);
+void vfio_pci_vf_token_user_add(struct vfio_pci_core_device *vdev, int val);
+void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev);
+
+int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
+
+extern const struct pci_error_handlers vfio_pci_core_err_handlers;
+
 #endif /* VFIO_PCI_CORE_H */
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 5/9] vfio/pci: introduce vfio_pci_device structure
  2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (3 preceding siblings ...)
  2021-03-09  8:33 ` [PATCH 4/9] vfio-pci: introduce vfio_pci_core subsystem driver Max Gurtovoy
@ 2021-03-09  8:33 ` Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 6/9] vfio-pci-core: export vfio_pci_register_dev_region function Max Gurtovoy
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

This structure will hold the specific attributes for the generic
vfio_pci.ko driver. It will be allocated by the vfio_pci driver, which
will register to the VFIO subsystem using vfio_pci_core_register_device
and unregister using vfio_pci_core_unregister_device. In this way,
every future vfio_pci driver will be able to use this mechanism to set
vendor specific attributes in its vendor specific structure and
register to the subsystem core while utilizing the vfio_pci_core
library.

This is standard Linux subsystem behaviour. It will also make it
easier for vfio_pci drivers to extend the vfio_device_ops callbacks
and to use the container_of mechanism (instead of passing void
pointers around the stack).
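
As a rough illustration, a future vendor driver would embed the core
device and recover its private structure with container_of, roughly as
in the sketch below (the "foo" names are hypothetical placeholders,
and error handling, release and vendor logic are trimmed for brevity):

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/slab.h>

#include "vfio_pci_core.h"

struct foo_vfio_pci_device {
	struct vfio_pci_core_device	vdev;	/* embedded core device */
	bool				vendor_feature_enabled;	/* vendor state */
};

static int foo_vfio_pci_open(void *device_data)
{
	struct vfio_pci_core_device *core = device_data;
	struct foo_vfio_pci_device *foo =
		container_of(core, struct foo_vfio_pci_device, vdev);

	/*
	 * Vendor specific state is reachable via container_of; a real
	 * driver would also call vfio_pci_core_enable() here, as
	 * vfio_pci.c does.
	 */
	foo->vendor_feature_enabled = true;
	return 0;
}

static const struct vfio_device_ops foo_vfio_pci_ops = {
	.name		= "foo-vfio-pci",
	.open		= foo_vfio_pci_open,
	.ioctl		= vfio_pci_core_ioctl,
	.read		= vfio_pci_core_read,
	.write		= vfio_pci_core_write,
	.mmap		= vfio_pci_core_mmap,
	.request	= vfio_pci_core_request,
	.match		= vfio_pci_core_match,
};

static int foo_vfio_pci_probe(struct pci_dev *pdev,
			      const struct pci_device_id *id)
{
	struct foo_vfio_pci_device *foo;
	int ret;

	foo = kzalloc(sizeof(*foo), GFP_KERNEL);
	if (!foo)
		return -ENOMEM;

	ret = vfio_pci_core_register_device(&foo->vdev, pdev, &foo_vfio_pci_ops);
	if (ret)
		kfree(foo);
	return ret;
}

static void foo_vfio_pci_remove(struct pci_dev *pdev)
{
	struct vfio_device *device = dev_get_drvdata(&pdev->dev);
	struct vfio_pci_core_device *core = vfio_device_data(device);
	struct foo_vfio_pci_device *foo =
		container_of(core, struct foo_vfio_pci_device, vdev);

	vfio_pci_core_unregister_device(core);
	kfree(foo);
}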

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/vfio_pci.c      | 31 +++++++++++++++++++++----
 drivers/vfio/pci/vfio_pci_core.c | 39 +++++++++++++-------------------
 drivers/vfio/pci/vfio_pci_core.h |  5 ++--
 3 files changed, 45 insertions(+), 30 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 447c31f4e64e..dbc0a6559914 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -19,6 +19,7 @@
 #include <linux/iommu.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/list.h>
 #include <linux/notifier.h>
 #include <linux/pm_runtime.h>
 #include <linux/slab.h>
@@ -31,6 +32,10 @@
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
 #define DRIVER_DESC     "VFIO PCI - User Level meta-driver"
 
+struct vfio_pci_device {
+	struct vfio_pci_core_device	vdev;
+};
+
 static char ids[1024] __initdata;
 module_param_string(ids, ids, sizeof(ids), 0);
 MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]\" and multiple comma separated entries can be specified");
@@ -139,21 +144,37 @@ static const struct vfio_device_ops vfio_pci_ops = {
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
-	struct vfio_pci_core_device *vdev;
+	struct vfio_pci_device *vpdev;
+	int ret;
 
 	if (vfio_pci_is_denylisted(pdev))
 		return -EINVAL;
 
-	vdev = vfio_create_pci_device(pdev, &vfio_pci_ops);
-	if (IS_ERR(vdev))
-		return PTR_ERR(vdev);
+	vpdev = kzalloc(sizeof(*vpdev), GFP_KERNEL);
+	if (!vpdev)
+		return -ENOMEM;
+
+	ret = vfio_pci_core_register_device(&vpdev->vdev, pdev, &vfio_pci_ops);
+	if (ret)
+		goto out_free;
 
 	return 0;
+
+out_free:
+	kfree(vpdev);
+	return ret;
 }
 
 static void vfio_pci_remove(struct pci_dev *pdev)
 {
-	vfio_destroy_pci_device(pdev);
+	struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
+	struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
+	struct vfio_pci_device *vpdev;
+
+	vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
+
+	vfio_pci_core_unregister_device(core_vpdev);
+	kfree(vpdev);
 }
 
 static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 878a3609b916..7b6be1e4646f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1837,15 +1837,15 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
 	return 0;
 }
 
-struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
+int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
+		struct pci_dev *pdev,
 		const struct vfio_device_ops *vfio_pci_ops)
 {
-	struct vfio_pci_core_device *vdev;
 	struct iommu_group *group;
 	int ret;
 
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
-		return ERR_PTR(-EINVAL);
+		return -EINVAL;
 
 	/*
 	 * Prevent binding to PFs with VFs enabled, the VFs might be in use
@@ -1857,18 +1857,12 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
 	 */
 	if (pci_num_vf(pdev)) {
 		pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
-		return ERR_PTR(-EBUSY);
+		return -EBUSY;
 	}
 
 	group = vfio_iommu_group_get(&pdev->dev);
 	if (!group)
-		return ERR_PTR(-EINVAL);
-
-	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
-	if (!vdev) {
-		ret = -ENOMEM;
-		goto out_group_put;
-	}
+		return -EINVAL;
 
 	vdev->pdev = pdev;
 	vdev->vfio_pci_ops = vfio_pci_ops;
@@ -1884,7 +1878,7 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
 
 	ret = vfio_add_group_dev(&pdev->dev, vfio_pci_ops, vdev);
 	if (ret)
-		goto out_free;
+		goto out_group_put;
 
 	ret = vfio_pci_reflck_attach(vdev);
 	if (ret)
@@ -1928,7 +1922,7 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 	}
 
-	return vdev;
+	return 0;
 
 out_vf_token:
 	kfree(vdev->vf_token);
@@ -1936,22 +1930,22 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
 	vfio_pci_reflck_put(vdev->reflck);
 out_del_group_dev:
 	vfio_del_group_dev(&pdev->dev);
-out_free:
-	kfree(vdev);
 out_group_put:
 	vfio_iommu_group_put(group, &pdev->dev);
-	return ERR_PTR(ret);
+	return ret;
 }
-EXPORT_SYMBOL_GPL(vfio_create_pci_device);
+EXPORT_SYMBOL_GPL(vfio_pci_core_register_device);
 
-void vfio_destroy_pci_device(struct pci_dev *pdev)
+void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 {
-	struct vfio_pci_core_device *vdev;
+	struct pci_dev *pdev;
+	struct vfio_pci_core_device *g_vdev;
 
+	pdev = vdev->pdev;
 	pci_disable_sriov(pdev);
 
-	vdev = vfio_del_group_dev(&pdev->dev);
-	if (!vdev)
+	g_vdev = vfio_del_group_dev(&pdev->dev);
+	if (g_vdev != vdev)
 		return;
 
 	if (vdev->vf_token) {
@@ -1973,7 +1967,6 @@ void vfio_destroy_pci_device(struct pci_dev *pdev)
 		vfio_pci_set_power_state(vdev, PCI_D0);
 
 	kfree(vdev->pm_save);
-	kfree(vdev);
 
 	if (vfio_pci_is_vga(pdev)) {
 		vga_client_register(pdev, NULL, NULL, NULL);
@@ -1982,7 +1975,7 @@ void vfio_destroy_pci_device(struct pci_dev *pdev)
 				VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
 	}
 }
-EXPORT_SYMBOL_GPL(vfio_destroy_pci_device);
+EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
 
 static pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
 		pci_channel_state_t state)
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index a3517a9472bd..46eb3443125b 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -231,9 +231,10 @@ static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 #endif
 
 /* Exported functions */
-struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
+int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
+		struct pci_dev *pdev,
 		const struct vfio_device_ops *vfio_pci_ops);
-void vfio_destroy_pci_device(struct pci_dev *pdev);
+void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev);
 
 long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
 		unsigned long arg);
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 6/9] vfio-pci-core: export vfio_pci_register_dev_region function
  2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (4 preceding siblings ...)
  2021-03-09  8:33 ` [PATCH 5/9] vfio/pci: introduce vfio_pci_device structure Max Gurtovoy
@ 2021-03-09  8:33 ` Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 7/9] vfio/pci_core: split nvlink2 to nvlink2gpu and npu2 Max Gurtovoy
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

This function allows vendor drivers to register regions that will be
used and accessed through the core subsystem driver. This way, the
core will use region ops that are vendor specific and managed by the
vendor vfio-pci driver.
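
For example, a vendor driver could register a small scratch region as
in the hypothetical sketch below (the "foo" names and the subtype
value are placeholders; a real driver would add proper uapi defines
and likely .mmap/.add_capability implementations as nvlink2 does):

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/vfio.h>

#include "vfio_pci_core.h"

struct foo_region_data {
	u8 buf[256];
};

static size_t foo_region_rw(struct vfio_pci_core_device *vdev,
		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
{
	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
	struct foo_region_data *data = vdev->region[i].data;
	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;

	if (pos >= sizeof(data->buf))
		return -EINVAL;
	count = min(count, (size_t)(sizeof(data->buf) - pos));

	if (iswrite) {
		if (copy_from_user(data->buf + pos, buf, count))
			return -EFAULT;
	} else {
		if (copy_to_user(buf, data->buf + pos, count))
			return -EFAULT;
	}

	*ppos += count;
	return count;
}

static void foo_region_release(struct vfio_pci_core_device *vdev,
		struct vfio_pci_region *region)
{
	kfree(region->data);
}

static const struct vfio_pci_regops foo_regops = {
	.rw		= foo_region_rw,
	.release	= foo_region_release,
};

static int foo_register_scratch_region(struct vfio_pci_core_device *vdev)
{
	struct foo_region_data *data;
	int ret;

	data = kzalloc(sizeof(*data), GFP_KERNEL);
	if (!data)
		return -ENOMEM;

	/* subtype 1 is a placeholder; real drivers define one in the vfio uapi */
	ret = vfio_pci_register_dev_region(vdev,
			VFIO_REGION_TYPE_PCI_VENDOR_TYPE | vdev->pdev->vendor,
			1, &foo_regops, sizeof(data->buf),
			VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
			data);
	if (ret)
		kfree(data);

	return ret;
}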

The next step is to move the igd and nvlink2 logic to dedicated
modules instead of managing their vendor specific extensions in the
core driver.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 1 +
 drivers/vfio/pci/vfio_pci_core.h | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7b6be1e4646f..ba5dd4321487 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -711,6 +711,7 @@ int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
 
 struct vfio_devices {
 	struct vfio_device **devices;
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index 46eb3443125b..60b42df6c519 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -257,4 +257,9 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
 
 extern const struct pci_error_handlers vfio_pci_core_err_handlers;
 
+int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
+		unsigned int type, unsigned int subtype,
+		const struct vfio_pci_regops *ops,
+		size_t size, u32 flags, void *data);
+
 #endif /* VFIO_PCI_CORE_H */
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 7/9] vfio/pci_core: split nvlink2 to nvlink2gpu and npu2
  2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (5 preceding siblings ...)
  2021-03-09  8:33 ` [PATCH 6/9] vfio-pci-core: export vfio_pci_register_dev_region function Max Gurtovoy
@ 2021-03-09  8:33 ` Max Gurtovoy
  2021-03-10  8:08   ` Christoph Hellwig
  2021-03-09  8:33 ` [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers Max Gurtovoy
  2021-03-09  8:33 ` [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver Max Gurtovoy
  8 siblings, 1 reply; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

This is a preparation for moving vendor specific code from
vfio_pci_core to vendor specific vfio_pci drivers. The next step will be
creating a dedicated module for NVIDIA NVLINK2 devices with P9 extensions
and a dedicated module for POWER9 NPU NVLink2 HBAs.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Makefile                     |   2 +-
 drivers/vfio/pci/npu2_trace.h                 |  50 ++++
 .../vfio/pci/{trace.h => nvlink2gpu_trace.h}  |  27 +--
 drivers/vfio/pci/vfio_pci_core.c              |   2 +-
 drivers/vfio/pci/vfio_pci_core.h              |   4 +-
 drivers/vfio/pci/vfio_pci_npu2.c              | 222 ++++++++++++++++++
 ...io_pci_nvlink2.c => vfio_pci_nvlink2gpu.c} | 201 +---------------
 7 files changed, 280 insertions(+), 228 deletions(-)
 create mode 100644 drivers/vfio/pci/npu2_trace.h
 rename drivers/vfio/pci/{trace.h => nvlink2gpu_trace.h} (72%)
 create mode 100644 drivers/vfio/pci/vfio_pci_npu2.c
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_nvlink2gpu.c} (59%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 16e7d77d63ce..f539f32c9296 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -5,7 +5,7 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o
 vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/npu2_trace.h b/drivers/vfio/pci/npu2_trace.h
new file mode 100644
index 000000000000..c8a1110132dc
--- /dev/null
+++ b/drivers/vfio/pci/npu2_trace.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * VFIO PCI mmap/mmap_fault tracepoints
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfio_pci
+
+#if !defined(_TRACE_VFIO_PCI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFIO_PCI_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(vfio_pci_npu2_mmap,
+	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
+			unsigned long size, int ret),
+	TP_ARGS(pdev, hpa, ua, size, ret),
+
+	TP_STRUCT__entry(
+		__field(const char *, name)
+		__field(unsigned long, hpa)
+		__field(unsigned long, ua)
+		__field(unsigned long, size)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->name = dev_name(&pdev->dev),
+		__entry->hpa = hpa;
+		__entry->ua = ua;
+		__entry->size = size;
+		__entry->ret = ret;
+	),
+
+	TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
+			__entry->ua, __entry->size, __entry->ret)
+);
+
+#endif /* _TRACE_VFIO_PCI_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../drivers/vfio/pci
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE npu2_trace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/nvlink2gpu_trace.h
similarity index 72%
rename from drivers/vfio/pci/trace.h
rename to drivers/vfio/pci/nvlink2gpu_trace.h
index b2aa986ab9ed..2392b9d4c6c9 100644
--- a/drivers/vfio/pci/trace.h
+++ b/drivers/vfio/pci/nvlink2gpu_trace.h
@@ -62,37 +62,12 @@ TRACE_EVENT(vfio_pci_nvgpu_mmap,
 			__entry->ua, __entry->size, __entry->ret)
 );
 
-TRACE_EVENT(vfio_pci_npu2_mmap,
-	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
-			unsigned long size, int ret),
-	TP_ARGS(pdev, hpa, ua, size, ret),
-
-	TP_STRUCT__entry(
-		__field(const char *, name)
-		__field(unsigned long, hpa)
-		__field(unsigned long, ua)
-		__field(unsigned long, size)
-		__field(int, ret)
-	),
-
-	TP_fast_assign(
-		__entry->name = dev_name(&pdev->dev),
-		__entry->hpa = hpa;
-		__entry->ua = ua;
-		__entry->size = size;
-		__entry->ret = ret;
-	),
-
-	TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
-			__entry->ua, __entry->size, __entry->ret)
-);
-
 #endif /* _TRACE_VFIO_PCI_H */
 
 #undef TRACE_INCLUDE_PATH
 #define TRACE_INCLUDE_PATH ../../drivers/vfio/pci
 #undef TRACE_INCLUDE_FILE
-#define TRACE_INCLUDE_FILE trace
+#define TRACE_INCLUDE_FILE nvlink2gpu_trace
 
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ba5dd4321487..4de8e352df9c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -356,7 +356,7 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 
 	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
 	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
-		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
+		ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
 		if (ret && ret != -ENODEV) {
 			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
 			goto disable_exit;
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index 60b42df6c519..8989443c3086 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -205,10 +205,10 @@ static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
 }
 #endif
 #ifdef CONFIG_VFIO_PCI_NVLINK2
-extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_core_device *vdev);
+extern int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev);
 extern int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev);
 #else
-static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
+static inline int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
 {
 	return -ENODEV;
 }
diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/vfio_pci_npu2.c
new file mode 100644
index 000000000000..717745256ab3
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_npu2.c
@@ -0,0 +1,222 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VFIO PCI driver for POWER9 NPU support (NVLink2 host bus adapter).
+ *
+ * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights reserved.
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * Register an on-GPU RAM region for cacheable access.
+ *
+ * Derived from original vfio_pci_igd.c:
+ * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
+ *	Author: Alex Williamson <alex.williamson@redhat.com>
+ */
+
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/mmu_context.h>
+#include <asm/kvm_ppc.h>
+
+#include "vfio_pci_core.h"
+
+#define CREATE_TRACE_POINTS
+#include "npu2_trace.h"
+
+EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
+
+struct vfio_pci_npu2_data {
+	void *base; /* ATSD register virtual address, for emulated access */
+	unsigned long mmio_atsd; /* ATSD physical address */
+	unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */
+	unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */
+};
+
+static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev,
+		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_pci_npu2_data *data = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (iswrite) {
+		if (copy_from_user(data->base + pos, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, data->base + pos, count))
+			return -EFAULT;
+	}
+	*ppos += count;
+
+	return count;
+}
+
+static int vfio_pci_npu2_mmap(struct vfio_pci_core_device *vdev,
+		struct vfio_pci_region *region, struct vm_area_struct *vma)
+{
+	int ret;
+	struct vfio_pci_npu2_data *data = region->data;
+	unsigned long req_len = vma->vm_end - vma->vm_start;
+
+	if (req_len != PAGE_SIZE)
+		return -EINVAL;
+
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	ret = remap_pfn_range(vma, vma->vm_start, data->mmio_atsd >> PAGE_SHIFT,
+			req_len, vma->vm_page_prot);
+	trace_vfio_pci_npu2_mmap(vdev->pdev, data->mmio_atsd, vma->vm_start,
+			vma->vm_end - vma->vm_start, ret);
+
+	return ret;
+}
+
+static void vfio_pci_npu2_release(struct vfio_pci_core_device *vdev,
+		struct vfio_pci_region *region)
+{
+	struct vfio_pci_npu2_data *data = region->data;
+
+	memunmap(data->base);
+	kfree(data);
+}
+
+static int vfio_pci_npu2_add_capability(struct vfio_pci_core_device *vdev,
+		struct vfio_pci_region *region, struct vfio_info_cap *caps)
+{
+	struct vfio_pci_npu2_data *data = region->data;
+	struct vfio_region_info_cap_nvlink2_ssatgt captgt = {
+		.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
+		.header.version = 1,
+		.tgt = data->gpu_tgt
+	};
+	struct vfio_region_info_cap_nvlink2_lnkspd capspd = {
+		.header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD,
+		.header.version = 1,
+		.link_speed = data->link_speed
+	};
+	int ret;
+
+	ret = vfio_info_add_capability(caps, &captgt.header, sizeof(captgt));
+	if (ret)
+		return ret;
+
+	return vfio_info_add_capability(caps, &capspd.header, sizeof(capspd));
+}
+
+static const struct vfio_pci_regops vfio_pci_npu2_regops = {
+	.rw = vfio_pci_npu2_rw,
+	.mmap = vfio_pci_npu2_mmap,
+	.release = vfio_pci_npu2_release,
+	.add_capability = vfio_pci_npu2_add_capability,
+};
+
+int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
+{
+	int ret;
+	struct vfio_pci_npu2_data *data;
+	struct device_node *nvlink_dn;
+	u32 nvlink_index = 0, mem_phandle = 0;
+	struct pci_dev *npdev = vdev->pdev;
+	struct device_node *npu_node = pci_device_to_OF_node(npdev);
+	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
+	u64 mmio_atsd = 0;
+	u64 tgt = 0;
+	u32 link_speed = 0xff;
+
+	/*
+	 * PCI config space does not tell us about NVLink presence but
+	 * platform does, use this.
+	 */
+	if (!pnv_pci_get_gpu_dev(vdev->pdev))
+		return -ENODEV;
+
+	if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
+		return -ENODEV;
+
+	/*
+	 * NPU2 normally has 8 ATSD registers (for concurrency) and 6 links
+	 * so we can allocate one register per link, using nvlink index as
+	 * a key.
+	 * There is always at least one ATSD register so as long as at least
+	 * NVLink bridge #0 is passed to the guest, ATSD will be available.
+	 */
+	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
+	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
+			&nvlink_index)))
+		return -ENODEV;
+
+	if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", nvlink_index,
+			&mmio_atsd)) {
+		if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", 0,
+				&mmio_atsd)) {
+			dev_warn(&vdev->pdev->dev, "No available ATSD found\n");
+			mmio_atsd = 0;
+		} else {
+			dev_warn(&vdev->pdev->dev,
+				 "Using fallback ibm,mmio-atsd[0] for ATSD.\n");
+		}
+	}
+
+	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
+		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
+		return -EFAULT;
+	}
+
+	if (of_property_read_u32(npu_node, "ibm,nvlink-speed", &link_speed)) {
+		dev_warn(&vdev->pdev->dev, "No ibm,nvlink-speed found\n");
+		return -EFAULT;
+	}
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->mmio_atsd = mmio_atsd;
+	data->gpu_tgt = tgt;
+	data->link_speed = link_speed;
+	if (data->mmio_atsd) {
+		data->base = memremap(data->mmio_atsd, SZ_64K, MEMREMAP_WT);
+		if (!data->base) {
+			ret = -ENOMEM;
+			goto free_exit;
+		}
+	}
+
+	/*
+	 * We want to expose the capability even if this specific NVLink
+	 * did not get its own ATSD register because capabilities
+	 * belong to VFIO regions and normally there will be ATSD register
+	 * assigned to the NVLink bridge.
+	 */
+	ret = vfio_pci_register_dev_region(vdev,
+			PCI_VENDOR_ID_IBM |
+			VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+			VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
+			&vfio_pci_npu2_regops,
+			data->mmio_atsd ? PAGE_SIZE : 0,
+			VFIO_REGION_INFO_FLAG_READ |
+			VFIO_REGION_INFO_FLAG_WRITE |
+			VFIO_REGION_INFO_FLAG_MMAP,
+			data);
+	if (ret)
+		goto free_exit;
+
+	return 0;
+
+free_exit:
+	if (data->base)
+		memunmap(data->base);
+	kfree(data);
+
+	return ret;
+}
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2gpu.c
similarity index 59%
rename from drivers/vfio/pci/vfio_pci_nvlink2.c
rename to drivers/vfio/pci/vfio_pci_nvlink2gpu.c
index 8ef2c62a9d27..6dce1e78ee82 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2gpu.c
@@ -19,14 +19,14 @@
 #include <linux/sched/mm.h>
 #include <linux/mmu_context.h>
 #include <asm/kvm_ppc.h>
+
 #include "vfio_pci_core.h"
 
 #define CREATE_TRACE_POINTS
-#include "trace.h"
+#include "nvlink2gpu_trace.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap);
-EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
 
 struct vfio_pci_nvgpu_data {
 	unsigned long gpu_hpa; /* GPU RAM physical address */
@@ -207,7 +207,7 @@ static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
+int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
 {
 	int ret;
 	u64 reg[2];
@@ -293,198 +293,3 @@ int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
 
 	return ret;
 }
-
-/*
- * IBM NPU2 bridge
- */
-struct vfio_pci_npu2_data {
-	void *base; /* ATSD register virtual address, for emulated access */
-	unsigned long mmio_atsd; /* ATSD physical address */
-	unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */
-	unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */
-};
-
-static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev,
-		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
-{
-	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
-	struct vfio_pci_npu2_data *data = vdev->region[i].data;
-	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
-
-	if (pos >= vdev->region[i].size)
-		return -EINVAL;
-
-	count = min(count, (size_t)(vdev->region[i].size - pos));
-
-	if (iswrite) {
-		if (copy_from_user(data->base + pos, buf, count))
-			return -EFAULT;
-	} else {
-		if (copy_to_user(buf, data->base + pos, count))
-			return -EFAULT;
-	}
-	*ppos += count;
-
-	return count;
-}
-
-static int vfio_pci_npu2_mmap(struct vfio_pci_core_device *vdev,
-		struct vfio_pci_region *region, struct vm_area_struct *vma)
-{
-	int ret;
-	struct vfio_pci_npu2_data *data = region->data;
-	unsigned long req_len = vma->vm_end - vma->vm_start;
-
-	if (req_len != PAGE_SIZE)
-		return -EINVAL;
-
-	vma->vm_flags |= VM_PFNMAP;
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-
-	ret = remap_pfn_range(vma, vma->vm_start, data->mmio_atsd >> PAGE_SHIFT,
-			req_len, vma->vm_page_prot);
-	trace_vfio_pci_npu2_mmap(vdev->pdev, data->mmio_atsd, vma->vm_start,
-			vma->vm_end - vma->vm_start, ret);
-
-	return ret;
-}
-
-static void vfio_pci_npu2_release(struct vfio_pci_core_device *vdev,
-		struct vfio_pci_region *region)
-{
-	struct vfio_pci_npu2_data *data = region->data;
-
-	memunmap(data->base);
-	kfree(data);
-}
-
-static int vfio_pci_npu2_add_capability(struct vfio_pci_core_device *vdev,
-		struct vfio_pci_region *region, struct vfio_info_cap *caps)
-{
-	struct vfio_pci_npu2_data *data = region->data;
-	struct vfio_region_info_cap_nvlink2_ssatgt captgt = {
-		.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
-		.header.version = 1,
-		.tgt = data->gpu_tgt
-	};
-	struct vfio_region_info_cap_nvlink2_lnkspd capspd = {
-		.header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD,
-		.header.version = 1,
-		.link_speed = data->link_speed
-	};
-	int ret;
-
-	ret = vfio_info_add_capability(caps, &captgt.header, sizeof(captgt));
-	if (ret)
-		return ret;
-
-	return vfio_info_add_capability(caps, &capspd.header, sizeof(capspd));
-}
-
-static const struct vfio_pci_regops vfio_pci_npu2_regops = {
-	.rw = vfio_pci_npu2_rw,
-	.mmap = vfio_pci_npu2_mmap,
-	.release = vfio_pci_npu2_release,
-	.add_capability = vfio_pci_npu2_add_capability,
-};
-
-int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
-{
-	int ret;
-	struct vfio_pci_npu2_data *data;
-	struct device_node *nvlink_dn;
-	u32 nvlink_index = 0, mem_phandle = 0;
-	struct pci_dev *npdev = vdev->pdev;
-	struct device_node *npu_node = pci_device_to_OF_node(npdev);
-	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
-	u64 mmio_atsd = 0;
-	u64 tgt = 0;
-	u32 link_speed = 0xff;
-
-	/*
-	 * PCI config space does not tell us about NVLink presense but
-	 * platform does, use this.
-	 */
-	if (!pnv_pci_get_gpu_dev(vdev->pdev))
-		return -ENODEV;
-
-	if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
-		return -ENODEV;
-
-	/*
-	 * NPU2 normally has 8 ATSD registers (for concurrency) and 6 links
-	 * so we can allocate one register per link, using nvlink index as
-	 * a key.
-	 * There is always at least one ATSD register so as long as at least
-	 * NVLink bridge #0 is passed to the guest, ATSD will be available.
-	 */
-	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
-	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
-			&nvlink_index)))
-		return -ENODEV;
-
-	if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", nvlink_index,
-			&mmio_atsd)) {
-		if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", 0,
-				&mmio_atsd)) {
-			dev_warn(&vdev->pdev->dev, "No available ATSD found\n");
-			mmio_atsd = 0;
-		} else {
-			dev_warn(&vdev->pdev->dev,
-				 "Using fallback ibm,mmio-atsd[0] for ATSD.\n");
-		}
-	}
-
-	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
-		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
-		return -EFAULT;
-	}
-
-	if (of_property_read_u32(npu_node, "ibm,nvlink-speed", &link_speed)) {
-		dev_warn(&vdev->pdev->dev, "No ibm,nvlink-speed found\n");
-		return -EFAULT;
-	}
-
-	data = kzalloc(sizeof(*data), GFP_KERNEL);
-	if (!data)
-		return -ENOMEM;
-
-	data->mmio_atsd = mmio_atsd;
-	data->gpu_tgt = tgt;
-	data->link_speed = link_speed;
-	if (data->mmio_atsd) {
-		data->base = memremap(data->mmio_atsd, SZ_64K, MEMREMAP_WT);
-		if (!data->base) {
-			ret = -ENOMEM;
-			goto free_exit;
-		}
-	}
-
-	/*
-	 * We want to expose the capability even if this specific NVLink
-	 * did not get its own ATSD register because capabilities
-	 * belong to VFIO regions and normally there will be ATSD register
-	 * assigned to the NVLink bridge.
-	 */
-	ret = vfio_pci_register_dev_region(vdev,
-			PCI_VENDOR_ID_IBM |
-			VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
-			VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
-			&vfio_pci_npu2_regops,
-			data->mmio_atsd ? PAGE_SIZE : 0,
-			VFIO_REGION_INFO_FLAG_READ |
-			VFIO_REGION_INFO_FLAG_WRITE |
-			VFIO_REGION_INFO_FLAG_MMAP,
-			data);
-	if (ret)
-		goto free_exit;
-
-	return 0;
-
-free_exit:
-	if (data->base)
-		memunmap(data->base);
-	kfree(data);
-
-	return ret;
-}
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (6 preceding siblings ...)
  2021-03-09  8:33 ` [PATCH 7/9] vfio/pci_core: split nvlink2 to nvlink2gpu and npu2 Max Gurtovoy
@ 2021-03-09  8:33 ` Max Gurtovoy
  2021-03-10  6:39   ` Alexey Kardashevskiy
  2021-03-09  8:33 ` [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver Max Gurtovoy
  8 siblings, 1 reply; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

The new drivers introduced are nvlink2gpu_vfio_pci.ko and
npu2_vfio_pci.ko.
The former will be responsible for providing special extensions for
NVIDIA GPUs with NVLINK2 support on the P9 platform (and others in the
future). The latter will be responsible for the POWER9 NPU2 unit
(NVLink2 host bus adapter).

Also, preserve backward compatibility for users that were binding
NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will
be dropped in the future.
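
In rough outline, both new modules follow the same pattern on top of
vfio-pci-core (condensed from this patch; see the full probe/open
implementations in the diff below):

	struct npu2_vfio_pci_device {
		struct vfio_pci_core_device vdev;
	};

	static int npu2_vfio_pci_probe(struct pci_dev *pdev,
			const struct pci_device_id *id)
	{
		struct npu2_vfio_pci_device *npvdev;
		int ret;

		npvdev = kzalloc(sizeof(*npvdev), GFP_KERNEL);
		if (!npvdev)
			return -ENOMEM;

		/* register to the VFIO subsystem through vfio-pci-core */
		ret = vfio_pci_core_register_device(&npvdev->vdev, pdev,
				&npu2_vfio_pci_ops);
		if (ret)
			kfree(npvdev);
		return ret;
	}

The first open() then calls vfio_pci_core_enable() and runs the vendor
init (vfio_pci_ibm_npu2_init() here, the NVLINK2 GPU init in the other
module) before exposing the device, which vfio_pci_core_enable()
previously did itself based on the PCI vendor ID.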

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Kconfig                      |  28 +++-
 drivers/vfio/pci/Makefile                     |   7 +-
 .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 ++++++++++++++++-
 drivers/vfio/pci/npu2_vfio_pci.h              |  24 +++
 ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +++++++++++++++++-
 drivers/vfio/pci/nvlink2gpu_vfio_pci.h        |  24 +++
 drivers/vfio/pci/vfio_pci.c                   |  61 ++++++-
 drivers/vfio/pci/vfio_pci_core.c              |  18 ---
 drivers/vfio/pci/vfio_pci_core.h              |  14 --
 9 files changed, 422 insertions(+), 47 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
 create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
 rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} (67%)
 create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 829e90a2e5a3..88c89863a205 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -48,8 +48,30 @@ config VFIO_PCI_IGD
 
 	  To enable Intel IGD assignment through vfio-pci, say Y.
 
-config VFIO_PCI_NVLINK2
-	def_bool y
+config VFIO_PCI_NVLINK2GPU
+	tristate "VFIO support for NVIDIA NVLINK2 GPUs"
 	depends on VFIO_PCI_CORE && PPC_POWERNV
 	help
-	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+	  VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions
+	  for P9 Witherspoon machine.
+
+config VFIO_PCI_NPU2
+	tristate "VFIO support for IBM NPU host bus adapter on P9"
+	depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
+	help
+	  VFIO PCI specific extensions for IBM NVLink2 host bus adapter on P9
+	  Witherspoon machine.
+
+config VFIO_PCI_DRIVER_COMPAT
+	bool "VFIO PCI backward compatibility for vendor specific extensions"
+	default y
+	depends on VFIO_PCI
+	help
+	  Say Y here if you want to preserve VFIO PCI backward
+	  compatibility. vfio_pci.ko will continue to automatically use
+	  the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
+	  a compatible device.
+
+	  When N is selected the user must bind explicitly to the module
+	  they want to handle the device and vfio_pci.ko will have no
+	  device specific special behaviors.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f539f32c9296..86fb62e271fc 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,10 +2,15 @@
 
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o
 vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
+
+npu2-vfio-pci-y := npu2_vfio_pci.o
+
+nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/npu2_vfio_pci.c
similarity index 64%
rename from drivers/vfio/pci/vfio_pci_npu2.c
rename to drivers/vfio/pci/npu2_vfio_pci.c
index 717745256ab3..7071bda0f2b6 100644
--- a/drivers/vfio/pci/vfio_pci_npu2.c
+++ b/drivers/vfio/pci/npu2_vfio_pci.c
@@ -14,19 +14,28 @@
  *	Author: Alex Williamson <alex.williamson@redhat.com>
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
 #include <linux/io.h>
 #include <linux/pci.h>
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
+#include <linux/list.h>
 #include <linux/sched/mm.h>
 #include <linux/mmu_context.h>
 #include <asm/kvm_ppc.h>
 
 #include "vfio_pci_core.h"
+#include "npu2_vfio_pci.h"
 
 #define CREATE_TRACE_POINTS
 #include "npu2_trace.h"
 
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
+#define DRIVER_DESC     "NPU2 VFIO PCI - User Level meta-driver for POWER9 NPU NVLink2 HBA"
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
 
 struct vfio_pci_npu2_data {
@@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
 	unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */
 };
 
+struct npu2_vfio_pci_device {
+	struct vfio_pci_core_device	vdev;
+};
+
 static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev,
 		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
 {
@@ -120,7 +133,7 @@ static const struct vfio_pci_regops vfio_pci_npu2_regops = {
 	.add_capability = vfio_pci_npu2_add_capability,
 };
 
-int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
+static int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
 {
 	int ret;
 	struct vfio_pci_npu2_data *data;
@@ -220,3 +233,132 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
 
 	return ret;
 }
+
+static void npu2_vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_core_device *vdev = device_data;
+
+	mutex_lock(&vdev->reflck->lock);
+	if (!(--vdev->refcnt)) {
+		vfio_pci_vf_token_user_add(vdev, -1);
+		vfio_pci_core_spapr_eeh_release(vdev);
+		vfio_pci_core_disable(vdev);
+	}
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int npu2_vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_core_device *vdev = device_data;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_core_enable(vdev);
+		if (ret)
+			goto error;
+
+		ret = vfio_pci_ibm_npu2_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(vdev->pdev,
+				 "Failed to setup NVIDIA NV2 ATSD region\n");
+			vfio_pci_core_disable(vdev);
+			goto error;
+		}
+		ret = 0;
+		vfio_pci_probe_mmaps(vdev);
+		vfio_pci_core_spapr_eeh_open(vdev);
+		vfio_pci_vf_token_user_add(vdev, 1);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (ret)
+		module_put(THIS_MODULE);
+	return ret;
+}
+
+static const struct vfio_device_ops npu2_vfio_pci_ops = {
+	.name		= "npu2-vfio-pci",
+	.open		= npu2_vfio_pci_open,
+	.release	= npu2_vfio_pci_release,
+	.ioctl		= vfio_pci_core_ioctl,
+	.read		= vfio_pci_core_read,
+	.write		= vfio_pci_core_write,
+	.mmap		= vfio_pci_core_mmap,
+	.request	= vfio_pci_core_request,
+	.match		= vfio_pci_core_match,
+};
+
+static int npu2_vfio_pci_probe(struct pci_dev *pdev,
+		const struct pci_device_id *id)
+{
+	struct npu2_vfio_pci_device *npvdev;
+	int ret;
+
+	npvdev = kzalloc(sizeof(*npvdev), GFP_KERNEL);
+	if (!npvdev)
+		return -ENOMEM;
+
+	ret = vfio_pci_core_register_device(&npvdev->vdev, pdev,
+			&npu2_vfio_pci_ops);
+	if (ret)
+		goto out_free;
+
+	return 0;
+
+out_free:
+	kfree(npvdev);
+	return ret;
+}
+
+static void npu2_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
+	struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
+	struct npu2_vfio_pci_device *npvdev;
+
+	npvdev = container_of(core_vpdev, struct npu2_vfio_pci_device, vdev);
+
+	vfio_pci_core_unregister_device(core_vpdev);
+	kfree(npvdev);
+}
+
+static const struct pci_device_id npu2_vfio_pci_table[] = {
+	{ PCI_VDEVICE(IBM, 0x04ea) },
+	{ 0, }
+};
+
+static struct pci_driver npu2_vfio_pci_driver = {
+	.name			= "npu2-vfio-pci",
+	.id_table		= npu2_vfio_pci_table,
+	.probe			= npu2_vfio_pci_probe,
+	.remove			= npu2_vfio_pci_remove,
+#ifdef CONFIG_PCI_IOV
+	.sriov_configure	= vfio_pci_core_sriov_configure,
+#endif
+	.err_handler		= &vfio_pci_core_err_handlers,
+};
+
+#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
+struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
+{
+	if (pci_match_id(npu2_vfio_pci_driver.id_table, pdev))
+		return &npu2_vfio_pci_driver;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(get_npu2_vfio_pci_driver);
+#endif
+
+module_pci_driver(npu2_vfio_pci_driver);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/npu2_vfio_pci.h b/drivers/vfio/pci/npu2_vfio_pci.h
new file mode 100644
index 000000000000..92010d340346
--- /dev/null
+++ b/drivers/vfio/pci/npu2_vfio_pci.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights reserved.
+ *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
+ */
+
+#ifndef NPU2_VFIO_PCI_H
+#define NPU2_VFIO_PCI_H
+
+#include <linux/pci.h>
+#include <linux/module.h>
+
+#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
+#if defined(CONFIG_VFIO_PCI_NPU2) || defined(CONFIG_VFIO_PCI_NPU2_MODULE)
+struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev);
+#else
+struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
+{
+	return NULL;
+}
+#endif
+#endif
+
+#endif /* NPU2_VFIO_PCI_H */
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
similarity index 67%
rename from drivers/vfio/pci/vfio_pci_nvlink2gpu.c
rename to drivers/vfio/pci/nvlink2gpu_vfio_pci.c
index 6dce1e78ee82..84a5ac1ce8ac 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c
+++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
+ * VFIO PCI NVIDIA NVLink2 GPUs support.
  *
  * Copyright (C) 2018 IBM Corp.  All rights reserved.
  *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
@@ -12,6 +12,9 @@
  *	Author: Alex Williamson <alex.williamson@redhat.com>
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
 #include <linux/io.h>
 #include <linux/pci.h>
 #include <linux/uaccess.h>
@@ -21,10 +24,15 @@
 #include <asm/kvm_ppc.h>
 
 #include "vfio_pci_core.h"
+#include "nvlink2gpu_vfio_pci.h"
 
 #define CREATE_TRACE_POINTS
 #include "nvlink2gpu_trace.h"
 
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
+#define DRIVER_DESC     "NVLINK2GPU VFIO PCI - User Level meta-driver for NVIDIA NVLink2 GPUs"
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap);
 
@@ -39,6 +47,10 @@ struct vfio_pci_nvgpu_data {
 	struct notifier_block group_notifier;
 };
 
+struct nv_vfio_pci_device {
+	struct vfio_pci_core_device	vdev;
+};
+
 static size_t vfio_pci_nvgpu_rw(struct vfio_pci_core_device *vdev,
 		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
 {
@@ -207,7 +219,8 @@ static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
+static int
+vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
 {
 	int ret;
 	u64 reg[2];
@@ -293,3 +306,135 @@ int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
 
 	return ret;
 }
+
+static void nvlink2gpu_vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_core_device *vdev = device_data;
+
+	mutex_lock(&vdev->reflck->lock);
+	if (!(--vdev->refcnt)) {
+		vfio_pci_vf_token_user_add(vdev, -1);
+		vfio_pci_core_spapr_eeh_release(vdev);
+		vfio_pci_core_disable(vdev);
+	}
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int nvlink2gpu_vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_core_device *vdev = device_data;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_core_enable(vdev);
+		if (ret)
+			goto error;
+
+		ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(vdev->pdev,
+				 "Failed to setup NVIDIA NV2 RAM region\n");
+			vfio_pci_core_disable(vdev);
+			goto error;
+		}
+		ret = 0;
+		vfio_pci_probe_mmaps(vdev);
+		vfio_pci_core_spapr_eeh_open(vdev);
+		vfio_pci_vf_token_user_add(vdev, 1);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (ret)
+		module_put(THIS_MODULE);
+	return ret;
+}
+
+static const struct vfio_device_ops nvlink2gpu_vfio_pci_ops = {
+	.name		= "nvlink2gpu-vfio-pci",
+	.open		= nvlink2gpu_vfio_pci_open,
+	.release	= nvlink2gpu_vfio_pci_release,
+	.ioctl		= vfio_pci_core_ioctl,
+	.read		= vfio_pci_core_read,
+	.write		= vfio_pci_core_write,
+	.mmap		= vfio_pci_core_mmap,
+	.request	= vfio_pci_core_request,
+	.match		= vfio_pci_core_match,
+};
+
+static int nvlink2gpu_vfio_pci_probe(struct pci_dev *pdev,
+		const struct pci_device_id *id)
+{
+	struct nv_vfio_pci_device *nvdev;
+	int ret;
+
+	nvdev = kzalloc(sizeof(*nvdev), GFP_KERNEL);
+	if (!nvdev)
+		return -ENOMEM;
+
+	ret = vfio_pci_core_register_device(&nvdev->vdev, pdev,
+			&nvlink2gpu_vfio_pci_ops);
+	if (ret)
+		goto out_free;
+
+	return 0;
+
+out_free:
+	kfree(nvdev);
+	return ret;
+}
+
+static void nvlink2gpu_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
+	struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
+	struct nv_vfio_pci_device *nvdev;
+
+	nvdev = container_of(core_vpdev, struct nv_vfio_pci_device, vdev);
+
+	vfio_pci_core_unregister_device(core_vpdev);
+	kfree(nvdev);
+}
+
+static const struct pci_device_id nvlink2gpu_vfio_pci_table[] = {
+	{ PCI_VDEVICE(NVIDIA, 0x1DB1) }, /* GV100GL-A NVIDIA Tesla V100-SXM2-16GB */
+	{ PCI_VDEVICE(NVIDIA, 0x1DB5) }, /* GV100GL-A NVIDIA Tesla V100-SXM2-32GB */
+	{ PCI_VDEVICE(NVIDIA, 0x1DB8) }, /* GV100GL-A NVIDIA Tesla V100-SXM3-32GB */
+	{ PCI_VDEVICE(NVIDIA, 0x1DF5) }, /* GV100GL-B NVIDIA Tesla V100-SXM2-16GB */
+	{ 0, }
+};
+
+static struct pci_driver nvlink2gpu_vfio_pci_driver = {
+	.name			= "nvlink2gpu-vfio-pci",
+	.id_table		= nvlink2gpu_vfio_pci_table,
+	.probe			= nvlink2gpu_vfio_pci_probe,
+	.remove			= nvlink2gpu_vfio_pci_remove,
+#ifdef CONFIG_PCI_IOV
+	.sriov_configure	= vfio_pci_core_sriov_configure,
+#endif
+	.err_handler		= &vfio_pci_core_err_handlers,
+};
+
+#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
+struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
+{
+	if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
+		return &nvlink2gpu_vfio_pci_driver;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(get_nvlink2gpu_vfio_pci_driver);
+#endif
+
+module_pci_driver(nvlink2gpu_vfio_pci_driver);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/nvlink2gpu_vfio_pci.h b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
new file mode 100644
index 000000000000..ebd5b600b190
--- /dev/null
+++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights reserved.
+ *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
+ */
+
+#ifndef NVLINK2GPU_VFIO_PCI_H
+#define NVLINK2GPU_VFIO_PCI_H
+
+#include <linux/pci.h>
+#include <linux/module.h>
+
+#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
+#if defined(CONFIG_VFIO_PCI_NVLINK2GPU) || defined(CONFIG_VFIO_PCI_NVLINK2GPU_MODULE)
+struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev);
+#else
+struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
+{
+	return NULL;
+}
+#endif
+#endif
+
+#endif /* NVLINK2GPU_VFIO_PCI_H */
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index dbc0a6559914..8e81ea039f31 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -27,6 +27,10 @@
 #include <linux/uaccess.h>
 
 #include "vfio_pci_core.h"
+#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
+#include "npu2_vfio_pci.h"
+#include "nvlink2gpu_vfio_pci.h"
+#endif
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -142,14 +146,48 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.match		= vfio_pci_core_match,
 };
 
+/*
+ * This layer is used for backward compatibility. Hopefully it will be
+ * removed in the future.
+ */
+static struct pci_driver *vfio_pci_get_compat_driver(struct pci_dev *pdev)
+{
+	switch (pdev->vendor) {
+	case PCI_VENDOR_ID_NVIDIA:
+		switch (pdev->device) {
+		case 0x1db1:
+		case 0x1db5:
+		case 0x1db8:
+		case 0x1df5:
+			return get_nvlink2gpu_vfio_pci_driver(pdev);
+		default:
+			return NULL;
+		}
+	case PCI_VENDOR_ID_IBM:
+		switch (pdev->device) {
+		case 0x04ea:
+			return get_npu2_vfio_pci_driver(pdev);
+		default:
+			return NULL;
+		}
+	}
+
+	return NULL;
+}
+
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_device *vpdev;
+	struct pci_driver *driver;
 	int ret;
 
 	if (vfio_pci_is_denylisted(pdev))
 		return -EINVAL;
 
+	driver = vfio_pci_get_compat_driver(pdev);
+	if (driver)
+		return driver->probe(pdev, id);
+
 	vpdev = kzalloc(sizeof(*vpdev), GFP_KERNEL);
 	if (!vpdev)
 		return -ENOMEM;
@@ -167,14 +205,21 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 static void vfio_pci_remove(struct pci_dev *pdev)
 {
-	struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
-	struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
-	struct vfio_pci_device *vpdev;
-
-	vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
-
-	vfio_pci_core_unregister_device(core_vpdev);
-	kfree(vpdev);
+	struct pci_driver *driver;
+
+	driver = vfio_pci_get_compat_driver(pdev);
+	if (driver) {
+		driver->remove(pdev);
+	} else {
+		struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
+		struct vfio_pci_core_device *core_vpdev;
+		struct vfio_pci_device *vpdev;
+
+		core_vpdev = vfio_device_data(vdev);
+		vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
+		vfio_pci_core_unregister_device(core_vpdev);
+		kfree(vpdev);
+	}
 }
 
 static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 4de8e352df9c..f9b39abe54cb 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -354,24 +354,6 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 		}
 	}
 
-	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
-		ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
-		if (ret && ret != -ENODEV) {
-			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
-			goto disable_exit;
-		}
-	}
-
-	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
-		ret = vfio_pci_ibm_npu2_init(vdev);
-		if (ret && ret != -ENODEV) {
-			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
-			goto disable_exit;
-		}
-	}
-
 	return 0;
 
 disable_exit:
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index 8989443c3086..31f3836e606e 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -204,20 +204,6 @@ static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
 	return -ENODEV;
 }
 #endif
-#ifdef CONFIG_VFIO_PCI_NVLINK2
-extern int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev);
-extern int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev);
-#else
-static inline int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
-{
-	return -ENODEV;
-}
-
-static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
-{
-	return -ENODEV;
-}
-#endif
 
 #ifdef CONFIG_S390
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver
  2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
                   ` (7 preceding siblings ...)
  2021-03-09  8:33 ` [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers Max Gurtovoy
@ 2021-03-09  8:33 ` Max Gurtovoy
  2021-03-10  8:15   ` Christoph Hellwig
  8 siblings, 1 reply; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-09  8:33 UTC (permalink / raw)
  To: jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik, hch,
	Max Gurtovoy

Create a new driver, igd_vfio_pci.ko, that will be responsible for
providing special extensions for Intel graphics cards (GVT-d).

Also preserve backward compatibility with vfio_pci.ko vendor specific
extensions.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Kconfig                      |   5 +-
 drivers/vfio/pci/Makefile                     |   4 +-
 .../pci/{vfio_pci_igd.c => igd_vfio_pci.c}    | 147 +++++++++++++++++-
 drivers/vfio/pci/igd_vfio_pci.h               |  24 +++
 drivers/vfio/pci/vfio_pci.c                   |   4 +
 drivers/vfio/pci/vfio_pci_core.c              |  15 --
 drivers/vfio/pci/vfio_pci_core.h              |   9 --
 7 files changed, 176 insertions(+), 32 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_igd.c => igd_vfio_pci.c} (62%)
 create mode 100644 drivers/vfio/pci/igd_vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 88c89863a205..09d85ba3e5b2 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -37,17 +37,14 @@ config VFIO_PCI_INTX
 	def_bool y if !S390
 
 config VFIO_PCI_IGD
-	bool "VFIO PCI extensions for Intel graphics (GVT-d)"
+	tristate "VFIO PCI extensions for Intel graphics (GVT-d)"
 	depends on VFIO_PCI_CORE && X86
-	default y
 	help
 	  Support for Intel IGD specific extensions to enable direct
 	  assignment to virtual machines.  This includes exposing an IGD
 	  specific firmware table and read-only copies of the host bridge
 	  and LPC bridge config space.
 
-	  To enable Intel IGD assignment through vfio-pci, say Y.
-
 config VFIO_PCI_NVLINK2GPU
 	tristate "VFIO support for NVIDIA NVLINK2 GPUs"
 	depends on VFIO_PCI_CORE && PPC_POWERNV
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 86fb62e271fc..298b2fb3f075 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -4,9 +4,9 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
 obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_IGD) += igd-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
@@ -14,3 +14,5 @@ vfio-pci-y := vfio_pci.o
 npu2-vfio-pci-y := npu2_vfio_pci.o
 
 nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
+
+igd-vfio-pci-y := igd_vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/igd_vfio_pci.c
similarity index 62%
rename from drivers/vfio/pci/vfio_pci_igd.c
rename to drivers/vfio/pci/igd_vfio_pci.c
index 2388c9722ed8..bbbc432bca82 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/igd_vfio_pci.c
@@ -10,19 +10,32 @@
  * address is also virtualized to prevent user modification.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
 #include <linux/io.h>
 #include <linux/pci.h>
+#include <linux/list.h>
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 
 #include "vfio_pci_core.h"
+#include "igd_vfio_pci.h"
 
 #define OPREGION_SIGNATURE	"IntelGraphicsMem"
 #define OPREGION_SIZE		(8 * 1024)
 #define OPREGION_PCI_ADDR	0xfc
 
-static size_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev, char __user *buf,
-			      size_t count, loff_t *ppos, bool iswrite)
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC     "IGD VFIO PCI - User Level meta-driver for Intel Graphics Processing Unit"
+
+struct igd_vfio_pci_device {
+	struct vfio_pci_core_device	vdev;
+};
+
+static size_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev,
+		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
 	void *base = vdev->region[i].data;
@@ -261,7 +274,7 @@ static int vfio_pci_igd_cfg_init(struct vfio_pci_core_device *vdev)
 	return 0;
 }
 
-int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
+static int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
 {
 	int ret;
 
@@ -275,3 +288,131 @@ int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
 
 	return 0;
 }
+
+static void igd_vfio_pci_release(void *device_data)
+{
+	struct vfio_pci_core_device *vdev = device_data;
+
+	mutex_lock(&vdev->reflck->lock);
+	if (!(--vdev->refcnt)) {
+		vfio_pci_vf_token_user_add(vdev, -1);
+		vfio_pci_core_spapr_eeh_release(vdev);
+		vfio_pci_core_disable(vdev);
+	}
+	mutex_unlock(&vdev->reflck->lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int igd_vfio_pci_open(void *device_data)
+{
+	struct vfio_pci_core_device *vdev = device_data;
+	int ret = 0;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->reflck->lock);
+
+	if (!vdev->refcnt) {
+		ret = vfio_pci_core_enable(vdev);
+		if (ret)
+			goto error;
+
+		ret = vfio_pci_igd_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(vdev->pdev, "Failed to setup Intel IGD regions\n");
+			vfio_pci_core_disable(vdev);
+			goto error;
+		}
+		ret = 0;
+		vfio_pci_probe_mmaps(vdev);
+		vfio_pci_core_spapr_eeh_open(vdev);
+		vfio_pci_vf_token_user_add(vdev, 1);
+	}
+	vdev->refcnt++;
+error:
+	mutex_unlock(&vdev->reflck->lock);
+	if (ret)
+		module_put(THIS_MODULE);
+	return ret;
+}
+
+static const struct vfio_device_ops igd_vfio_pci_ops = {
+	.name		= "igd-vfio-pci",
+	.open		= igd_vfio_pci_open,
+	.release	= igd_vfio_pci_release,
+	.ioctl		= vfio_pci_core_ioctl,
+	.read		= vfio_pci_core_read,
+	.write		= vfio_pci_core_write,
+	.mmap		= vfio_pci_core_mmap,
+	.request	= vfio_pci_core_request,
+	.match		= vfio_pci_core_match,
+};
+
+static int igd_vfio_pci_probe(struct pci_dev *pdev,
+		const struct pci_device_id *id)
+{
+	struct igd_vfio_pci_device *igvdev;
+	int ret;
+
+	igvdev = kzalloc(sizeof(*igvdev), GFP_KERNEL);
+	if (!igvdev)
+		return -ENOMEM;
+
+	ret = vfio_pci_core_register_device(&igvdev->vdev, pdev,
+			&igd_vfio_pci_ops);
+	if (ret)
+		goto out_free;
+
+	return 0;
+
+out_free:
+	kfree(igvdev);
+	return ret;
+}
+
+static void igd_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
+	struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
+	struct igd_vfio_pci_device *igvdev;
+
+	igvdev = container_of(core_vpdev, struct igd_vfio_pci_device, vdev);
+
+	vfio_pci_core_unregister_device(core_vpdev);
+	kfree(igvdev);
+}
+
+static const struct pci_device_id igd_vfio_pci_table[] = {
+	{ PCI_VENDOR_ID_INTEL, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, PCI_CLASS_DISPLAY_VGA << 8, 0xff0000, 0 },
+	{ 0, }
+};
+
+static struct pci_driver igd_vfio_pci_driver = {
+	.name			= "igd-vfio-pci",
+	.id_table		= igd_vfio_pci_table,
+	.probe			= igd_vfio_pci_probe,
+	.remove			= igd_vfio_pci_remove,
+#ifdef CONFIG_PCI_IOV
+	.sriov_configure	= vfio_pci_core_sriov_configure,
+#endif
+	.err_handler		= &vfio_pci_core_err_handlers,
+};
+
+#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
+struct pci_driver *get_igd_vfio_pci_driver(struct pci_dev *pdev)
+{
+	if (pci_match_id(igd_vfio_pci_driver.id_table, pdev))
+		return &igd_vfio_pci_driver;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(get_igd_vfio_pci_driver);
+#endif
+
+module_pci_driver(igd_vfio_pci_driver);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/igd_vfio_pci.h b/drivers/vfio/pci/igd_vfio_pci.h
new file mode 100644
index 000000000000..859aeca354cb
--- /dev/null
+++ b/drivers/vfio/pci/igd_vfio_pci.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights reserved.
+ *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
+ */
+
+#ifndef IGD_VFIO_PCI_H
+#define IGD_VFIO_PCI_H
+
+#include <linux/pci.h>
+#include <linux/module.h>
+
+#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
+#if defined(CONFIG_VFIO_PCI_IGD) || defined(CONFIG_VFIO_PCI_IGD_MODULE)
+struct pci_driver *get_igd_vfio_pci_driver(struct pci_dev *pdev);
+#else
+struct pci_driver *get_igd_vfio_pci_driver(struct pci_dev *pdev)
+{
+	return NULL;
+}
+#endif
+#endif
+
+#endif /* IGD_VFIO_PCI_H */
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 8e81ea039f31..1c2f6d55a243 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -30,6 +30,7 @@
 #ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
 #include "npu2_vfio_pci.h"
 #include "nvlink2gpu_vfio_pci.h"
+#include "igd_vfio_pci.h"
 #endif
 
 #define DRIVER_VERSION  "0.2"
@@ -170,6 +171,9 @@ static struct pci_driver *vfio_pci_get_compat_driver(struct pci_dev *pdev)
 		default:
 			return NULL;
 		}
+	case PCI_VENDOR_ID_INTEL:
+		if (pdev->class == PCI_CLASS_DISPLAY_VGA << 8)
+			return get_igd_vfio_pci_driver(pdev);
 	}
 
 	return NULL;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f9b39abe54cb..59c9d0d56a0b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -343,22 +343,7 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
 		vdev->has_vga = true;
 
-
-	if (vfio_pci_is_vga(pdev) &&
-	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
-		ret = vfio_pci_igd_init(vdev);
-		if (ret && ret != -ENODEV) {
-			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
-			goto disable_exit;
-		}
-	}
-
 	return 0;
-
-disable_exit:
-	vfio_pci_disable(vdev);
-	return ret;
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_enable);
 
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index 31f3836e606e..2b5ea0db9284 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -196,15 +196,6 @@ extern u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev);
 extern void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev,
 					       u16 cmd);
 
-#ifdef CONFIG_VFIO_PCI_IGD
-extern int vfio_pci_igd_init(struct vfio_pci_core_device *vdev);
-#else
-static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
-{
-	return -ENODEV;
-}
-#endif
-
 #ifdef CONFIG_S390
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
 				       struct vfio_info_cap *caps);
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-09  8:33 ` [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers Max Gurtovoy
@ 2021-03-10  6:39   ` Alexey Kardashevskiy
  2021-03-10 12:57     ` Max Gurtovoy
  0 siblings, 1 reply; 53+ messages in thread
From: Alexey Kardashevskiy @ 2021-03-10  6:39 UTC (permalink / raw)
  To: Max Gurtovoy, jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch



On 09/03/2021 19:33, Max Gurtovoy wrote:
> The new drivers introduced are nvlink2gpu_vfio_pci.ko and
> npu2_vfio_pci.ko.
> The first will be responsible for providing special extensions for
> NVIDIA GPUs with NVLINK2 support for P9 platform (and others in the
> future). The last will be responsible for POWER9 NPU2 unit (NVLink2 host
> bus adapter).
> 
> Also, preserve backward compatibility for users that were binding
> NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will
> be dropped in the future
> 
> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> ---
>   drivers/vfio/pci/Kconfig                      |  28 +++-
>   drivers/vfio/pci/Makefile                     |   7 +-
>   .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 ++++++++++++++++-
>   drivers/vfio/pci/npu2_vfio_pci.h              |  24 +++
>   ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +++++++++++++++++-
>   drivers/vfio/pci/nvlink2gpu_vfio_pci.h        |  24 +++
>   drivers/vfio/pci/vfio_pci.c                   |  61 ++++++-
>   drivers/vfio/pci/vfio_pci_core.c              |  18 ---
>   drivers/vfio/pci/vfio_pci_core.h              |  14 --
>   9 files changed, 422 insertions(+), 47 deletions(-)
>   rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
>   create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
>   rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} (67%)
>   create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 829e90a2e5a3..88c89863a205 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -48,8 +48,30 @@ config VFIO_PCI_IGD
>   
>   	  To enable Intel IGD assignment through vfio-pci, say Y.
>   
> -config VFIO_PCI_NVLINK2
> -	def_bool y
> +config VFIO_PCI_NVLINK2GPU
> +	tristate "VFIO support for NVIDIA NVLINK2 GPUs"
>   	depends on VFIO_PCI_CORE && PPC_POWERNV
>   	help
> -	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
> +	  VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions
> +	  for P9 Witherspoon machine.
> +
> +config VFIO_PCI_NPU2
> +	tristate "VFIO support for IBM NPU host bus adapter on P9"
> +	depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
> +	help
> +	  VFIO PCI specific extensions for IBM NVLink2 host bus adapter on P9
> +	  Witherspoon machine.
> +
> +config VFIO_PCI_DRIVER_COMPAT
> +	bool "VFIO PCI backward compatibility for vendor specific extensions"
> +	default y
> +	depends on VFIO_PCI
> +	help
> +	  Say Y here if you want to preserve VFIO PCI backward
> +	  compatibility. vfio_pci.ko will continue to automatically use
> +	  the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
> +	  a compatible device.
> +
> +	  When N is selected the user must bind explicity to the module
> +	  they want to handle the device and vfio_pci.ko will have no
> +	  device specific special behaviors.
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index f539f32c9296..86fb62e271fc 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -2,10 +2,15 @@
>   
>   obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>   obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> +obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
> +obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
>   
>   vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>   vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> -vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o
>   vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
>   
>   vfio-pci-y := vfio_pci.o
> +
> +npu2-vfio-pci-y := npu2_vfio_pci.o
> +
> +nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/npu2_vfio_pci.c
> similarity index 64%
> rename from drivers/vfio/pci/vfio_pci_npu2.c
> rename to drivers/vfio/pci/npu2_vfio_pci.c
> index 717745256ab3..7071bda0f2b6 100644
> --- a/drivers/vfio/pci/vfio_pci_npu2.c
> +++ b/drivers/vfio/pci/npu2_vfio_pci.c
> @@ -14,19 +14,28 @@
>    *	Author: Alex Williamson <alex.williamson@redhat.com>
>    */
>   
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
>   #include <linux/io.h>
>   #include <linux/pci.h>
>   #include <linux/uaccess.h>
>   #include <linux/vfio.h>
> +#include <linux/list.h>
>   #include <linux/sched/mm.h>
>   #include <linux/mmu_context.h>
>   #include <asm/kvm_ppc.h>
>   
>   #include "vfio_pci_core.h"
> +#include "npu2_vfio_pci.h"
>   
>   #define CREATE_TRACE_POINTS
>   #include "npu2_trace.h"
>   
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
> +#define DRIVER_DESC     "NPU2 VFIO PCI - User Level meta-driver for POWER9 NPU NVLink2 HBA"
> +
>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
>   
>   struct vfio_pci_npu2_data {
> @@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
>   	unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */
>   };
>   
> +struct npu2_vfio_pci_device {
> +	struct vfio_pci_core_device	vdev;
> +};
> +
>   static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev,
>   		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>   {
> @@ -120,7 +133,7 @@ static const struct vfio_pci_regops vfio_pci_npu2_regops = {
>   	.add_capability = vfio_pci_npu2_add_capability,
>   };
>   
> -int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
> +static int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
>   {
>   	int ret;
>   	struct vfio_pci_npu2_data *data;
> @@ -220,3 +233,132 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
>   
>   	return ret;
>   }
> +
> +static void npu2_vfio_pci_release(void *device_data)
> +{
> +	struct vfio_pci_core_device *vdev = device_data;
> +
> +	mutex_lock(&vdev->reflck->lock);
> +	if (!(--vdev->refcnt)) {
> +		vfio_pci_vf_token_user_add(vdev, -1);
> +		vfio_pci_core_spapr_eeh_release(vdev);
> +		vfio_pci_core_disable(vdev);
> +	}
> +	mutex_unlock(&vdev->reflck->lock);
> +
> +	module_put(THIS_MODULE);
> +}
> +
> +static int npu2_vfio_pci_open(void *device_data)
> +{
> +	struct vfio_pci_core_device *vdev = device_data;
> +	int ret = 0;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	mutex_lock(&vdev->reflck->lock);
> +
> +	if (!vdev->refcnt) {
> +		ret = vfio_pci_core_enable(vdev);
> +		if (ret)
> +			goto error;
> +
> +		ret = vfio_pci_ibm_npu2_init(vdev);
> +		if (ret && ret != -ENODEV) {
> +			pci_warn(vdev->pdev,
> +				 "Failed to setup NVIDIA NV2 ATSD region\n");
> +			vfio_pci_core_disable(vdev);
> +			goto error;
> +		}
> +		ret = 0;
> +		vfio_pci_probe_mmaps(vdev);
> +		vfio_pci_core_spapr_eeh_open(vdev);
> +		vfio_pci_vf_token_user_add(vdev, 1);
> +	}
> +	vdev->refcnt++;
> +error:
> +	mutex_unlock(&vdev->reflck->lock);
> +	if (ret)
> +		module_put(THIS_MODULE);
> +	return ret;
> +}
> +
> +static const struct vfio_device_ops npu2_vfio_pci_ops = {
> +	.name		= "npu2-vfio-pci",
> +	.open		= npu2_vfio_pci_open,
> +	.release	= npu2_vfio_pci_release,
> +	.ioctl		= vfio_pci_core_ioctl,
> +	.read		= vfio_pci_core_read,
> +	.write		= vfio_pci_core_write,
> +	.mmap		= vfio_pci_core_mmap,
> +	.request	= vfio_pci_core_request,
> +	.match		= vfio_pci_core_match,
> +};
> +
> +static int npu2_vfio_pci_probe(struct pci_dev *pdev,
> +		const struct pci_device_id *id)
> +{
> +	struct npu2_vfio_pci_device *npvdev;
> +	int ret;
> +
> +	npvdev = kzalloc(sizeof(*npvdev), GFP_KERNEL);
> +	if (!npvdev)
> +		return -ENOMEM;
> +
> +	ret = vfio_pci_core_register_device(&npvdev->vdev, pdev,
> +			&npu2_vfio_pci_ops);
> +	if (ret)
> +		goto out_free;
> +
> +	return 0;
> +
> +out_free:
> +	kfree(npvdev);
> +	return ret;
> +}
> +
> +static void npu2_vfio_pci_remove(struct pci_dev *pdev)
> +{
> +	struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
> +	struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
> +	struct npu2_vfio_pci_device *npvdev;
> +
> +	npvdev = container_of(core_vpdev, struct npu2_vfio_pci_device, vdev);
> +
> +	vfio_pci_core_unregister_device(core_vpdev);
> +	kfree(npvdev);
> +}
> +
> +static const struct pci_device_id npu2_vfio_pci_table[] = {
> +	{ PCI_VDEVICE(IBM, 0x04ea) },
> +	{ 0, }
> +};
> +
> +static struct pci_driver npu2_vfio_pci_driver = {
> +	.name			= "npu2-vfio-pci",
> +	.id_table		= npu2_vfio_pci_table,
> +	.probe			= npu2_vfio_pci_probe,
> +	.remove			= npu2_vfio_pci_remove,
> +#ifdef CONFIG_PCI_IOV
> +	.sriov_configure	= vfio_pci_core_sriov_configure,
> +#endif
> +	.err_handler		= &vfio_pci_core_err_handlers,
> +};
> +
> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
> +{
> +	if (pci_match_id(npu2_vfio_pci_driver.id_table, pdev))
> +		return &npu2_vfio_pci_driver;
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(get_npu2_vfio_pci_driver);
> +#endif
> +
> +module_pci_driver(npu2_vfio_pci_driver);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/pci/npu2_vfio_pci.h b/drivers/vfio/pci/npu2_vfio_pci.h
> new file mode 100644
> index 000000000000..92010d340346
> --- /dev/null
> +++ b/drivers/vfio/pci/npu2_vfio_pci.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights reserved.
> + *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
> + */
> +
> +#ifndef NPU2_VFIO_PCI_H
> +#define NPU2_VFIO_PCI_H
> +
> +#include <linux/pci.h>
> +#include <linux/module.h>
> +
> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
> +#if defined(CONFIG_VFIO_PCI_NPU2) || defined(CONFIG_VFIO_PCI_NPU2_MODULE)
> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev);
> +#else
> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
> +{
> +	return NULL;
> +}
> +#endif
> +#endif
> +
> +#endif /* NPU2_VFIO_PCI_H */
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
> similarity index 67%
> rename from drivers/vfio/pci/vfio_pci_nvlink2gpu.c
> rename to drivers/vfio/pci/nvlink2gpu_vfio_pci.c
> index 6dce1e78ee82..84a5ac1ce8ac 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c
> +++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
> @@ -1,6 +1,6 @@
>   // SPDX-License-Identifier: GPL-2.0-only
>   /*
> - * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
> + * VFIO PCI NVIDIA NVLink2 GPUs support.
>    *
>    * Copyright (C) 2018 IBM Corp.  All rights reserved.
>    *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
> @@ -12,6 +12,9 @@
>    *	Author: Alex Williamson <alex.williamson@redhat.com>
>    */
>   
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
>   #include <linux/io.h>
>   #include <linux/pci.h>
>   #include <linux/uaccess.h>
> @@ -21,10 +24,15 @@
>   #include <asm/kvm_ppc.h>
>   
>   #include "vfio_pci_core.h"
> +#include "nvlink2gpu_vfio_pci.h"
>   
>   #define CREATE_TRACE_POINTS
>   #include "nvlink2gpu_trace.h"
>   
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
> +#define DRIVER_DESC     "NVLINK2GPU VFIO PCI - User Level meta-driver for NVIDIA NVLink2 GPUs"
> +
>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault);
>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap);
>   
> @@ -39,6 +47,10 @@ struct vfio_pci_nvgpu_data {
>   	struct notifier_block group_notifier;
>   };
>   
> +struct nv_vfio_pci_device {
> +	struct vfio_pci_core_device	vdev;
> +};
> +
>   static size_t vfio_pci_nvgpu_rw(struct vfio_pci_core_device *vdev,
>   		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>   {
> @@ -207,7 +219,8 @@ static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
>   	return NOTIFY_OK;
>   }
>   
> -int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
> +static int
> +vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
>   {
>   	int ret;
>   	u64 reg[2];
> @@ -293,3 +306,135 @@ int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
>   
>   	return ret;
>   }
> +
> +static void nvlink2gpu_vfio_pci_release(void *device_data)
> +{
> +	struct vfio_pci_core_device *vdev = device_data;
> +
> +	mutex_lock(&vdev->reflck->lock);
> +	if (!(--vdev->refcnt)) {
> +		vfio_pci_vf_token_user_add(vdev, -1);
> +		vfio_pci_core_spapr_eeh_release(vdev);
> +		vfio_pci_core_disable(vdev);
> +	}
> +	mutex_unlock(&vdev->reflck->lock);
> +
> +	module_put(THIS_MODULE);
> +}
> +
> +static int nvlink2gpu_vfio_pci_open(void *device_data)
> +{
> +	struct vfio_pci_core_device *vdev = device_data;
> +	int ret = 0;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	mutex_lock(&vdev->reflck->lock);
> +
> +	if (!vdev->refcnt) {
> +		ret = vfio_pci_core_enable(vdev);
> +		if (ret)
> +			goto error;
> +
> +		ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
> +		if (ret && ret != -ENODEV) {
> +			pci_warn(vdev->pdev,
> +				 "Failed to setup NVIDIA NV2 RAM region\n");
> +			vfio_pci_core_disable(vdev);
> +			goto error;
> +		}
> +		ret = 0;
> +		vfio_pci_probe_mmaps(vdev);
> +		vfio_pci_core_spapr_eeh_open(vdev);
> +		vfio_pci_vf_token_user_add(vdev, 1);
> +	}
> +	vdev->refcnt++;
> +error:
> +	mutex_unlock(&vdev->reflck->lock);
> +	if (ret)
> +		module_put(THIS_MODULE);
> +	return ret;
> +}
> +
> +static const struct vfio_device_ops nvlink2gpu_vfio_pci_ops = {
> +	.name		= "nvlink2gpu-vfio-pci",
> +	.open		= nvlink2gpu_vfio_pci_open,
> +	.release	= nvlink2gpu_vfio_pci_release,
> +	.ioctl		= vfio_pci_core_ioctl,
> +	.read		= vfio_pci_core_read,
> +	.write		= vfio_pci_core_write,
> +	.mmap		= vfio_pci_core_mmap,
> +	.request	= vfio_pci_core_request,
> +	.match		= vfio_pci_core_match,
> +};
> +
> +static int nvlink2gpu_vfio_pci_probe(struct pci_dev *pdev,
> +		const struct pci_device_id *id)
> +{
> +	struct nv_vfio_pci_device *nvdev;
> +	int ret;
> +
> +	nvdev = kzalloc(sizeof(*nvdev), GFP_KERNEL);
> +	if (!nvdev)
> +		return -ENOMEM;
> +
> +	ret = vfio_pci_core_register_device(&nvdev->vdev, pdev,
> +			&nvlink2gpu_vfio_pci_ops);
> +	if (ret)
> +		goto out_free;
> +
> +	return 0;
> +
> +out_free:
> +	kfree(nvdev);
> +	return ret;
> +}
> +
> +static void nvlink2gpu_vfio_pci_remove(struct pci_dev *pdev)
> +{
> +	struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
> +	struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
> +	struct nv_vfio_pci_device *nvdev;
> +
> +	nvdev = container_of(core_vpdev, struct nv_vfio_pci_device, vdev);
> +
> +	vfio_pci_core_unregister_device(core_vpdev);
> +	kfree(nvdev);
> +}
> +
> +static const struct pci_device_id nvlink2gpu_vfio_pci_table[] = {
> +	{ PCI_VDEVICE(NVIDIA, 0x1DB1) }, /* GV100GL-A NVIDIA Tesla V100-SXM2-16GB */
> +	{ PCI_VDEVICE(NVIDIA, 0x1DB5) }, /* GV100GL-A NVIDIA Tesla V100-SXM2-32GB */
> +	{ PCI_VDEVICE(NVIDIA, 0x1DB8) }, /* GV100GL-A NVIDIA Tesla V100-SXM3-32GB */
> +	{ PCI_VDEVICE(NVIDIA, 0x1DF5) }, /* GV100GL-B NVIDIA Tesla V100-SXM2-16GB */


Where is this list from?

Also, how is this supposed to work at boot time? Will the kernel try
binding, let's say, this one and nouveau? Which one is going to win?


> +	{ 0, }


Why a comma?

> +};



> +
> +static struct pci_driver nvlink2gpu_vfio_pci_driver = {
> +	.name			= "nvlink2gpu-vfio-pci",
> +	.id_table		= nvlink2gpu_vfio_pci_table,
> +	.probe			= nvlink2gpu_vfio_pci_probe,
> +	.remove			= nvlink2gpu_vfio_pci_remove,
> +#ifdef CONFIG_PCI_IOV
> +	.sriov_configure	= vfio_pci_core_sriov_configure,
> +#endif


What is this IOV business about?


> +	.err_handler		= &vfio_pci_core_err_handlers,
> +};
> +
> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
> +{
> +	if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
> +		return &nvlink2gpu_vfio_pci_driver;


Why do we need matching PCI ids here instead of looking at the FDT which 
will work better?


> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(get_nvlink2gpu_vfio_pci_driver);
> +#endif
> +
> +module_pci_driver(nvlink2gpu_vfio_pci_driver);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/pci/nvlink2gpu_vfio_pci.h b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
> new file mode 100644
> index 000000000000..ebd5b600b190
> --- /dev/null
> +++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights reserved.
> + *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
> + */
> +
> +#ifndef NVLINK2GPU_VFIO_PCI_H
> +#define NVLINK2GPU_VFIO_PCI_H
> +
> +#include <linux/pci.h>
> +#include <linux/module.h>
> +
> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
> +#if defined(CONFIG_VFIO_PCI_NVLINK2GPU) || defined(CONFIG_VFIO_PCI_NVLINK2GPU_MODULE)
> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev);
> +#else
> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
> +{
> +	return NULL;
> +}
> +#endif
> +#endif
> +
> +#endif /* NVLINK2GPU_VFIO_PCI_H */
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index dbc0a6559914..8e81ea039f31 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -27,6 +27,10 @@
>   #include <linux/uaccess.h>
>   
>   #include "vfio_pci_core.h"
> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
> +#include "npu2_vfio_pci.h"
> +#include "nvlink2gpu_vfio_pci.h"
> +#endif
>   
>   #define DRIVER_VERSION  "0.2"
>   #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> @@ -142,14 +146,48 @@ static const struct vfio_device_ops vfio_pci_ops = {
>   	.match		= vfio_pci_core_match,
>   };
>   
> +/*
> + * This layer is used for backward compatibility. Hopefully it will be
> + * removed in the future.
> + */
> +static struct pci_driver *vfio_pci_get_compat_driver(struct pci_dev *pdev)
> +{
> +	switch (pdev->vendor) {
> +	case PCI_VENDOR_ID_NVIDIA:
> +		switch (pdev->device) {
> +		case 0x1db1:
> +		case 0x1db5:
> +		case 0x1db8:
> +		case 0x1df5:
> +			return get_nvlink2gpu_vfio_pci_driver(pdev);

This does not really need a switch; it could simply call these
get_xxxx_vfio_pci_driver() helpers. Thanks,


> +		default:
> +			return NULL;
> +		}
> +	case PCI_VENDOR_ID_IBM:
> +		switch (pdev->device) {
> +		case 0x04ea:
> +			return get_npu2_vfio_pci_driver(pdev);
> +		default:
> +			return NULL;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
>   static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   {
>   	struct vfio_pci_device *vpdev;
> +	struct pci_driver *driver;
>   	int ret;
>   
>   	if (vfio_pci_is_denylisted(pdev))
>   		return -EINVAL;
>   
> +	driver = vfio_pci_get_compat_driver(pdev);
> +	if (driver)
> +		return driver->probe(pdev, id);
> +
>   	vpdev = kzalloc(sizeof(*vpdev), GFP_KERNEL);
>   	if (!vpdev)
>   		return -ENOMEM;
> @@ -167,14 +205,21 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>   
>   static void vfio_pci_remove(struct pci_dev *pdev)
>   {
> -	struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
> -	struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
> -	struct vfio_pci_device *vpdev;
> -
> -	vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
> -
> -	vfio_pci_core_unregister_device(core_vpdev);
> -	kfree(vpdev);
> +	struct pci_driver *driver;
> +
> +	driver = vfio_pci_get_compat_driver(pdev);
> +	if (driver) {
> +		driver->remove(pdev);
> +	} else {
> +		struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
> +		struct vfio_pci_core_device *core_vpdev;
> +		struct vfio_pci_device *vpdev;
> +
> +		core_vpdev = vfio_device_data(vdev);
> +		vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
> +		vfio_pci_core_unregister_device(core_vpdev);
> +		kfree(vpdev);
> +	}
>   }
>   
>   static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 4de8e352df9c..f9b39abe54cb 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -354,24 +354,6 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>   		}
>   	}
>   
> -	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> -	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> -		ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
> -		if (ret && ret != -ENODEV) {
> -			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
> -			goto disable_exit;
> -		}
> -	}
> -
> -	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
> -	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> -		ret = vfio_pci_ibm_npu2_init(vdev);
> -		if (ret && ret != -ENODEV) {
> -			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
> -			goto disable_exit;
> -		}
> -	}
> -
>   	return 0;
>   
>   disable_exit:
> diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
> index 8989443c3086..31f3836e606e 100644
> --- a/drivers/vfio/pci/vfio_pci_core.h
> +++ b/drivers/vfio/pci/vfio_pci_core.h
> @@ -204,20 +204,6 @@ static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
>   	return -ENODEV;
>   }
>   #endif
> -#ifdef CONFIG_VFIO_PCI_NVLINK2
> -extern int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev);
> -extern int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev);
> -#else
> -static inline int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
> -{
> -	return -ENODEV;
> -}
> -
> -static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
> -{
> -	return -ENODEV;
> -}
> -#endif
>   
>   #ifdef CONFIG_S390
>   extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 7/9] vfio/pci_core: split nvlink2 to nvlink2gpu and npu2
  2021-03-09  8:33 ` [PATCH 7/9] vfio/pci_core: split nvlink2 to nvlink2gpu and npu2 Max Gurtovoy
@ 2021-03-10  8:08   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2021-03-10  8:08 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: jgg, alex.williamson, cohuck, kvm, linux-kernel, liranl, oren,
	tzahio, leonro, yarong, aviadye, shahafs, artemp, kwankhede,
	ACurrid, cjia, yishaih, mjrosato, aik, hch

On Tue, Mar 09, 2021 at 08:33:55AM +0000, Max Gurtovoy wrote:
> This is a preparation for moving vendor specific code from
> vfio_pci_core to vendor specific vfio_pci drivers. The next step will be
> creating a dedicated module for NVIDIA NVLINK2 devices with P9 extensions
> and a dedicated module for Power9 NPU NVLink2 HBAs.

As said before - this driver has always failed the "has an open source
user space" requirement (which in this case could also be kernelspace in
a VM) and should just be removed entirely, including all the cruft for
it in arch/powerpc.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver
  2021-03-09  8:33 ` [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver Max Gurtovoy
@ 2021-03-10  8:15   ` Christoph Hellwig
  2021-03-10 12:31     ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2021-03-10  8:15 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: jgg, alex.williamson, cohuck, kvm, linux-kernel, liranl, oren,
	tzahio, leonro, yarong, aviadye, shahafs, artemp, kwankhede,
	ACurrid, cjia, yishaih, mjrosato, aik, hch

The terminology is all weird here.  You don't export functionality,
you move it.  And this is not a "vendor" driver, but just a
device-specific one.

> +struct igd_vfio_pci_device {
> +	struct vfio_pci_core_device	vdev;
> +};

Why do you need this separate structure?  You could just use
vfio_pci_core_device directly.
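
For illustration, an (untested) probe without the wrapper, reusing the
helpers this series already adds, could be as simple as:

static int igd_vfio_pci_probe(struct pci_dev *pdev,
		const struct pci_device_id *id)
{
	struct vfio_pci_core_device *vdev;
	int ret;

	/* no driver-private state, so allocate the core device directly */
	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
	if (!vdev)
		return -ENOMEM;

	ret = vfio_pci_core_register_device(vdev, pdev, &igd_vfio_pci_ops);
	if (ret)
		kfree(vdev);
	return ret;
}

If you expect driver-private state to show up later, keeping the wrapper
may be fine, but then please say so in the commit message.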

> +static void igd_vfio_pci_release(void *device_data)
> +{
> +	struct vfio_pci_core_device *vdev = device_data;
> +
> +	mutex_lock(&vdev->reflck->lock);
> +	if (!(--vdev->refcnt)) {

No need for the braces here.

> +		vfio_pci_vf_token_user_add(vdev, -1);
> +		vfio_pci_core_spapr_eeh_release(vdev);
> +		vfio_pci_core_disable(vdev);
> +	}
> +	mutex_unlock(&vdev->reflck->lock);

But more importantly all this code should be in a helper exported
from the core code.
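
E.g. an (untested) helper like the one below in vfio_pci_core, so that
every driver's release() collapses to a single call; it is literally the
code above moved into the core:

void vfio_pci_core_release(struct vfio_pci_core_device *vdev)
{
	mutex_lock(&vdev->reflck->lock);
	if (!--vdev->refcnt) {
		vfio_pci_vf_token_user_add(vdev, -1);
		vfio_pci_core_spapr_eeh_release(vdev);
		vfio_pci_core_disable(vdev);
	}
	mutex_unlock(&vdev->reflck->lock);
}
EXPORT_SYMBOL_GPL(vfio_pci_core_release);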

> +
> +	module_put(THIS_MODULE);

This looks bogus - the module reference is now gone while
igd_vfio_pci_release is still running.  Module refcounting always
needs to be done by the caller, not the individual driver.
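
Roughly (the names are made up, the ops_owner field does not exist
today), the vfio core could record the owning module at registration
time and do the get/put around the driver callbacks itself:

static int vfio_device_open(struct vfio_device *device)
{
	int ret;

	if (!try_module_get(device->ops_owner))	/* hypothetical field */
		return -ENODEV;

	ret = device->ops->open(device->device_data);
	if (ret)
		module_put(device->ops_owner);
	return ret;
}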

> +static int igd_vfio_pci_open(void *device_data)
> +{
> +	struct vfio_pci_core_device *vdev = device_data;
> +	int ret = 0;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;

Same here - this is something the caller needs to do.

> +	mutex_lock(&vdev->reflck->lock);
> +
> +	if (!vdev->refcnt) {
> +		ret = vfio_pci_core_enable(vdev);
> +		if (ret)
> +			goto error;
> +
> +		ret = vfio_pci_igd_init(vdev);
> +		if (ret && ret != -ENODEV) {
> +			pci_warn(vdev->pdev, "Failed to setup Intel IGD regions\n");
> +			vfio_pci_core_disable(vdev);
> +			goto error;
> +		}
> +		ret = 0;
> +		vfio_pci_probe_mmaps(vdev);
> +		vfio_pci_core_spapr_eeh_open(vdev);
> +		vfio_pci_vf_token_user_add(vdev, 1);

Way too much boilerplate.  Why doesn't the core just call a function
that does the vfio_pci_igd_init?
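
I.e. something like the below (untested; the device_init hook is made
up) in the core, so that the IGD module only has to supply
vfio_pci_igd_init and none of this boilerplate:

static int vfio_pci_core_open_device(struct vfio_pci_core_device *vdev)
{
	int ret;

	ret = vfio_pci_core_enable(vdev);
	if (ret)
		return ret;

	if (vdev->device_init) {		/* hypothetical per-device hook */
		ret = vdev->device_init(vdev);
		if (ret && ret != -ENODEV) {
			vfio_pci_core_disable(vdev);
			return ret;
		}
	}

	vfio_pci_probe_mmaps(vdev);
	vfio_pci_core_spapr_eeh_open(vdev);
	vfio_pci_vf_token_user_add(vdev, 1);
	return 0;
}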

> +static const struct pci_device_id igd_vfio_pci_table[] = {
> +	{ PCI_VENDOR_ID_INTEL, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, PCI_CLASS_DISPLAY_VGA << 8, 0xff0000, 0 },

Please avoid the overly long line.  And a match as big as any Intel
graphics at the very least needs a comment documenting why this is safe
and will perpetually remain safe.
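
Just to illustrate the formatting (the justification in the comment is
yours to write):

static const struct pci_device_id igd_vfio_pci_table[] = {
	{
		/*
		 * Any Intel VGA-class display controller.  Explain here
		 * why such a broad match is safe and will stay safe.
		 */
		.vendor		= PCI_VENDOR_ID_INTEL,
		.device		= PCI_ANY_ID,
		.subvendor	= PCI_ANY_ID,
		.subdevice	= PCI_ANY_ID,
		.class		= PCI_CLASS_DISPLAY_VGA << 8,
		.class_mask	= 0xff0000,
	},
	{ 0 }
};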

> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
> +struct pci_driver *get_igd_vfio_pci_driver(struct pci_dev *pdev)
> +{
> +	if (pci_match_id(igd_vfio_pci_driver.id_table, pdev))
> +		return &igd_vfio_pci_driver;
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(get_igd_vfio_pci_driver);
> +#endif

> +	case PCI_VENDOR_ID_INTEL:
> +		if (pdev->class == PCI_CLASS_DISPLAY_VGA << 8)
> +			return get_igd_vfio_pci_driver(pdev);

And this now means that the core code has a dependency on the igd
one, making the whole split rather pointless.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver
  2021-03-10  8:15   ` Christoph Hellwig
@ 2021-03-10 12:31     ` Jason Gunthorpe
  2021-03-11 11:37       ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-10 12:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Max Gurtovoy, alex.williamson, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik

On Wed, Mar 10, 2021 at 09:15:08AM +0100, Christoph Hellwig wrote:
> > +		vfio_pci_vf_token_user_add(vdev, -1);
> > +		vfio_pci_core_spapr_eeh_release(vdev);
> > +		vfio_pci_core_disable(vdev);
> > +	}
> > +	mutex_unlock(&vdev->reflck->lock);
> 
> But more importantly all this code should be in a helper exported
> from the core code.

Yes, that needs more refactoring. I'm viewing this series as a
"statement of intent" and once we commit to doing this we can go
through the bigger effort to split up vfio_pci_core and tidy its API.

Obviously this is a big project; given the past comments I don't want
to spend more effort here until we see a community consensus emerge
that this is what we want to do. If we build a sub-driver instead, the
work is all in the trash bin.

> > +	module_put(THIS_MODULE);
> 
> This looks bogus - the module reference is now gone while
> igd_vfio_pci_release is still running.  Module refcounting always
> needs to be done by the caller, not the individual driver.

Yes, this module handling in vfio should be in the core code linked to
the core fops driven by the vfio_driver_ops.

It is on my list to add to my other series; this isn't the only place
in VFIO where drivers are doing this.

> > +	case PCI_VENDOR_ID_INTEL:
> > +		if (pdev->class == PCI_CLASS_DISPLAY_VGA << 8)
> > +			return get_igd_vfio_pci_driver(pdev);
> 
> And this now means that the core code has a dependency on the igd
> one, making the whole split rather pointless.

Yes, if CONFIG_VFIO_PCI_DRIVER_COMPAT is set - this is a uAPI so we
don't want to see it broken. The intention is to organize the software
layers differently, not to decouple the two modules.

Future things, like the mlx5 driver that is coming, will not use this
COMPAT path at all.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-10  6:39   ` Alexey Kardashevskiy
@ 2021-03-10 12:57     ` Max Gurtovoy
  2021-03-10 13:02       ` Jason Gunthorpe
                         ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-10 12:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy, jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch


On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote:
>
>
> On 09/03/2021 19:33, Max Gurtovoy wrote:
>> The new drivers introduced are nvlink2gpu_vfio_pci.ko and
>> npu2_vfio_pci.ko.
>> The first will be responsible for providing special extensions for
>> NVIDIA GPUs with NVLINK2 support on the P9 platform (and others in the
>> future). The second will be responsible for the POWER9 NPU2 unit (the
>> NVLink2 host bus adapter).
>>
>> Also, preserve backward compatibility for users that were binding
>> NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will
>> be dropped in the future
>>
>> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
>> ---
>>   drivers/vfio/pci/Kconfig                      |  28 +++-
>>   drivers/vfio/pci/Makefile                     |   7 +-
>>   .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 ++++++++++++++++-
>>   drivers/vfio/pci/npu2_vfio_pci.h              |  24 +++
>>   ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +++++++++++++++++-
>>   drivers/vfio/pci/nvlink2gpu_vfio_pci.h        |  24 +++
>>   drivers/vfio/pci/vfio_pci.c                   |  61 ++++++-
>>   drivers/vfio/pci/vfio_pci_core.c              |  18 ---
>>   drivers/vfio/pci/vfio_pci_core.h              |  14 --
>>   9 files changed, 422 insertions(+), 47 deletions(-)
>>   rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
>>   create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
>>   rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => 
>> nvlink2gpu_vfio_pci.c} (67%)
>>   create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h
>>
>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>> index 829e90a2e5a3..88c89863a205 100644
>> --- a/drivers/vfio/pci/Kconfig
>> +++ b/drivers/vfio/pci/Kconfig
>> @@ -48,8 +48,30 @@ config VFIO_PCI_IGD
>>           To enable Intel IGD assignment through vfio-pci, say Y.
>>   -config VFIO_PCI_NVLINK2
>> -    def_bool y
>> +config VFIO_PCI_NVLINK2GPU
>> +    tristate "VFIO support for NVIDIA NVLINK2 GPUs"
>>       depends on VFIO_PCI_CORE && PPC_POWERNV
>>       help
>> -      VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
>> +      VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions
>> +      for P9 Witherspoon machine.
>> +
>> +config VFIO_PCI_NPU2
>> +    tristate "VFIO support for IBM NPU host bus adapter on P9"
>> +    depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
>> +    help
>> +      VFIO PCI specific extensions for IBM NVLink2 host bus adapter 
>> on P9
>> +      Witherspoon machine.
>> +
>> +config VFIO_PCI_DRIVER_COMPAT
>> +    bool "VFIO PCI backward compatibility for vendor specific 
>> extensions"
>> +    default y
>> +    depends on VFIO_PCI
>> +    help
>> +      Say Y here if you want to preserve VFIO PCI backward
>> +      compatibility. vfio_pci.ko will continue to automatically use
>> +      the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
>> +      a compatible device.
>> +
>> +      When N is selected the user must bind explicitly to the module
>> +      they want to handle the device and vfio_pci.ko will have no
>> +      device specific special behaviors.
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index f539f32c9296..86fb62e271fc 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -2,10 +2,15 @@
>>     obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>>   obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> +obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
>> +obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
>>     vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o 
>> vfio_pci_rdwr.o vfio_pci_config.o
>>   vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>> -vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o 
>> vfio_pci_npu2.o
>>   vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
>>     vfio-pci-y := vfio_pci.o
>> +
>> +npu2-vfio-pci-y := npu2_vfio_pci.o
>> +
>> +nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
>> diff --git a/drivers/vfio/pci/vfio_pci_npu2.c 
>> b/drivers/vfio/pci/npu2_vfio_pci.c
>> similarity index 64%
>> rename from drivers/vfio/pci/vfio_pci_npu2.c
>> rename to drivers/vfio/pci/npu2_vfio_pci.c
>> index 717745256ab3..7071bda0f2b6 100644
>> --- a/drivers/vfio/pci/vfio_pci_npu2.c
>> +++ b/drivers/vfio/pci/npu2_vfio_pci.c
>> @@ -14,19 +14,28 @@
>>    *    Author: Alex Williamson <alex.williamson@redhat.com>
>>    */
>>   +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/module.h>
>>   #include <linux/io.h>
>>   #include <linux/pci.h>
>>   #include <linux/uaccess.h>
>>   #include <linux/vfio.h>
>> +#include <linux/list.h>
>>   #include <linux/sched/mm.h>
>>   #include <linux/mmu_context.h>
>>   #include <asm/kvm_ppc.h>
>>     #include "vfio_pci_core.h"
>> +#include "npu2_vfio_pci.h"
>>     #define CREATE_TRACE_POINTS
>>   #include "npu2_trace.h"
>>   +#define DRIVER_VERSION  "0.1"
>> +#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
>> +#define DRIVER_DESC     "NPU2 VFIO PCI - User Level meta-driver for 
>> POWER9 NPU NVLink2 HBA"
>> +
>>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
>>     struct vfio_pci_npu2_data {
>> @@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
>>       unsigned int link_speed; /* The link speed from DT's 
>> ibm,nvlink-speed */
>>   };
>>   +struct npu2_vfio_pci_device {
>> +    struct vfio_pci_core_device    vdev;
>> +};
>> +
>>   static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev,
>>           char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>>   {
>> @@ -120,7 +133,7 @@ static const struct vfio_pci_regops 
>> vfio_pci_npu2_regops = {
>>       .add_capability = vfio_pci_npu2_add_capability,
>>   };
>>   -int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
>> +static int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
>>   {
>>       int ret;
>>       struct vfio_pci_npu2_data *data;
>> @@ -220,3 +233,132 @@ int vfio_pci_ibm_npu2_init(struct 
>> vfio_pci_core_device *vdev)
>>         return ret;
>>   }
>> +
>> +static void npu2_vfio_pci_release(void *device_data)
>> +{
>> +    struct vfio_pci_core_device *vdev = device_data;
>> +
>> +    mutex_lock(&vdev->reflck->lock);
>> +    if (!(--vdev->refcnt)) {
>> +        vfio_pci_vf_token_user_add(vdev, -1);
>> +        vfio_pci_core_spapr_eeh_release(vdev);
>> +        vfio_pci_core_disable(vdev);
>> +    }
>> +    mutex_unlock(&vdev->reflck->lock);
>> +
>> +    module_put(THIS_MODULE);
>> +}
>> +
>> +static int npu2_vfio_pci_open(void *device_data)
>> +{
>> +    struct vfio_pci_core_device *vdev = device_data;
>> +    int ret = 0;
>> +
>> +    if (!try_module_get(THIS_MODULE))
>> +        return -ENODEV;
>> +
>> +    mutex_lock(&vdev->reflck->lock);
>> +
>> +    if (!vdev->refcnt) {
>> +        ret = vfio_pci_core_enable(vdev);
>> +        if (ret)
>> +            goto error;
>> +
>> +        ret = vfio_pci_ibm_npu2_init(vdev);
>> +        if (ret && ret != -ENODEV) {
>> +            pci_warn(vdev->pdev,
>> +                 "Failed to setup NVIDIA NV2 ATSD region\n");
>> +            vfio_pci_core_disable(vdev);
>> +            goto error;
>> +        }
>> +        ret = 0;
>> +        vfio_pci_probe_mmaps(vdev);
>> +        vfio_pci_core_spapr_eeh_open(vdev);
>> +        vfio_pci_vf_token_user_add(vdev, 1);
>> +    }
>> +    vdev->refcnt++;
>> +error:
>> +    mutex_unlock(&vdev->reflck->lock);
>> +    if (ret)
>> +        module_put(THIS_MODULE);
>> +    return ret;
>> +}
>> +
>> +static const struct vfio_device_ops npu2_vfio_pci_ops = {
>> +    .name        = "npu2-vfio-pci",
>> +    .open        = npu2_vfio_pci_open,
>> +    .release    = npu2_vfio_pci_release,
>> +    .ioctl        = vfio_pci_core_ioctl,
>> +    .read        = vfio_pci_core_read,
>> +    .write        = vfio_pci_core_write,
>> +    .mmap        = vfio_pci_core_mmap,
>> +    .request    = vfio_pci_core_request,
>> +    .match        = vfio_pci_core_match,
>> +};
>> +
>> +static int npu2_vfio_pci_probe(struct pci_dev *pdev,
>> +        const struct pci_device_id *id)
>> +{
>> +    struct npu2_vfio_pci_device *npvdev;
>> +    int ret;
>> +
>> +    npvdev = kzalloc(sizeof(*npvdev), GFP_KERNEL);
>> +    if (!npvdev)
>> +        return -ENOMEM;
>> +
>> +    ret = vfio_pci_core_register_device(&npvdev->vdev, pdev,
>> +            &npu2_vfio_pci_ops);
>> +    if (ret)
>> +        goto out_free;
>> +
>> +    return 0;
>> +
>> +out_free:
>> +    kfree(npvdev);
>> +    return ret;
>> +}
>> +
>> +static void npu2_vfio_pci_remove(struct pci_dev *pdev)
>> +{
>> +    struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>> +    struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
>> +    struct npu2_vfio_pci_device *npvdev;
>> +
>> +    npvdev = container_of(core_vpdev, struct npu2_vfio_pci_device, 
>> vdev);
>> +
>> +    vfio_pci_core_unregister_device(core_vpdev);
>> +    kfree(npvdev);
>> +}
>> +
>> +static const struct pci_device_id npu2_vfio_pci_table[] = {
>> +    { PCI_VDEVICE(IBM, 0x04ea) },
>> +    { 0, }
>> +};
>> +
>> +static struct pci_driver npu2_vfio_pci_driver = {
>> +    .name            = "npu2-vfio-pci",
>> +    .id_table        = npu2_vfio_pci_table,
>> +    .probe            = npu2_vfio_pci_probe,
>> +    .remove            = npu2_vfio_pci_remove,
>> +#ifdef CONFIG_PCI_IOV
>> +    .sriov_configure    = vfio_pci_core_sriov_configure,
>> +#endif
>> +    .err_handler        = &vfio_pci_core_err_handlers,
>> +};
>> +
>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
>> +{
>> +    if (pci_match_id(npu2_vfio_pci_driver.id_table, pdev))
>> +        return &npu2_vfio_pci_driver;
>> +    return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(get_npu2_vfio_pci_driver);
>> +#endif
>> +
>> +module_pci_driver(npu2_vfio_pci_driver);
>> +
>> +MODULE_VERSION(DRIVER_VERSION);
>> +MODULE_LICENSE("GPL v2");
>> +MODULE_AUTHOR(DRIVER_AUTHOR);
>> +MODULE_DESCRIPTION(DRIVER_DESC);
>> diff --git a/drivers/vfio/pci/npu2_vfio_pci.h 
>> b/drivers/vfio/pci/npu2_vfio_pci.h
>> new file mode 100644
>> index 000000000000..92010d340346
>> --- /dev/null
>> +++ b/drivers/vfio/pci/npu2_vfio_pci.h
>> @@ -0,0 +1,24 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights 
>> reserved.
>> + *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
>> + */
>> +
>> +#ifndef NPU2_VFIO_PCI_H
>> +#define NPU2_VFIO_PCI_H
>> +
>> +#include <linux/pci.h>
>> +#include <linux/module.h>
>> +
>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>> +#if defined(CONFIG_VFIO_PCI_NPU2) || 
>> defined(CONFIG_VFIO_PCI_NPU2_MODULE)
>> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev);
>> +#else
>> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
>> +{
>> +    return NULL;
>> +}
>> +#endif
>> +#endif
>> +
>> +#endif /* NPU2_VFIO_PCI_H */
>> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c 
>> b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
>> similarity index 67%
>> rename from drivers/vfio/pci/vfio_pci_nvlink2gpu.c
>> rename to drivers/vfio/pci/nvlink2gpu_vfio_pci.c
>> index 6dce1e78ee82..84a5ac1ce8ac 100644
>> --- a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c
>> +++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
>> @@ -1,6 +1,6 @@
>>   // SPDX-License-Identifier: GPL-2.0-only
>>   /*
>> - * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
>> + * VFIO PCI NVIDIA NVLink2 GPUs support.
>>    *
>>    * Copyright (C) 2018 IBM Corp.  All rights reserved.
>>    *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
>> @@ -12,6 +12,9 @@
>>    *    Author: Alex Williamson <alex.williamson@redhat.com>
>>    */
>>   +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/module.h>
>>   #include <linux/io.h>
>>   #include <linux/pci.h>
>>   #include <linux/uaccess.h>
>> @@ -21,10 +24,15 @@
>>   #include <asm/kvm_ppc.h>
>>     #include "vfio_pci_core.h"
>> +#include "nvlink2gpu_vfio_pci.h"
>>     #define CREATE_TRACE_POINTS
>>   #include "nvlink2gpu_trace.h"
>>   +#define DRIVER_VERSION  "0.1"
>> +#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
>> +#define DRIVER_DESC     "NVLINK2GPU VFIO PCI - User Level 
>> meta-driver for NVIDIA NVLink2 GPUs"
>> +
>>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault);
>>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap);
>>   @@ -39,6 +47,10 @@ struct vfio_pci_nvgpu_data {
>>       struct notifier_block group_notifier;
>>   };
>>   +struct nv_vfio_pci_device {
>> +    struct vfio_pci_core_device    vdev;
>> +};
>> +
>>   static size_t vfio_pci_nvgpu_rw(struct vfio_pci_core_device *vdev,
>>           char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>>   {
>> @@ -207,7 +219,8 @@ static int vfio_pci_nvgpu_group_notifier(struct 
>> notifier_block *nb,
>>       return NOTIFY_OK;
>>   }
>>   -int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device 
>> *vdev)
>> +static int
>> +vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
>>   {
>>       int ret;
>>       u64 reg[2];
>> @@ -293,3 +306,135 @@ int vfio_pci_nvidia_v100_nvlink2_init(struct 
>> vfio_pci_core_device *vdev)
>>         return ret;
>>   }
>> +
>> +static void nvlink2gpu_vfio_pci_release(void *device_data)
>> +{
>> +    struct vfio_pci_core_device *vdev = device_data;
>> +
>> +    mutex_lock(&vdev->reflck->lock);
>> +    if (!(--vdev->refcnt)) {
>> +        vfio_pci_vf_token_user_add(vdev, -1);
>> +        vfio_pci_core_spapr_eeh_release(vdev);
>> +        vfio_pci_core_disable(vdev);
>> +    }
>> +    mutex_unlock(&vdev->reflck->lock);
>> +
>> +    module_put(THIS_MODULE);
>> +}
>> +
>> +static int nvlink2gpu_vfio_pci_open(void *device_data)
>> +{
>> +    struct vfio_pci_core_device *vdev = device_data;
>> +    int ret = 0;
>> +
>> +    if (!try_module_get(THIS_MODULE))
>> +        return -ENODEV;
>> +
>> +    mutex_lock(&vdev->reflck->lock);
>> +
>> +    if (!vdev->refcnt) {
>> +        ret = vfio_pci_core_enable(vdev);
>> +        if (ret)
>> +            goto error;
>> +
>> +        ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
>> +        if (ret && ret != -ENODEV) {
>> +            pci_warn(vdev->pdev,
>> +                 "Failed to setup NVIDIA NV2 RAM region\n");
>> +            vfio_pci_core_disable(vdev);
>> +            goto error;
>> +        }
>> +        ret = 0;
>> +        vfio_pci_probe_mmaps(vdev);
>> +        vfio_pci_core_spapr_eeh_open(vdev);
>> +        vfio_pci_vf_token_user_add(vdev, 1);
>> +    }
>> +    vdev->refcnt++;
>> +error:
>> +    mutex_unlock(&vdev->reflck->lock);
>> +    if (ret)
>> +        module_put(THIS_MODULE);
>> +    return ret;
>> +}
>> +
>> +static const struct vfio_device_ops nvlink2gpu_vfio_pci_ops = {
>> +    .name        = "nvlink2gpu-vfio-pci",
>> +    .open        = nvlink2gpu_vfio_pci_open,
>> +    .release    = nvlink2gpu_vfio_pci_release,
>> +    .ioctl        = vfio_pci_core_ioctl,
>> +    .read        = vfio_pci_core_read,
>> +    .write        = vfio_pci_core_write,
>> +    .mmap        = vfio_pci_core_mmap,
>> +    .request    = vfio_pci_core_request,
>> +    .match        = vfio_pci_core_match,
>> +};
>> +
>> +static int nvlink2gpu_vfio_pci_probe(struct pci_dev *pdev,
>> +        const struct pci_device_id *id)
>> +{
>> +    struct nv_vfio_pci_device *nvdev;
>> +    int ret;
>> +
>> +    nvdev = kzalloc(sizeof(*nvdev), GFP_KERNEL);
>> +    if (!nvdev)
>> +        return -ENOMEM;
>> +
>> +    ret = vfio_pci_core_register_device(&nvdev->vdev, pdev,
>> +            &nvlink2gpu_vfio_pci_ops);
>> +    if (ret)
>> +        goto out_free;
>> +
>> +    return 0;
>> +
>> +out_free:
>> +    kfree(nvdev);
>> +    return ret;
>> +}
>> +
>> +static void nvlink2gpu_vfio_pci_remove(struct pci_dev *pdev)
>> +{
>> +    struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>> +    struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
>> +    struct nv_vfio_pci_device *nvdev;
>> +
>> +    nvdev = container_of(core_vpdev, struct nv_vfio_pci_device, vdev);
>> +
>> +    vfio_pci_core_unregister_device(core_vpdev);
>> +    kfree(nvdev);
>> +}
>> +
>> +static const struct pci_device_id nvlink2gpu_vfio_pci_table[] = {
>> +    { PCI_VDEVICE(NVIDIA, 0x1DB1) }, /* GV100GL-A NVIDIA Tesla 
>> V100-SXM2-16GB */
>> +    { PCI_VDEVICE(NVIDIA, 0x1DB5) }, /* GV100GL-A NVIDIA Tesla 
>> V100-SXM2-32GB */
>> +    { PCI_VDEVICE(NVIDIA, 0x1DB8) }, /* GV100GL-A NVIDIA Tesla 
>> V100-SXM3-32GB */
>> +    { PCI_VDEVICE(NVIDIA, 0x1DF5) }, /* GV100GL-B NVIDIA Tesla 
>> V100-SXM2-16GB */
>
>
> Where is this list from?
>
> Also, how is this supposed to work at boot time? Will the kernel try
> binding, let's say, this one and nouveau? Which one is going to win?

At boot time the nouveau driver will win, since the vfio drivers don't
declare MODULE_DEVICE_TABLE.
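
In other words, we intentionally do not add the line below; without it
there is no modalias, udev will not autoload these modules at boot, and
the regular GPU driver gets to bind first:

/* deliberately absent from nvlink2gpu_vfio_pci.c */
MODULE_DEVICE_TABLE(pci, nvlink2gpu_vfio_pci_table);

The user (or a management tool) loads the module and binds the device
explicitly instead.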


>
>
>> +    { 0, }
>
>
> Why a comma?

I'll remove the comma.


>
>> +};
>
>
>
>> +
>> +static struct pci_driver nvlink2gpu_vfio_pci_driver = {
>> +    .name            = "nvlink2gpu-vfio-pci",
>> +    .id_table        = nvlink2gpu_vfio_pci_table,
>> +    .probe            = nvlink2gpu_vfio_pci_probe,
>> +    .remove            = nvlink2gpu_vfio_pci_remove,
>> +#ifdef CONFIG_PCI_IOV
>> +    .sriov_configure    = vfio_pci_core_sriov_configure,
>> +#endif
>
>
> What is this IOV business about?

This comes from vfio_pci:

#ifdef CONFIG_PCI_IOV
module_param(enable_sriov, bool, 0644);
MODULE_PARM_DESC(enable_sriov, "Enable support for SR-IOV 
configuration.  Enabling SR-IOV on a PF typically requires support of 
the userspace PF driver, enabling VFs without such support may result in 
non-functional VFs or PF.");
#endif


>
>
>> +    .err_handler        = &vfio_pci_core_err_handlers,
>> +};
>> +
>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
>> +{
>> +    if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
>> +        return &nvlink2gpu_vfio_pci_driver;
>
>
> Why do we need matching PCI ids here instead of looking at the FDT 
> which will work better?

What is FDT? And is it better to use it instead of match_id?


>
>
>> +    return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(get_nvlink2gpu_vfio_pci_driver);
>> +#endif
>> +
>> +module_pci_driver(nvlink2gpu_vfio_pci_driver);
>> +
>> +MODULE_VERSION(DRIVER_VERSION);
>> +MODULE_LICENSE("GPL v2");
>> +MODULE_AUTHOR(DRIVER_AUTHOR);
>> +MODULE_DESCRIPTION(DRIVER_DESC);
>> diff --git a/drivers/vfio/pci/nvlink2gpu_vfio_pci.h 
>> b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
>> new file mode 100644
>> index 000000000000..ebd5b600b190
>> --- /dev/null
>> +++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
>> @@ -0,0 +1,24 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights 
>> reserved.
>> + *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
>> + */
>> +
>> +#ifndef NVLINK2GPU_VFIO_PCI_H
>> +#define NVLINK2GPU_VFIO_PCI_H
>> +
>> +#include <linux/pci.h>
>> +#include <linux/module.h>
>> +
>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>> +#if defined(CONFIG_VFIO_PCI_NVLINK2GPU) || 
>> defined(CONFIG_VFIO_PCI_NVLINK2GPU_MODULE)
>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev 
>> *pdev);
>> +#else
>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
>> +{
>> +    return NULL;
>> +}
>> +#endif
>> +#endif
>> +
>> +#endif /* NVLINK2GPU_VFIO_PCI_H */
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index dbc0a6559914..8e81ea039f31 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -27,6 +27,10 @@
>>   #include <linux/uaccess.h>
>>     #include "vfio_pci_core.h"
>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>> +#include "npu2_vfio_pci.h"
>> +#include "nvlink2gpu_vfio_pci.h"
>> +#endif
>>     #define DRIVER_VERSION  "0.2"
>>   #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
>> @@ -142,14 +146,48 @@ static const struct vfio_device_ops 
>> vfio_pci_ops = {
>>       .match        = vfio_pci_core_match,
>>   };
>>   +/*
>> + * This layer is used for backward compatibility. Hopefully it will be
>> + * removed in the future.
>> + */
>> +static struct pci_driver *vfio_pci_get_compat_driver(struct pci_dev 
>> *pdev)
>> +{
>> +    switch (pdev->vendor) {
>> +    case PCI_VENDOR_ID_NVIDIA:
>> +        switch (pdev->device) {
>> +        case 0x1db1:
>> +        case 0x1db5:
>> +        case 0x1db8:
>> +        case 0x1df5:
>> +            return get_nvlink2gpu_vfio_pci_driver(pdev);
>
> This does not really need a switch; it could simply call these
> get_xxxx_vfio_pci_driver() helpers. Thanks,

Maybe the result will be the same, but I don't think we need to send
all NVIDIA or IBM devices to this function.

We can maybe export the id tables from the vendor vfio_pci drivers and
match them here.
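
Something along these lines (untested, and it assumes the id tables are
exported from the two modules, which this patch does not do yet):

static struct pci_driver *vfio_pci_get_compat_driver(struct pci_dev *pdev)
{
	if (pci_match_id(nvlink2gpu_vfio_pci_table, pdev))
		return get_nvlink2gpu_vfio_pci_driver(pdev);
	if (pci_match_id(npu2_vfio_pci_table, pdev))
		return get_npu2_vfio_pci_driver(pdev);
	return NULL;
}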

>
>
>> +        default:
>> +            return NULL;
>> +        }
>> +    case PCI_VENDOR_ID_IBM:
>> +        switch (pdev->device) {
>> +        case 0x04ea:
>> +            return get_npu2_vfio_pci_driver(pdev);
>> +        default:
>> +            return NULL;
>> +        }
>> +    }
>> +
>> +    return NULL;
>> +}
>> +
>>   static int vfio_pci_probe(struct pci_dev *pdev, const struct 
>> pci_device_id *id)
>>   {
>>       struct vfio_pci_device *vpdev;
>> +    struct pci_driver *driver;
>>       int ret;
>>         if (vfio_pci_is_denylisted(pdev))
>>           return -EINVAL;
>>   +    driver = vfio_pci_get_compat_driver(pdev);
>> +    if (driver)
>> +        return driver->probe(pdev, id);
>> +
>>       vpdev = kzalloc(sizeof(*vpdev), GFP_KERNEL);
>>       if (!vpdev)
>>           return -ENOMEM;
>> @@ -167,14 +205,21 @@ static int vfio_pci_probe(struct pci_dev *pdev, 
>> const struct pci_device_id *id)
>>     static void vfio_pci_remove(struct pci_dev *pdev)
>>   {
>> -    struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>> -    struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
>> -    struct vfio_pci_device *vpdev;
>> -
>> -    vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
>> -
>> -    vfio_pci_core_unregister_device(core_vpdev);
>> -    kfree(vpdev);
>> +    struct pci_driver *driver;
>> +
>> +    driver = vfio_pci_get_compat_driver(pdev);
>> +    if (driver) {
>> +        driver->remove(pdev);
>> +    } else {
>> +        struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>> +        struct vfio_pci_core_device *core_vpdev;
>> +        struct vfio_pci_device *vpdev;
>> +
>> +        core_vpdev = vfio_device_data(vdev);
>> +        vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
>> +        vfio_pci_core_unregister_device(core_vpdev);
>> +        kfree(vpdev);
>> +    }
>>   }
>>     static int vfio_pci_sriov_configure(struct pci_dev *pdev, int 
>> nr_virtfn)
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c 
>> b/drivers/vfio/pci/vfio_pci_core.c
>> index 4de8e352df9c..f9b39abe54cb 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -354,24 +354,6 @@ int vfio_pci_core_enable(struct 
>> vfio_pci_core_device *vdev)
>>           }
>>       }
>>   -    if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>> -        IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
>> -        ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
>> -        if (ret && ret != -ENODEV) {
>> -            pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
>> -            goto disable_exit;
>> -        }
>> -    }
>> -
>> -    if (pdev->vendor == PCI_VENDOR_ID_IBM &&
>> -        IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
>> -        ret = vfio_pci_ibm_npu2_init(vdev);
>> -        if (ret && ret != -ENODEV) {
>> -            pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
>> -            goto disable_exit;
>> -        }
>> -    }
>> -
>>       return 0;
>>     disable_exit:
>> diff --git a/drivers/vfio/pci/vfio_pci_core.h 
>> b/drivers/vfio/pci/vfio_pci_core.h
>> index 8989443c3086..31f3836e606e 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.h
>> +++ b/drivers/vfio/pci/vfio_pci_core.h
>> @@ -204,20 +204,6 @@ static inline int vfio_pci_igd_init(struct 
>> vfio_pci_core_device *vdev)
>>       return -ENODEV;
>>   }
>>   #endif
>> -#ifdef CONFIG_VFIO_PCI_NVLINK2
>> -extern int vfio_pci_nvidia_v100_nvlink2_init(struct 
>> vfio_pci_core_device *vdev);
>> -extern int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev);
>> -#else
>> -static inline int vfio_pci_nvidia_v100_nvlink2_init(struct 
>> vfio_pci_core_device *vdev)
>> -{
>> -    return -ENODEV;
>> -}
>> -
>> -static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device 
>> *vdev)
>> -{
>> -    return -ENODEV;
>> -}
>> -#endif
>>     #ifdef CONFIG_S390
>>   extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device 
>> *vdev,
>>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-10 12:57     ` Max Gurtovoy
@ 2021-03-10 13:02       ` Jason Gunthorpe
  2021-03-10 14:24         ` Alexey Kardashevskiy
  2021-03-10 14:19       ` Alexey Kardashevskiy
  2021-03-19 15:23       ` Alex Williamson
  2 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-10 13:02 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Alexey Kardashevskiy, alex.williamson, cohuck, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch

On Wed, Mar 10, 2021 at 02:57:57PM +0200, Max Gurtovoy wrote:

> > > +    .err_handler        = &vfio_pci_core_err_handlers,
> > > +};
> > > +
> > > +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
> > > +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
> > > +{
> > > +    if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
> > > +        return &nvlink2gpu_vfio_pci_driver;
> > 
> > 
> > Why do we need matching PCI ids here instead of looking at the FDT which
> > will work better?
> 
> what is FDT ? any is it better to use it instead of match_id ?

This is emulating the device_driver match for the pci_driver.

I don't think we can combine FDT matching with pci_driver, can we?
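
The compat helper itself could of course look at the OF node instead of
PCI IDs, e.g. something like the sketch below (untested, needs
linux/of.h, and the property name is only a placeholder - I don't know
the Witherspoon DT layout), but that still doesn't give us FDT matching
at the driver core level:

struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
{
	struct device_node *np = pci_device_to_OF_node(pdev);

	/* placeholder property name, the real check is Alexey's call */
	if (np && of_find_property(np, "ibm,nvlink", NULL))
		return &nvlink2gpu_vfio_pci_driver;
	return NULL;
}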

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-10 12:57     ` Max Gurtovoy
  2021-03-10 13:02       ` Jason Gunthorpe
@ 2021-03-10 14:19       ` Alexey Kardashevskiy
  2021-03-11  1:10         ` Max Gurtovoy
  2021-03-19 15:23       ` Alex Williamson
  2 siblings, 1 reply; 53+ messages in thread
From: Alexey Kardashevskiy @ 2021-03-10 14:19 UTC (permalink / raw)
  To: Max Gurtovoy, jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch



On 10/03/2021 23:57, Max Gurtovoy wrote:
> 
> On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 09/03/2021 19:33, Max Gurtovoy wrote:
>>> The new drivers introduced are nvlink2gpu_vfio_pci.ko and
>>> npu2_vfio_pci.ko.
>>> The first will be responsible for providing special extensions for
>>> NVIDIA GPUs with NVLINK2 support on the P9 platform (and others in the
>>> future). The second will be responsible for the POWER9 NPU2 unit (the
>>> NVLink2 host bus adapter).
>>>
>>> Also, preserve backward compatibility for users that were binding
>>> NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will
>>> be dropped in the future
>>>
>>> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
>>> ---
>>>   drivers/vfio/pci/Kconfig                      |  28 +++-
>>>   drivers/vfio/pci/Makefile                     |   7 +-
>>>   .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 ++++++++++++++++-
>>>   drivers/vfio/pci/npu2_vfio_pci.h              |  24 +++
>>>   ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +++++++++++++++++-
>>>   drivers/vfio/pci/nvlink2gpu_vfio_pci.h        |  24 +++
>>>   drivers/vfio/pci/vfio_pci.c                   |  61 ++++++-
>>>   drivers/vfio/pci/vfio_pci_core.c              |  18 ---
>>>   drivers/vfio/pci/vfio_pci_core.h              |  14 --
>>>   9 files changed, 422 insertions(+), 47 deletions(-)
>>>   rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
>>>   create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
>>>   rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => 
>>> nvlink2gpu_vfio_pci.c} (67%)
>>>   create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h
>>>
>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>>> index 829e90a2e5a3..88c89863a205 100644
>>> --- a/drivers/vfio/pci/Kconfig
>>> +++ b/drivers/vfio/pci/Kconfig
>>> @@ -48,8 +48,30 @@ config VFIO_PCI_IGD
>>>           To enable Intel IGD assignment through vfio-pci, say Y.
>>>   -config VFIO_PCI_NVLINK2
>>> -    def_bool y
>>> +config VFIO_PCI_NVLINK2GPU
>>> +    tristate "VFIO support for NVIDIA NVLINK2 GPUs"
>>>       depends on VFIO_PCI_CORE && PPC_POWERNV
>>>       help
>>> -      VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
>>> +      VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions
>>> +      for P9 Witherspoon machine.
>>> +
>>> +config VFIO_PCI_NPU2
>>> +    tristate "VFIO support for IBM NPU host bus adapter on P9"
>>> +    depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
>>> +    help
>>> +      VFIO PCI specific extensions for IBM NVLink2 host bus adapter 
>>> on P9
>>> +      Witherspoon machine.
>>> +
>>> +config VFIO_PCI_DRIVER_COMPAT
>>> +    bool "VFIO PCI backward compatibility for vendor specific 
>>> extensions"
>>> +    default y
>>> +    depends on VFIO_PCI
>>> +    help
>>> +      Say Y here if you want to preserve VFIO PCI backward
>>> +      compatibility. vfio_pci.ko will continue to automatically use
>>> +      the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
>>> +      a compatible device.
>>> +
>>> +      When N is selected the user must bind explicitly to the module
>>> +      they want to handle the device and vfio_pci.ko will have no
>>> +      device specific special behaviors.
>>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>>> index f539f32c9296..86fb62e271fc 100644
>>> --- a/drivers/vfio/pci/Makefile
>>> +++ b/drivers/vfio/pci/Makefile
>>> @@ -2,10 +2,15 @@
>>>     obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>>>   obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>>> +obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
>>> +obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
>>>     vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o 
>>> vfio_pci_rdwr.o vfio_pci_config.o
>>>   vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>>> -vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o 
>>> vfio_pci_npu2.o
>>>   vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
>>>     vfio-pci-y := vfio_pci.o
>>> +
>>> +npu2-vfio-pci-y := npu2_vfio_pci.o
>>> +
>>> +nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
>>> diff --git a/drivers/vfio/pci/vfio_pci_npu2.c 
>>> b/drivers/vfio/pci/npu2_vfio_pci.c
>>> similarity index 64%
>>> rename from drivers/vfio/pci/vfio_pci_npu2.c
>>> rename to drivers/vfio/pci/npu2_vfio_pci.c
>>> index 717745256ab3..7071bda0f2b6 100644
>>> --- a/drivers/vfio/pci/vfio_pci_npu2.c
>>> +++ b/drivers/vfio/pci/npu2_vfio_pci.c
>>> @@ -14,19 +14,28 @@
>>>    *    Author: Alex Williamson <alex.williamson@redhat.com>
>>>    */
>>>   +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>>> +
>>> +#include <linux/module.h>
>>>   #include <linux/io.h>
>>>   #include <linux/pci.h>
>>>   #include <linux/uaccess.h>
>>>   #include <linux/vfio.h>
>>> +#include <linux/list.h>
>>>   #include <linux/sched/mm.h>
>>>   #include <linux/mmu_context.h>
>>>   #include <asm/kvm_ppc.h>
>>>     #include "vfio_pci_core.h"
>>> +#include "npu2_vfio_pci.h"
>>>     #define CREATE_TRACE_POINTS
>>>   #include "npu2_trace.h"
>>>   +#define DRIVER_VERSION  "0.1"
>>> +#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
>>> +#define DRIVER_DESC     "NPU2 VFIO PCI - User Level meta-driver for 
>>> POWER9 NPU NVLink2 HBA"
>>> +
>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
>>>     struct vfio_pci_npu2_data {
>>> @@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
>>>       unsigned int link_speed; /* The link speed from DT's 
>>> ibm,nvlink-speed */
>>>   };
>>>   +struct npu2_vfio_pci_device {
>>> +    struct vfio_pci_core_device    vdev;
>>> +};
>>> +
>>>   static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev,
>>>           char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>>>   {
>>> @@ -120,7 +133,7 @@ static const struct vfio_pci_regops 
>>> vfio_pci_npu2_regops = {
>>>       .add_capability = vfio_pci_npu2_add_capability,
>>>   };
>>>   -int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
>>> +static int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
>>>   {
>>>       int ret;
>>>       struct vfio_pci_npu2_data *data;
>>> @@ -220,3 +233,132 @@ int vfio_pci_ibm_npu2_init(struct 
>>> vfio_pci_core_device *vdev)
>>>         return ret;
>>>   }
>>> +
>>> +static void npu2_vfio_pci_release(void *device_data)
>>> +{
>>> +    struct vfio_pci_core_device *vdev = device_data;
>>> +
>>> +    mutex_lock(&vdev->reflck->lock);
>>> +    if (!(--vdev->refcnt)) {
>>> +        vfio_pci_vf_token_user_add(vdev, -1);
>>> +        vfio_pci_core_spapr_eeh_release(vdev);
>>> +        vfio_pci_core_disable(vdev);
>>> +    }
>>> +    mutex_unlock(&vdev->reflck->lock);
>>> +
>>> +    module_put(THIS_MODULE);
>>> +}
>>> +
>>> +static int npu2_vfio_pci_open(void *device_data)
>>> +{
>>> +    struct vfio_pci_core_device *vdev = device_data;
>>> +    int ret = 0;
>>> +
>>> +    if (!try_module_get(THIS_MODULE))
>>> +        return -ENODEV;
>>> +
>>> +    mutex_lock(&vdev->reflck->lock);
>>> +
>>> +    if (!vdev->refcnt) {
>>> +        ret = vfio_pci_core_enable(vdev);
>>> +        if (ret)
>>> +            goto error;
>>> +
>>> +        ret = vfio_pci_ibm_npu2_init(vdev);
>>> +        if (ret && ret != -ENODEV) {
>>> +            pci_warn(vdev->pdev,
>>> +                 "Failed to setup NVIDIA NV2 ATSD region\n");
>>> +            vfio_pci_core_disable(vdev);
>>> +            goto error;
>>> +        }
>>> +        ret = 0;
>>> +        vfio_pci_probe_mmaps(vdev);
>>> +        vfio_pci_core_spapr_eeh_open(vdev);
>>> +        vfio_pci_vf_token_user_add(vdev, 1);
>>> +    }
>>> +    vdev->refcnt++;
>>> +error:
>>> +    mutex_unlock(&vdev->reflck->lock);
>>> +    if (ret)
>>> +        module_put(THIS_MODULE);
>>> +    return ret;
>>> +}
>>> +
>>> +static const struct vfio_device_ops npu2_vfio_pci_ops = {
>>> +    .name        = "npu2-vfio-pci",
>>> +    .open        = npu2_vfio_pci_open,
>>> +    .release    = npu2_vfio_pci_release,
>>> +    .ioctl        = vfio_pci_core_ioctl,
>>> +    .read        = vfio_pci_core_read,
>>> +    .write        = vfio_pci_core_write,
>>> +    .mmap        = vfio_pci_core_mmap,
>>> +    .request    = vfio_pci_core_request,
>>> +    .match        = vfio_pci_core_match,
>>> +};
>>> +
>>> +static int npu2_vfio_pci_probe(struct pci_dev *pdev,
>>> +        const struct pci_device_id *id)
>>> +{
>>> +    struct npu2_vfio_pci_device *npvdev;
>>> +    int ret;
>>> +
>>> +    npvdev = kzalloc(sizeof(*npvdev), GFP_KERNEL);
>>> +    if (!npvdev)
>>> +        return -ENOMEM;
>>> +
>>> +    ret = vfio_pci_core_register_device(&npvdev->vdev, pdev,
>>> +            &npu2_vfio_pci_ops);
>>> +    if (ret)
>>> +        goto out_free;
>>> +
>>> +    return 0;
>>> +
>>> +out_free:
>>> +    kfree(npvdev);
>>> +    return ret;
>>> +}
>>> +
>>> +static void npu2_vfio_pci_remove(struct pci_dev *pdev)
>>> +{
>>> +    struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>>> +    struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
>>> +    struct npu2_vfio_pci_device *npvdev;
>>> +
>>> +    npvdev = container_of(core_vpdev, struct npu2_vfio_pci_device, 
>>> vdev);
>>> +
>>> +    vfio_pci_core_unregister_device(core_vpdev);
>>> +    kfree(npvdev);
>>> +}
>>> +
>>> +static const struct pci_device_id npu2_vfio_pci_table[] = {
>>> +    { PCI_VDEVICE(IBM, 0x04ea) },
>>> +    { 0, }
>>> +};
>>> +
>>> +static struct pci_driver npu2_vfio_pci_driver = {
>>> +    .name            = "npu2-vfio-pci",
>>> +    .id_table        = npu2_vfio_pci_table,
>>> +    .probe            = npu2_vfio_pci_probe,
>>> +    .remove            = npu2_vfio_pci_remove,
>>> +#ifdef CONFIG_PCI_IOV
>>> +    .sriov_configure    = vfio_pci_core_sriov_configure,
>>> +#endif
>>> +    .err_handler        = &vfio_pci_core_err_handlers,
>>> +};
>>> +
>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
>>> +{
>>> +    if (pci_match_id(npu2_vfio_pci_driver.id_table, pdev))
>>> +        return &npu2_vfio_pci_driver;
>>> +    return NULL;
>>> +}
>>> +EXPORT_SYMBOL_GPL(get_npu2_vfio_pci_driver);
>>> +#endif
>>> +
>>> +module_pci_driver(npu2_vfio_pci_driver);
>>> +
>>> +MODULE_VERSION(DRIVER_VERSION);
>>> +MODULE_LICENSE("GPL v2");
>>> +MODULE_AUTHOR(DRIVER_AUTHOR);
>>> +MODULE_DESCRIPTION(DRIVER_DESC);
>>> diff --git a/drivers/vfio/pci/npu2_vfio_pci.h 
>>> b/drivers/vfio/pci/npu2_vfio_pci.h
>>> new file mode 100644
>>> index 000000000000..92010d340346
>>> --- /dev/null
>>> +++ b/drivers/vfio/pci/npu2_vfio_pci.h
>>> @@ -0,0 +1,24 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights 
>>> reserved.
>>> + *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
>>> + */
>>> +
>>> +#ifndef NPU2_VFIO_PCI_H
>>> +#define NPU2_VFIO_PCI_H
>>> +
>>> +#include <linux/pci.h>
>>> +#include <linux/module.h>
>>> +
>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>> +#if defined(CONFIG_VFIO_PCI_NPU2) || 
>>> defined(CONFIG_VFIO_PCI_NPU2_MODULE)
>>> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev);
>>> +#else
>>> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
>>> +{
>>> +    return NULL;
>>> +}
>>> +#endif
>>> +#endif
>>> +
>>> +#endif /* NPU2_VFIO_PCI_H */
>>> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c 
>>> b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
>>> similarity index 67%
>>> rename from drivers/vfio/pci/vfio_pci_nvlink2gpu.c
>>> rename to drivers/vfio/pci/nvlink2gpu_vfio_pci.c
>>> index 6dce1e78ee82..84a5ac1ce8ac 100644
>>> --- a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c
>>> +++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
>>> @@ -1,6 +1,6 @@
>>>   // SPDX-License-Identifier: GPL-2.0-only
>>>   /*
>>> - * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
>>> + * VFIO PCI NVIDIA NVLink2 GPUs support.
>>>    *
>>>    * Copyright (C) 2018 IBM Corp.  All rights reserved.
>>>    *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> @@ -12,6 +12,9 @@
>>>    *    Author: Alex Williamson <alex.williamson@redhat.com>
>>>    */
>>>   +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>>> +
>>> +#include <linux/module.h>
>>>   #include <linux/io.h>
>>>   #include <linux/pci.h>
>>>   #include <linux/uaccess.h>
>>> @@ -21,10 +24,15 @@
>>>   #include <asm/kvm_ppc.h>
>>>     #include "vfio_pci_core.h"
>>> +#include "nvlink2gpu_vfio_pci.h"
>>>     #define CREATE_TRACE_POINTS
>>>   #include "nvlink2gpu_trace.h"
>>>   +#define DRIVER_VERSION  "0.1"
>>> +#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
>>> +#define DRIVER_DESC     "NVLINK2GPU VFIO PCI - User Level 
>>> meta-driver for NVIDIA NVLink2 GPUs"
>>> +
>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault);
>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap);
>>>   @@ -39,6 +47,10 @@ struct vfio_pci_nvgpu_data {
>>>       struct notifier_block group_notifier;
>>>   };
>>>   +struct nv_vfio_pci_device {
>>> +    struct vfio_pci_core_device    vdev;
>>> +};
>>> +
>>>   static size_t vfio_pci_nvgpu_rw(struct vfio_pci_core_device *vdev,
>>>           char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>>>   {
>>> @@ -207,7 +219,8 @@ static int vfio_pci_nvgpu_group_notifier(struct 
>>> notifier_block *nb,
>>>       return NOTIFY_OK;
>>>   }
>>>   -int vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device 
>>> *vdev)
>>> +static int
>>> +vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
>>>   {
>>>       int ret;
>>>       u64 reg[2];
>>> @@ -293,3 +306,135 @@ int vfio_pci_nvidia_v100_nvlink2_init(struct 
>>> vfio_pci_core_device *vdev)
>>>         return ret;
>>>   }
>>> +
>>> +static void nvlink2gpu_vfio_pci_release(void *device_data)
>>> +{
>>> +    struct vfio_pci_core_device *vdev = device_data;
>>> +
>>> +    mutex_lock(&vdev->reflck->lock);
>>> +    if (!(--vdev->refcnt)) {
>>> +        vfio_pci_vf_token_user_add(vdev, -1);
>>> +        vfio_pci_core_spapr_eeh_release(vdev);
>>> +        vfio_pci_core_disable(vdev);
>>> +    }
>>> +    mutex_unlock(&vdev->reflck->lock);
>>> +
>>> +    module_put(THIS_MODULE);
>>> +}
>>> +
>>> +static int nvlink2gpu_vfio_pci_open(void *device_data)
>>> +{
>>> +    struct vfio_pci_core_device *vdev = device_data;
>>> +    int ret = 0;
>>> +
>>> +    if (!try_module_get(THIS_MODULE))
>>> +        return -ENODEV;
>>> +
>>> +    mutex_lock(&vdev->reflck->lock);
>>> +
>>> +    if (!vdev->refcnt) {
>>> +        ret = vfio_pci_core_enable(vdev);
>>> +        if (ret)
>>> +            goto error;
>>> +
>>> +        ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
>>> +        if (ret && ret != -ENODEV) {
>>> +            pci_warn(vdev->pdev,
>>> +                 "Failed to setup NVIDIA NV2 RAM region\n");
>>> +            vfio_pci_core_disable(vdev);
>>> +            goto error;
>>> +        }
>>> +        ret = 0;
>>> +        vfio_pci_probe_mmaps(vdev);
>>> +        vfio_pci_core_spapr_eeh_open(vdev);
>>> +        vfio_pci_vf_token_user_add(vdev, 1);
>>> +    }
>>> +    vdev->refcnt++;
>>> +error:
>>> +    mutex_unlock(&vdev->reflck->lock);
>>> +    if (ret)
>>> +        module_put(THIS_MODULE);
>>> +    return ret;
>>> +}
>>> +
>>> +static const struct vfio_device_ops nvlink2gpu_vfio_pci_ops = {
>>> +    .name        = "nvlink2gpu-vfio-pci",
>>> +    .open        = nvlink2gpu_vfio_pci_open,
>>> +    .release    = nvlink2gpu_vfio_pci_release,
>>> +    .ioctl        = vfio_pci_core_ioctl,
>>> +    .read        = vfio_pci_core_read,
>>> +    .write        = vfio_pci_core_write,
>>> +    .mmap        = vfio_pci_core_mmap,
>>> +    .request    = vfio_pci_core_request,
>>> +    .match        = vfio_pci_core_match,
>>> +};
>>> +
>>> +static int nvlink2gpu_vfio_pci_probe(struct pci_dev *pdev,
>>> +        const struct pci_device_id *id)
>>> +{
>>> +    struct nv_vfio_pci_device *nvdev;
>>> +    int ret;
>>> +
>>> +    nvdev = kzalloc(sizeof(*nvdev), GFP_KERNEL);
>>> +    if (!nvdev)
>>> +        return -ENOMEM;
>>> +
>>> +    ret = vfio_pci_core_register_device(&nvdev->vdev, pdev,
>>> +            &nvlink2gpu_vfio_pci_ops);
>>> +    if (ret)
>>> +        goto out_free;
>>> +
>>> +    return 0;
>>> +
>>> +out_free:
>>> +    kfree(nvdev);
>>> +    return ret;
>>> +}
>>> +
>>> +static void nvlink2gpu_vfio_pci_remove(struct pci_dev *pdev)
>>> +{
>>> +    struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>>> +    struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
>>> +    struct nv_vfio_pci_device *nvdev;
>>> +
>>> +    nvdev = container_of(core_vpdev, struct nv_vfio_pci_device, vdev);
>>> +
>>> +    vfio_pci_core_unregister_device(core_vpdev);
>>> +    kfree(nvdev);
>>> +}
>>> +
>>> +static const struct pci_device_id nvlink2gpu_vfio_pci_table[] = {
>>> +    { PCI_VDEVICE(NVIDIA, 0x1DB1) }, /* GV100GL-A NVIDIA Tesla 
>>> V100-SXM2-16GB */
>>> +    { PCI_VDEVICE(NVIDIA, 0x1DB5) }, /* GV100GL-A NVIDIA Tesla 
>>> V100-SXM2-32GB */
>>> +    { PCI_VDEVICE(NVIDIA, 0x1DB8) }, /* GV100GL-A NVIDIA Tesla 
>>> V100-SXM3-32GB */
>>> +    { PCI_VDEVICE(NVIDIA, 0x1DF5) }, /* GV100GL-B NVIDIA Tesla 
>>> V100-SXM2-16GB */
>>
>>
>> Where is this list from?
>>
>> Also, how is this supposed to work at the boot time? Will the kernel 
>> try binding let's say this one and nouveau? Which one is going to win?
> 
> At boot time the nouveau driver will win, since the vfio drivers don't 
> declare MODULE_DEVICE_TABLE.
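
(Just to make the mechanics explicit: the sketch below is the single line 
the new drivers deliberately omit; with it, udev/modprobe would autoload 
the module from its aliases and it would start competing with nouveau at 
boot. Untested, illustration only.)

/*
 * Deliberately NOT present in nvlink2gpu_vfio_pci.c / npu2_vfio_pci.c:
 * exporting the id table as module aliases would make the module
 * autoload as soon as a matching device is discovered.
 */
MODULE_DEVICE_TABLE(pci, nvlink2gpu_vfio_pci_table);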


ok but where is the list from anyway?


> 
> 
>>
>>
>>> +    { 0, }
>>
>>
>> Why a comma?
> 
> I'll remove the comma.
> 
> 
>>
>>> +};
>>
>>
>>
>>> +
>>> +static struct pci_driver nvlink2gpu_vfio_pci_driver = {
>>> +    .name            = "nvlink2gpu-vfio-pci",
>>> +    .id_table        = nvlink2gpu_vfio_pci_table,
>>> +    .probe            = nvlink2gpu_vfio_pci_probe,
>>> +    .remove            = nvlink2gpu_vfio_pci_remove,
>>> +#ifdef CONFIG_PCI_IOV
>>> +    .sriov_configure    = vfio_pci_core_sriov_configure,
>>> +#endif
>>
>>
>> What is this IOV business about?
> 
> from vfio_pci
> 
> #ifdef CONFIG_PCI_IOV
> module_param(enable_sriov, bool, 0644);
> MODULE_PARM_DESC(enable_sriov, "Enable support for SR-IOV 
> configuration.  Enabling SR-IOV on a PF typically requires support of 
> the userspace PF driver, enabling VFs without such support may result in 
> non-functional VFs or PF.");
> #endif


I know what IOV is in general :) What I meant to say was that I am 
pretty sure these GPUs cannot do IOV so this does not need to be in 
these NVLink drivers.



> 
> 
>>
>>
>>> +    .err_handler        = &vfio_pci_core_err_handlers,
>>> +};
>>> +
>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
>>> +{
>>> +    if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
>>> +        return &nvlink2gpu_vfio_pci_driver;
>>
>>
>> Why do we need matching PCI ids here instead of looking at the FDT 
>> which will work better?
> 
> what is FDT? and is it better to use it instead of match_id?


Flattened Device Tree - a way for the firmware to pass the configuration 
to the OS. This data tells whether there are NVLinks and what they are 
linked to, and it defines whether the feature is available, as it should 
work with any GPU in this form factor.
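
For illustration, a rough, untested sketch of what an FDT-based check could 
look like on the kernel side ("ibm,nvlink" is only a placeholder property 
name here; the real binding would have to be checked against the Witherspoon 
device tree):

#include <linux/of.h>
#include <linux/pci.h>

static bool pdev_has_nvlink(struct pci_dev *pdev)
{
	/* pci_device_to_OF_node() returns the FDT node of this PCI function */
	struct device_node *np = pci_device_to_OF_node(pdev);

	return np && of_find_property(np, "ibm,nvlink", NULL);
}

Keying the special support off what the firmware actually wired up avoids 
maintaining a PCI id list entirely.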


> 
>>
>>
>>> +    return NULL;
>>> +}
>>> +EXPORT_SYMBOL_GPL(get_nvlink2gpu_vfio_pci_driver);
>>> +#endif
>>> +
>>> +module_pci_driver(nvlink2gpu_vfio_pci_driver);
>>> +
>>> +MODULE_VERSION(DRIVER_VERSION);
>>> +MODULE_LICENSE("GPL v2");
>>> +MODULE_AUTHOR(DRIVER_AUTHOR);
>>> +MODULE_DESCRIPTION(DRIVER_DESC);
>>> diff --git a/drivers/vfio/pci/nvlink2gpu_vfio_pci.h 
>>> b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
>>> new file mode 100644
>>> index 000000000000..ebd5b600b190
>>> --- /dev/null
>>> +++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
>>> @@ -0,0 +1,24 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights 
>>> reserved.
>>> + *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
>>> + */
>>> +
>>> +#ifndef NVLINK2GPU_VFIO_PCI_H
>>> +#define NVLINK2GPU_VFIO_PCI_H
>>> +
>>> +#include <linux/pci.h>
>>> +#include <linux/module.h>
>>> +
>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>> +#if defined(CONFIG_VFIO_PCI_NVLINK2GPU) || 
>>> defined(CONFIG_VFIO_PCI_NVLINK2GPU_MODULE)
>>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev 
>>> *pdev);
>>> +#else
>>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
>>> +{
>>> +    return NULL;
>>> +}
>>> +#endif
>>> +#endif
>>> +
>>> +#endif /* NVLINK2GPU_VFIO_PCI_H */
>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>> index dbc0a6559914..8e81ea039f31 100644
>>> --- a/drivers/vfio/pci/vfio_pci.c
>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>> @@ -27,6 +27,10 @@
>>>   #include <linux/uaccess.h>
>>>     #include "vfio_pci_core.h"
>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>> +#include "npu2_vfio_pci.h"
>>> +#include "nvlink2gpu_vfio_pci.h"
>>> +#endif
>>>     #define DRIVER_VERSION  "0.2"
>>>   #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
>>> @@ -142,14 +146,48 @@ static const struct vfio_device_ops 
>>> vfio_pci_ops = {
>>>       .match        = vfio_pci_core_match,
>>>   };
>>>   +/*
>>> + * This layer is used for backward compatibility. Hopefully it will be
>>> + * removed in the future.
>>> + */
>>> +static struct pci_driver *vfio_pci_get_compat_driver(struct pci_dev 
>>> *pdev)
>>> +{
>>> +    switch (pdev->vendor) {
>>> +    case PCI_VENDOR_ID_NVIDIA:
>>> +        switch (pdev->device) {
>>> +        case 0x1db1:
>>> +        case 0x1db5:
>>> +        case 0x1db8:
>>> +        case 0x1df5:
>>> +            return get_nvlink2gpu_vfio_pci_driver(pdev);
>>
>> This does not really need a switch, could simply call these 
>> get_xxxx_vfio_pci_driver. Thanks,
> 
> maybe the result will be the same but I don't think we need to send all 
> NVIDIA devices or IBM devices to this function.

We can tolerate this on POWER (the check is really cheap) and for 
everybody else this driver won't even compile.


> we can maybe export the tables from the vfio_vendor driver and match it 
> here.


I am still missing the point of device matching. It won't bind by 
default at boot time, and it won't make existing users' lives any 
easier, as they use libvirt, which overrides this anyway.


>>
>>
>>> +        default:
>>> +            return NULL;
>>> +        }
>>> +    case PCI_VENDOR_ID_IBM:
>>> +        switch (pdev->device) {
>>> +        case 0x04ea:
>>> +            return get_npu2_vfio_pci_driver(pdev);
>>> +        default:
>>> +            return NULL;
>>> +        }
>>> +    }
>>> +
>>> +    return NULL;
>>> +}
>>> +
>>>   static int vfio_pci_probe(struct pci_dev *pdev, const struct 
>>> pci_device_id *id)
>>>   {
>>>       struct vfio_pci_device *vpdev;
>>> +    struct pci_driver *driver;
>>>       int ret;
>>>         if (vfio_pci_is_denylisted(pdev))
>>>           return -EINVAL;
>>>   +    driver = vfio_pci_get_compat_driver(pdev);
>>> +    if (driver)
>>> +        return driver->probe(pdev, id);
>>> +
>>>       vpdev = kzalloc(sizeof(*vpdev), GFP_KERNEL);
>>>       if (!vpdev)
>>>           return -ENOMEM;
>>> @@ -167,14 +205,21 @@ static int vfio_pci_probe(struct pci_dev *pdev, 
>>> const struct pci_device_id *id)
>>>     static void vfio_pci_remove(struct pci_dev *pdev)
>>>   {
>>> -    struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>>> -    struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
>>> -    struct vfio_pci_device *vpdev;
>>> -
>>> -    vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
>>> -
>>> -    vfio_pci_core_unregister_device(core_vpdev);
>>> -    kfree(vpdev);
>>> +    struct pci_driver *driver;
>>> +
>>> +    driver = vfio_pci_get_compat_driver(pdev);
>>> +    if (driver) {
>>> +        driver->remove(pdev);
>>> +    } else {
>>> +        struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>>> +        struct vfio_pci_core_device *core_vpdev;
>>> +        struct vfio_pci_device *vpdev;
>>> +
>>> +        core_vpdev = vfio_device_data(vdev);
>>> +        vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
>>> +        vfio_pci_core_unregister_device(core_vpdev);
>>> +        kfree(vpdev);
>>> +    }
>>>   }
>>>     static int vfio_pci_sriov_configure(struct pci_dev *pdev, int 
>>> nr_virtfn)
>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c 
>>> b/drivers/vfio/pci/vfio_pci_core.c
>>> index 4de8e352df9c..f9b39abe54cb 100644
>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>> @@ -354,24 +354,6 @@ int vfio_pci_core_enable(struct 
>>> vfio_pci_core_device *vdev)
>>>           }
>>>       }
>>>   -    if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>>> -        IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
>>> -        ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
>>> -        if (ret && ret != -ENODEV) {
>>> -            pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
>>> -            goto disable_exit;
>>> -        }
>>> -    }
>>> -
>>> -    if (pdev->vendor == PCI_VENDOR_ID_IBM &&
>>> -        IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
>>> -        ret = vfio_pci_ibm_npu2_init(vdev);
>>> -        if (ret && ret != -ENODEV) {
>>> -            pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
>>> -            goto disable_exit;
>>> -        }
>>> -    }
>>> -
>>>       return 0;
>>>     disable_exit:
>>> diff --git a/drivers/vfio/pci/vfio_pci_core.h 
>>> b/drivers/vfio/pci/vfio_pci_core.h
>>> index 8989443c3086..31f3836e606e 100644
>>> --- a/drivers/vfio/pci/vfio_pci_core.h
>>> +++ b/drivers/vfio/pci/vfio_pci_core.h
>>> @@ -204,20 +204,6 @@ static inline int vfio_pci_igd_init(struct 
>>> vfio_pci_core_device *vdev)
>>>       return -ENODEV;
>>>   }
>>>   #endif
>>> -#ifdef CONFIG_VFIO_PCI_NVLINK2
>>> -extern int vfio_pci_nvidia_v100_nvlink2_init(struct 
>>> vfio_pci_core_device *vdev);
>>> -extern int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev);
>>> -#else
>>> -static inline int vfio_pci_nvidia_v100_nvlink2_init(struct 
>>> vfio_pci_core_device *vdev)
>>> -{
>>> -    return -ENODEV;
>>> -}
>>> -
>>> -static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device 
>>> *vdev)
>>> -{
>>> -    return -ENODEV;
>>> -}
>>> -#endif
>>>     #ifdef CONFIG_S390
>>>   extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device 
>>> *vdev,
>>>
>>

-- 
Alexey

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-10 13:02       ` Jason Gunthorpe
@ 2021-03-10 14:24         ` Alexey Kardashevskiy
  2021-03-10 19:40           ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Alexey Kardashevskiy @ 2021-03-10 14:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Max Gurtovoy
  Cc: alex.williamson, cohuck, kvm, linux-kernel, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	cjia, yishaih, mjrosato, hch



On 11/03/2021 00:02, Jason Gunthorpe wrote:
> On Wed, Mar 10, 2021 at 02:57:57PM +0200, Max Gurtovoy wrote:
> 
>>>> +    .err_handler        = &vfio_pci_core_err_handlers,
>>>> +};
>>>> +
>>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
>>>> +{
>>>> +    if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
>>>> +        return &nvlink2gpu_vfio_pci_driver;
>>>
>>>
>>> Why do we need matching PCI ids here instead of looking at the FDT which
>>> will work better?
>>
>> what is FDT ? any is it better to use it instead of match_id ?
> 
> This is emulating the device_driver match for the pci_driver.


No it is not, it is device tree info which lets the OS skip the Linux PCI 
discovery part (the firmware does it anyway), but it tells nothing about 
which drivers to bind.


> I don't think we can combine FDT matching with pci_driver, can we?

It is a C function calling another C function, all within vfio-pci; this 
is not called by the generic PCI code.



-- 
Alexey

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-10 14:24         ` Alexey Kardashevskiy
@ 2021-03-10 19:40           ` Jason Gunthorpe
  2021-03-11  1:20             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-10 19:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Max Gurtovoy, alex.williamson, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch

On Thu, Mar 11, 2021 at 01:24:47AM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 11/03/2021 00:02, Jason Gunthorpe wrote:
> > On Wed, Mar 10, 2021 at 02:57:57PM +0200, Max Gurtovoy wrote:
> > 
> > > > > +    .err_handler        = &vfio_pci_core_err_handlers,
> > > > > +};
> > > > > +
> > > > > +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
> > > > > +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
> > > > > +{
> > > > > +    if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
> > > > > +        return &nvlink2gpu_vfio_pci_driver;
> > > > 
> > > > 
> > > > Why do we need matching PCI ids here instead of looking at the FDT which
> > > > will work better?
> > > 
> > > what is FDT ? any is it better to use it instead of match_id ?
> > 
> > This is emulating the device_driver match for the pci_driver.
> 
> No it is not, it is a device tree info which lets to skip the linux PCI
> discovery part (the firmware does it anyway) but it tells nothing about
> which drivers to bind.

I mean get_nvlink2gpu_vfio_pci_driver() is emulating the PCI match.

Max added a pci driver for NPU here:

+static struct pci_driver npu2_vfio_pci_driver = {
+	.name			= "npu2-vfio-pci",
+	.id_table		= npu2_vfio_pci_table,
+	.probe			= npu2_vfio_pci_probe,


new userspace should use driver_override with "npu2-vfio-pci" as the
string, not "vfio-pci".
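
As a sketch of that flow from userspace (untested; the BDF is just an
example, and the device is assumed to be unbound from its current driver
first -- the shell equivalent is echoing these names into sysfs):

#include <stdio.h>

int main(void)
{
	const char *bdf = "0000:04:00.0";	/* example device */
	char path[128];
	FILE *f;

	/* select the split driver instead of plain "vfio-pci" */
	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/driver_override", bdf);
	f = fopen(path, "w");
	if (!f)
		return 1;
	fputs("npu2-vfio-pci", f);
	fclose(f);

	/* ask the PCI core to probe the device with the override driver */
	f = fopen("/sys/bus/pci/drivers_probe", "w");
	if (!f)
		return 1;
	fputs(bdf, f);
	fclose(f);
	return 0;
}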

The point of the get_npu2_vfio_pci_driver() is only optional
compatibility to redirect old userspace using "vfio-pci" in the
driver_override to the now split driver code so userspace doesn't see
any change in behavior.

If we don't do this then the vfio-pci driver override will disable the
npu2 special stuff, since Max took it all out of vfio-pci's
pci_driver.

It is supposed to match exactly the same match table as the pci_driver
above. We *don't* want different behavior from what the standrd PCI
driver matcher will do.

Since we don't have any way to mix FDT discovery into the standard
PCI driver match, it will still attach the npu driver but not enable
any special support. This seems OK.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-10 14:19       ` Alexey Kardashevskiy
@ 2021-03-11  1:10         ` Max Gurtovoy
  0 siblings, 0 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-11  1:10 UTC (permalink / raw)
  To: Alexey Kardashevskiy, jgg, alex.williamson, cohuck, kvm, linux-kernel
  Cc: liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch


On 3/10/2021 4:19 PM, Alexey Kardashevskiy wrote:
>
>
> On 10/03/2021 23:57, Max Gurtovoy wrote:
>>
>> On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 09/03/2021 19:33, Max Gurtovoy wrote:
>>>> The new drivers introduced are nvlink2gpu_vfio_pci.ko and
>>>> npu2_vfio_pci.ko.
>>>> The first will be responsible for providing special extensions for
>>>> NVIDIA GPUs with NVLINK2 support for P9 platform (and others in the
>>>> future). The last will be responsible for POWER9 NPU2 unit (NVLink2 
>>>> host
>>>> bus adapter).
>>>>
>>>> Also, preserve backward compatibility for users that were binding
>>>> NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer 
>>>> will
>>>> be dropped in the future
>>>>
>>>> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
>>>> ---
>>>>   drivers/vfio/pci/Kconfig                      |  28 +++-
>>>>   drivers/vfio/pci/Makefile                     |   7 +-
>>>>   .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 
>>>> ++++++++++++++++-
>>>>   drivers/vfio/pci/npu2_vfio_pci.h              |  24 +++
>>>>   ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 
>>>> +++++++++++++++++-
>>>>   drivers/vfio/pci/nvlink2gpu_vfio_pci.h        |  24 +++
>>>>   drivers/vfio/pci/vfio_pci.c                   |  61 ++++++-
>>>>   drivers/vfio/pci/vfio_pci_core.c              |  18 ---
>>>>   drivers/vfio/pci/vfio_pci_core.h              |  14 --
>>>>   9 files changed, 422 insertions(+), 47 deletions(-)
>>>>   rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
>>>>   create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
>>>>   rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => 
>>>> nvlink2gpu_vfio_pci.c} (67%)
>>>>   create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h
>>>>
>>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>>>> index 829e90a2e5a3..88c89863a205 100644
>>>> --- a/drivers/vfio/pci/Kconfig
>>>> +++ b/drivers/vfio/pci/Kconfig
>>>> @@ -48,8 +48,30 @@ config VFIO_PCI_IGD
>>>>           To enable Intel IGD assignment through vfio-pci, say Y.
>>>>   -config VFIO_PCI_NVLINK2
>>>> -    def_bool y
>>>> +config VFIO_PCI_NVLINK2GPU
>>>> +    tristate "VFIO support for NVIDIA NVLINK2 GPUs"
>>>>       depends on VFIO_PCI_CORE && PPC_POWERNV
>>>>       help
>>>> -      VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 
>>>> GPUs
>>>> +      VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific 
>>>> extensions
>>>> +      for P9 Witherspoon machine.
>>>> +
>>>> +config VFIO_PCI_NPU2
>>>> +    tristate "VFIO support for IBM NPU host bus adapter on P9"
>>>> +    depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
>>>> +    help
>>>> +      VFIO PCI specific extensions for IBM NVLink2 host bus 
>>>> adapter on P9
>>>> +      Witherspoon machine.
>>>> +
>>>> +config VFIO_PCI_DRIVER_COMPAT
>>>> +    bool "VFIO PCI backward compatibility for vendor specific 
>>>> extensions"
>>>> +    default y
>>>> +    depends on VFIO_PCI
>>>> +    help
>>>> +      Say Y here if you want to preserve VFIO PCI backward
>>>> +      compatibility. vfio_pci.ko will continue to automatically use
>>>> +      the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
>>>> +      a compatible device.
>>>> +
>>>> +      When N is selected the user must bind explicity to the module
>>>> +      they want to handle the device and vfio_pci.ko will have no
>>>> +      device specific special behaviors.
>>>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>>>> index f539f32c9296..86fb62e271fc 100644
>>>> --- a/drivers/vfio/pci/Makefile
>>>> +++ b/drivers/vfio/pci/Makefile
>>>> @@ -2,10 +2,15 @@
>>>>     obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>>>>   obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>>>> +obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
>>>> +obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
>>>>     vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o 
>>>> vfio_pci_rdwr.o vfio_pci_config.o
>>>>   vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>>>> -vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o 
>>>> vfio_pci_npu2.o
>>>>   vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
>>>>     vfio-pci-y := vfio_pci.o
>>>> +
>>>> +npu2-vfio-pci-y := npu2_vfio_pci.o
>>>> +
>>>> +nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
>>>> diff --git a/drivers/vfio/pci/vfio_pci_npu2.c 
>>>> b/drivers/vfio/pci/npu2_vfio_pci.c
>>>> similarity index 64%
>>>> rename from drivers/vfio/pci/vfio_pci_npu2.c
>>>> rename to drivers/vfio/pci/npu2_vfio_pci.c
>>>> index 717745256ab3..7071bda0f2b6 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_npu2.c
>>>> +++ b/drivers/vfio/pci/npu2_vfio_pci.c
>>>> @@ -14,19 +14,28 @@
>>>>    *    Author: Alex Williamson <alex.williamson@redhat.com>
>>>>    */
>>>>   +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>>>> +
>>>> +#include <linux/module.h>
>>>>   #include <linux/io.h>
>>>>   #include <linux/pci.h>
>>>>   #include <linux/uaccess.h>
>>>>   #include <linux/vfio.h>
>>>> +#include <linux/list.h>
>>>>   #include <linux/sched/mm.h>
>>>>   #include <linux/mmu_context.h>
>>>>   #include <asm/kvm_ppc.h>
>>>>     #include "vfio_pci_core.h"
>>>> +#include "npu2_vfio_pci.h"
>>>>     #define CREATE_TRACE_POINTS
>>>>   #include "npu2_trace.h"
>>>>   +#define DRIVER_VERSION  "0.1"
>>>> +#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
>>>> +#define DRIVER_DESC     "NPU2 VFIO PCI - User Level meta-driver 
>>>> for POWER9 NPU NVLink2 HBA"
>>>> +
>>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
>>>>     struct vfio_pci_npu2_data {
>>>> @@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
>>>>       unsigned int link_speed; /* The link speed from DT's 
>>>> ibm,nvlink-speed */
>>>>   };
>>>>   +struct npu2_vfio_pci_device {
>>>> +    struct vfio_pci_core_device    vdev;
>>>> +};
>>>> +
>>>>   static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev,
>>>>           char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>>>>   {
>>>> @@ -120,7 +133,7 @@ static const struct vfio_pci_regops 
>>>> vfio_pci_npu2_regops = {
>>>>       .add_capability = vfio_pci_npu2_add_capability,
>>>>   };
>>>>   -int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
>>>> +static int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev)
>>>>   {
>>>>       int ret;
>>>>       struct vfio_pci_npu2_data *data;
>>>> @@ -220,3 +233,132 @@ int vfio_pci_ibm_npu2_init(struct 
>>>> vfio_pci_core_device *vdev)
>>>>         return ret;
>>>>   }
>>>> +
>>>> +static void npu2_vfio_pci_release(void *device_data)
>>>> +{
>>>> +    struct vfio_pci_core_device *vdev = device_data;
>>>> +
>>>> +    mutex_lock(&vdev->reflck->lock);
>>>> +    if (!(--vdev->refcnt)) {
>>>> +        vfio_pci_vf_token_user_add(vdev, -1);
>>>> +        vfio_pci_core_spapr_eeh_release(vdev);
>>>> +        vfio_pci_core_disable(vdev);
>>>> +    }
>>>> +    mutex_unlock(&vdev->reflck->lock);
>>>> +
>>>> +    module_put(THIS_MODULE);
>>>> +}
>>>> +
>>>> +static int npu2_vfio_pci_open(void *device_data)
>>>> +{
>>>> +    struct vfio_pci_core_device *vdev = device_data;
>>>> +    int ret = 0;
>>>> +
>>>> +    if (!try_module_get(THIS_MODULE))
>>>> +        return -ENODEV;
>>>> +
>>>> +    mutex_lock(&vdev->reflck->lock);
>>>> +
>>>> +    if (!vdev->refcnt) {
>>>> +        ret = vfio_pci_core_enable(vdev);
>>>> +        if (ret)
>>>> +            goto error;
>>>> +
>>>> +        ret = vfio_pci_ibm_npu2_init(vdev);
>>>> +        if (ret && ret != -ENODEV) {
>>>> +            pci_warn(vdev->pdev,
>>>> +                 "Failed to setup NVIDIA NV2 ATSD region\n");
>>>> +            vfio_pci_core_disable(vdev);
>>>> +            goto error;
>>>> +        }
>>>> +        ret = 0;
>>>> +        vfio_pci_probe_mmaps(vdev);
>>>> +        vfio_pci_core_spapr_eeh_open(vdev);
>>>> +        vfio_pci_vf_token_user_add(vdev, 1);
>>>> +    }
>>>> +    vdev->refcnt++;
>>>> +error:
>>>> +    mutex_unlock(&vdev->reflck->lock);
>>>> +    if (ret)
>>>> +        module_put(THIS_MODULE);
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static const struct vfio_device_ops npu2_vfio_pci_ops = {
>>>> +    .name        = "npu2-vfio-pci",
>>>> +    .open        = npu2_vfio_pci_open,
>>>> +    .release    = npu2_vfio_pci_release,
>>>> +    .ioctl        = vfio_pci_core_ioctl,
>>>> +    .read        = vfio_pci_core_read,
>>>> +    .write        = vfio_pci_core_write,
>>>> +    .mmap        = vfio_pci_core_mmap,
>>>> +    .request    = vfio_pci_core_request,
>>>> +    .match        = vfio_pci_core_match,
>>>> +};
>>>> +
>>>> +static int npu2_vfio_pci_probe(struct pci_dev *pdev,
>>>> +        const struct pci_device_id *id)
>>>> +{
>>>> +    struct npu2_vfio_pci_device *npvdev;
>>>> +    int ret;
>>>> +
>>>> +    npvdev = kzalloc(sizeof(*npvdev), GFP_KERNEL);
>>>> +    if (!npvdev)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    ret = vfio_pci_core_register_device(&npvdev->vdev, pdev,
>>>> +            &npu2_vfio_pci_ops);
>>>> +    if (ret)
>>>> +        goto out_free;
>>>> +
>>>> +    return 0;
>>>> +
>>>> +out_free:
>>>> +    kfree(npvdev);
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static void npu2_vfio_pci_remove(struct pci_dev *pdev)
>>>> +{
>>>> +    struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>>>> +    struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
>>>> +    struct npu2_vfio_pci_device *npvdev;
>>>> +
>>>> +    npvdev = container_of(core_vpdev, struct npu2_vfio_pci_device, 
>>>> vdev);
>>>> +
>>>> +    vfio_pci_core_unregister_device(core_vpdev);
>>>> +    kfree(npvdev);
>>>> +}
>>>> +
>>>> +static const struct pci_device_id npu2_vfio_pci_table[] = {
>>>> +    { PCI_VDEVICE(IBM, 0x04ea) },
>>>> +    { 0, }
>>>> +};
>>>> +
>>>> +static struct pci_driver npu2_vfio_pci_driver = {
>>>> +    .name            = "npu2-vfio-pci",
>>>> +    .id_table        = npu2_vfio_pci_table,
>>>> +    .probe            = npu2_vfio_pci_probe,
>>>> +    .remove            = npu2_vfio_pci_remove,
>>>> +#ifdef CONFIG_PCI_IOV
>>>> +    .sriov_configure    = vfio_pci_core_sriov_configure,
>>>> +#endif
>>>> +    .err_handler        = &vfio_pci_core_err_handlers,
>>>> +};
>>>> +
>>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>>> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
>>>> +{
>>>> +    if (pci_match_id(npu2_vfio_pci_driver.id_table, pdev))
>>>> +        return &npu2_vfio_pci_driver;
>>>> +    return NULL;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(get_npu2_vfio_pci_driver);
>>>> +#endif
>>>> +
>>>> +module_pci_driver(npu2_vfio_pci_driver);
>>>> +
>>>> +MODULE_VERSION(DRIVER_VERSION);
>>>> +MODULE_LICENSE("GPL v2");
>>>> +MODULE_AUTHOR(DRIVER_AUTHOR);
>>>> +MODULE_DESCRIPTION(DRIVER_DESC);
>>>> diff --git a/drivers/vfio/pci/npu2_vfio_pci.h 
>>>> b/drivers/vfio/pci/npu2_vfio_pci.h
>>>> new file mode 100644
>>>> index 000000000000..92010d340346
>>>> --- /dev/null
>>>> +++ b/drivers/vfio/pci/npu2_vfio_pci.h
>>>> @@ -0,0 +1,24 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/*
>>>> + * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights 
>>>> reserved.
>>>> + *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
>>>> + */
>>>> +
>>>> +#ifndef NPU2_VFIO_PCI_H
>>>> +#define NPU2_VFIO_PCI_H
>>>> +
>>>> +#include <linux/pci.h>
>>>> +#include <linux/module.h>
>>>> +
>>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>>> +#if defined(CONFIG_VFIO_PCI_NPU2) || 
>>>> defined(CONFIG_VFIO_PCI_NPU2_MODULE)
>>>> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev);
>>>> +#else
>>>> +struct pci_driver *get_npu2_vfio_pci_driver(struct pci_dev *pdev)
>>>> +{
>>>> +    return NULL;
>>>> +}
>>>> +#endif
>>>> +#endif
>>>> +
>>>> +#endif /* NPU2_VFIO_PCI_H */
>>>> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c 
>>>> b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
>>>> similarity index 67%
>>>> rename from drivers/vfio/pci/vfio_pci_nvlink2gpu.c
>>>> rename to drivers/vfio/pci/nvlink2gpu_vfio_pci.c
>>>> index 6dce1e78ee82..84a5ac1ce8ac 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_nvlink2gpu.c
>>>> +++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.c
>>>> @@ -1,6 +1,6 @@
>>>>   // SPDX-License-Identifier: GPL-2.0-only
>>>>   /*
>>>> - * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
>>>> + * VFIO PCI NVIDIA NVLink2 GPUs support.
>>>>    *
>>>>    * Copyright (C) 2018 IBM Corp.  All rights reserved.
>>>>    *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> @@ -12,6 +12,9 @@
>>>>    *    Author: Alex Williamson <alex.williamson@redhat.com>
>>>>    */
>>>>   +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>>>> +
>>>> +#include <linux/module.h>
>>>>   #include <linux/io.h>
>>>>   #include <linux/pci.h>
>>>>   #include <linux/uaccess.h>
>>>> @@ -21,10 +24,15 @@
>>>>   #include <asm/kvm_ppc.h>
>>>>     #include "vfio_pci_core.h"
>>>> +#include "nvlink2gpu_vfio_pci.h"
>>>>     #define CREATE_TRACE_POINTS
>>>>   #include "nvlink2gpu_trace.h"
>>>>   +#define DRIVER_VERSION  "0.1"
>>>> +#define DRIVER_AUTHOR   "Alexey Kardashevskiy <aik@ozlabs.ru>"
>>>> +#define DRIVER_DESC     "NVLINK2GPU VFIO PCI - User Level 
>>>> meta-driver for NVIDIA NVLink2 GPUs"
>>>> +
>>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault);
>>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap);
>>>>   @@ -39,6 +47,10 @@ struct vfio_pci_nvgpu_data {
>>>>       struct notifier_block group_notifier;
>>>>   };
>>>>   +struct nv_vfio_pci_device {
>>>> +    struct vfio_pci_core_device    vdev;
>>>> +};
>>>> +
>>>>   static size_t vfio_pci_nvgpu_rw(struct vfio_pci_core_device *vdev,
>>>>           char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>>>>   {
>>>> @@ -207,7 +219,8 @@ static int vfio_pci_nvgpu_group_notifier(struct 
>>>> notifier_block *nb,
>>>>       return NOTIFY_OK;
>>>>   }
>>>>   -int vfio_pci_nvidia_v100_nvlink2_init(struct 
>>>> vfio_pci_core_device *vdev)
>>>> +static int
>>>> +vfio_pci_nvidia_v100_nvlink2_init(struct vfio_pci_core_device *vdev)
>>>>   {
>>>>       int ret;
>>>>       u64 reg[2];
>>>> @@ -293,3 +306,135 @@ int vfio_pci_nvidia_v100_nvlink2_init(struct 
>>>> vfio_pci_core_device *vdev)
>>>>         return ret;
>>>>   }
>>>> +
>>>> +static void nvlink2gpu_vfio_pci_release(void *device_data)
>>>> +{
>>>> +    struct vfio_pci_core_device *vdev = device_data;
>>>> +
>>>> +    mutex_lock(&vdev->reflck->lock);
>>>> +    if (!(--vdev->refcnt)) {
>>>> +        vfio_pci_vf_token_user_add(vdev, -1);
>>>> +        vfio_pci_core_spapr_eeh_release(vdev);
>>>> +        vfio_pci_core_disable(vdev);
>>>> +    }
>>>> +    mutex_unlock(&vdev->reflck->lock);
>>>> +
>>>> +    module_put(THIS_MODULE);
>>>> +}
>>>> +
>>>> +static int nvlink2gpu_vfio_pci_open(void *device_data)
>>>> +{
>>>> +    struct vfio_pci_core_device *vdev = device_data;
>>>> +    int ret = 0;
>>>> +
>>>> +    if (!try_module_get(THIS_MODULE))
>>>> +        return -ENODEV;
>>>> +
>>>> +    mutex_lock(&vdev->reflck->lock);
>>>> +
>>>> +    if (!vdev->refcnt) {
>>>> +        ret = vfio_pci_core_enable(vdev);
>>>> +        if (ret)
>>>> +            goto error;
>>>> +
>>>> +        ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
>>>> +        if (ret && ret != -ENODEV) {
>>>> +            pci_warn(vdev->pdev,
>>>> +                 "Failed to setup NVIDIA NV2 RAM region\n");
>>>> +            vfio_pci_core_disable(vdev);
>>>> +            goto error;
>>>> +        }
>>>> +        ret = 0;
>>>> +        vfio_pci_probe_mmaps(vdev);
>>>> +        vfio_pci_core_spapr_eeh_open(vdev);
>>>> +        vfio_pci_vf_token_user_add(vdev, 1);
>>>> +    }
>>>> +    vdev->refcnt++;
>>>> +error:
>>>> +    mutex_unlock(&vdev->reflck->lock);
>>>> +    if (ret)
>>>> +        module_put(THIS_MODULE);
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static const struct vfio_device_ops nvlink2gpu_vfio_pci_ops = {
>>>> +    .name        = "nvlink2gpu-vfio-pci",
>>>> +    .open        = nvlink2gpu_vfio_pci_open,
>>>> +    .release    = nvlink2gpu_vfio_pci_release,
>>>> +    .ioctl        = vfio_pci_core_ioctl,
>>>> +    .read        = vfio_pci_core_read,
>>>> +    .write        = vfio_pci_core_write,
>>>> +    .mmap        = vfio_pci_core_mmap,
>>>> +    .request    = vfio_pci_core_request,
>>>> +    .match        = vfio_pci_core_match,
>>>> +};
>>>> +
>>>> +static int nvlink2gpu_vfio_pci_probe(struct pci_dev *pdev,
>>>> +        const struct pci_device_id *id)
>>>> +{
>>>> +    struct nv_vfio_pci_device *nvdev;
>>>> +    int ret;
>>>> +
>>>> +    nvdev = kzalloc(sizeof(*nvdev), GFP_KERNEL);
>>>> +    if (!nvdev)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    ret = vfio_pci_core_register_device(&nvdev->vdev, pdev,
>>>> +            &nvlink2gpu_vfio_pci_ops);
>>>> +    if (ret)
>>>> +        goto out_free;
>>>> +
>>>> +    return 0;
>>>> +
>>>> +out_free:
>>>> +    kfree(nvdev);
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static void nvlink2gpu_vfio_pci_remove(struct pci_dev *pdev)
>>>> +{
>>>> +    struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>>>> +    struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
>>>> +    struct nv_vfio_pci_device *nvdev;
>>>> +
>>>> +    nvdev = container_of(core_vpdev, struct nv_vfio_pci_device, 
>>>> vdev);
>>>> +
>>>> +    vfio_pci_core_unregister_device(core_vpdev);
>>>> +    kfree(nvdev);
>>>> +}
>>>> +
>>>> +static const struct pci_device_id nvlink2gpu_vfio_pci_table[] = {
>>>> +    { PCI_VDEVICE(NVIDIA, 0x1DB1) }, /* GV100GL-A NVIDIA Tesla 
>>>> V100-SXM2-16GB */
>>>> +    { PCI_VDEVICE(NVIDIA, 0x1DB5) }, /* GV100GL-A NVIDIA Tesla 
>>>> V100-SXM2-32GB */
>>>> +    { PCI_VDEVICE(NVIDIA, 0x1DB8) }, /* GV100GL-A NVIDIA Tesla 
>>>> V100-SXM3-32GB */
>>>> +    { PCI_VDEVICE(NVIDIA, 0x1DF5) }, /* GV100GL-B NVIDIA Tesla 
>>>> V100-SXM2-16GB */
>>>
>>>
>>> Where is this list from?
>>>
>>> Also, how is this supposed to work at the boot time? Will the kernel 
>>> try binding let's say this one and nouveau? Which one is going to win?
>>
>> At boot time nouveau driver will win since the vfio drivers don't 
>> declare MODULE_DEVICE_TABLE
>
>
> ok but where is the list from anyway?

I did some checking and was told that the SXM devices were the ones 
that were integrated into P9.

If you or anyone on the mailing list has comments here, please add 
them and I'll double check.

>
>
>>
>>
>>>
>>>
>>>> +    { 0, }
>>>
>>>
>>> Why a comma?
>>
>> I'll remove the comma.
>>
>>
>>>
>>>> +};
>>>
>>>
>>>
>>>> +
>>>> +static struct pci_driver nvlink2gpu_vfio_pci_driver = {
>>>> +    .name            = "nvlink2gpu-vfio-pci",
>>>> +    .id_table        = nvlink2gpu_vfio_pci_table,
>>>> +    .probe            = nvlink2gpu_vfio_pci_probe,
>>>> +    .remove            = nvlink2gpu_vfio_pci_remove,
>>>> +#ifdef CONFIG_PCI_IOV
>>>> +    .sriov_configure    = vfio_pci_core_sriov_configure,
>>>> +#endif
>>>
>>>
>>> What is this IOV business about?
>>
>> from vfio_pci
>>
>> #ifdef CONFIG_PCI_IOV
>> module_param(enable_sriov, bool, 0644);
>> MODULE_PARM_DESC(enable_sriov, "Enable support for SR-IOV 
>> configuration.  Enabling SR-IOV on a PF typically requires support of 
>> the userspace PF driver, enabling VFs without such support may result 
>> in non-functional VFs or PF.");
>> #endif
>
>
> I know what IOV is in general :) What I meant to say was that I am 
> pretty sure these GPUs cannot do IOV so this does not need to be in 
> these NVLink drivers.

Thanks.

I'll verify it and remove for v4.


>
>
>
>>
>>
>>>
>>>
>>>> +    .err_handler        = &vfio_pci_core_err_handlers,
>>>> +};
>>>> +
>>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev 
>>>> *pdev)
>>>> +{
>>>> +    if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
>>>> +        return &nvlink2gpu_vfio_pci_driver;
>>>
>>>
>>> Why do we need matching PCI ids here instead of looking at the FDT 
>>> which will work better?
>>
>> what is FDT ? any is it better to use it instead of match_id ?
>
>
> Flattened Device Tree - a way for the firmware to pass the 
> configuration to the OS. This data tells if there are NVLinks and what 
> they are linked to. This defines if the feature is available as it 
> should work with any GPU in this form factor.
>
>
>>
>>>
>>>
>>>> +    return NULL;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(get_nvlink2gpu_vfio_pci_driver);
>>>> +#endif
>>>> +
>>>> +module_pci_driver(nvlink2gpu_vfio_pci_driver);
>>>> +
>>>> +MODULE_VERSION(DRIVER_VERSION);
>>>> +MODULE_LICENSE("GPL v2");
>>>> +MODULE_AUTHOR(DRIVER_AUTHOR);
>>>> +MODULE_DESCRIPTION(DRIVER_DESC);
>>>> diff --git a/drivers/vfio/pci/nvlink2gpu_vfio_pci.h 
>>>> b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
>>>> new file mode 100644
>>>> index 000000000000..ebd5b600b190
>>>> --- /dev/null
>>>> +++ b/drivers/vfio/pci/nvlink2gpu_vfio_pci.h
>>>> @@ -0,0 +1,24 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/*
>>>> + * Copyright (c) 2020, Mellanox Technologies, Ltd.  All rights 
>>>> reserved.
>>>> + *     Author: Max Gurtovoy <mgurtovoy@nvidia.com>
>>>> + */
>>>> +
>>>> +#ifndef NVLINK2GPU_VFIO_PCI_H
>>>> +#define NVLINK2GPU_VFIO_PCI_H
>>>> +
>>>> +#include <linux/pci.h>
>>>> +#include <linux/module.h>
>>>> +
>>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>>> +#if defined(CONFIG_VFIO_PCI_NVLINK2GPU) || 
>>>> defined(CONFIG_VFIO_PCI_NVLINK2GPU_MODULE)
>>>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev 
>>>> *pdev);
>>>> +#else
>>>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev 
>>>> *pdev)
>>>> +{
>>>> +    return NULL;
>>>> +}
>>>> +#endif
>>>> +#endif
>>>> +
>>>> +#endif /* NVLINK2GPU_VFIO_PCI_H */
>>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>>> index dbc0a6559914..8e81ea039f31 100644
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -27,6 +27,10 @@
>>>>   #include <linux/uaccess.h>
>>>>     #include "vfio_pci_core.h"
>>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>>> +#include "npu2_vfio_pci.h"
>>>> +#include "nvlink2gpu_vfio_pci.h"
>>>> +#endif
>>>>     #define DRIVER_VERSION  "0.2"
>>>>   #define DRIVER_AUTHOR   "Alex Williamson 
>>>> <alex.williamson@redhat.com>"
>>>> @@ -142,14 +146,48 @@ static const struct vfio_device_ops 
>>>> vfio_pci_ops = {
>>>>       .match        = vfio_pci_core_match,
>>>>   };
>>>>   +/*
>>>> + * This layer is used for backward compatibility. Hopefully it 
>>>> will be
>>>> + * removed in the future.
>>>> + */
>>>> +static struct pci_driver *vfio_pci_get_compat_driver(struct 
>>>> pci_dev *pdev)
>>>> +{
>>>> +    switch (pdev->vendor) {
>>>> +    case PCI_VENDOR_ID_NVIDIA:
>>>> +        switch (pdev->device) {
>>>> +        case 0x1db1:
>>>> +        case 0x1db5:
>>>> +        case 0x1db8:
>>>> +        case 0x1df5:
>>>> +            return get_nvlink2gpu_vfio_pci_driver(pdev);
>>>
>>> This does not really need a switch, could simply call these 
>>> get_xxxx_vfio_pci_driver. Thanks,
>>
>> maybe the result will be the same but I don't think we need to send 
>> all NVIDIA devices or IBM devices to this function.
>
> We can tolerate this on POWER (the check is really cheap) and for 
> everybody else this driver won't even compile.

I'll improve this function.

Thanks.


>
>
>> we can maybe export the tables from the vfio_vendor driver and match 
>> it here.
>
>
> I am still missing the point of device matching. It won't bind by 
> default at the boot time and it won't make the existing user life any 
> easier as they use libvirt which overrides this anyway.

We're trying to improve the subsystem to be more flexible, and we still 
want to preserve backward compatibility for the near future.


>
>
>>>
>>>
>>>> +        default:
>>>> +            return NULL;
>>>> +        }
>>>> +    case PCI_VENDOR_ID_IBM:
>>>> +        switch (pdev->device) {
>>>> +        case 0x04ea:
>>>> +            return get_npu2_vfio_pci_driver(pdev);
>>>> +        default:
>>>> +            return NULL;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return NULL;
>>>> +}
>>>> +
>>>>   static int vfio_pci_probe(struct pci_dev *pdev, const struct 
>>>> pci_device_id *id)
>>>>   {
>>>>       struct vfio_pci_device *vpdev;
>>>> +    struct pci_driver *driver;
>>>>       int ret;
>>>>         if (vfio_pci_is_denylisted(pdev))
>>>>           return -EINVAL;
>>>>   +    driver = vfio_pci_get_compat_driver(pdev);
>>>> +    if (driver)
>>>> +        return driver->probe(pdev, id);
>>>> +
>>>>       vpdev = kzalloc(sizeof(*vpdev), GFP_KERNEL);
>>>>       if (!vpdev)
>>>>           return -ENOMEM;
>>>> @@ -167,14 +205,21 @@ static int vfio_pci_probe(struct pci_dev 
>>>> *pdev, const struct pci_device_id *id)
>>>>     static void vfio_pci_remove(struct pci_dev *pdev)
>>>>   {
>>>> -    struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>>>> -    struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
>>>> -    struct vfio_pci_device *vpdev;
>>>> -
>>>> -    vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
>>>> -
>>>> -    vfio_pci_core_unregister_device(core_vpdev);
>>>> -    kfree(vpdev);
>>>> +    struct pci_driver *driver;
>>>> +
>>>> +    driver = vfio_pci_get_compat_driver(pdev);
>>>> +    if (driver) {
>>>> +        driver->remove(pdev);
>>>> +    } else {
>>>> +        struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
>>>> +        struct vfio_pci_core_device *core_vpdev;
>>>> +        struct vfio_pci_device *vpdev;
>>>> +
>>>> +        core_vpdev = vfio_device_data(vdev);
>>>> +        vpdev = container_of(core_vpdev, struct vfio_pci_device, 
>>>> vdev);
>>>> +        vfio_pci_core_unregister_device(core_vpdev);
>>>> +        kfree(vpdev);
>>>> +    }
>>>>   }
>>>>     static int vfio_pci_sriov_configure(struct pci_dev *pdev, int 
>>>> nr_virtfn)
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c 
>>>> b/drivers/vfio/pci/vfio_pci_core.c
>>>> index 4de8e352df9c..f9b39abe54cb 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -354,24 +354,6 @@ int vfio_pci_core_enable(struct 
>>>> vfio_pci_core_device *vdev)
>>>>           }
>>>>       }
>>>>   -    if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>>>> -        IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
>>>> -        ret = vfio_pci_nvidia_v100_nvlink2_init(vdev);
>>>> -        if (ret && ret != -ENODEV) {
>>>> -            pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM 
>>>> region\n");
>>>> -            goto disable_exit;
>>>> -        }
>>>> -    }
>>>> -
>>>> -    if (pdev->vendor == PCI_VENDOR_ID_IBM &&
>>>> -        IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
>>>> -        ret = vfio_pci_ibm_npu2_init(vdev);
>>>> -        if (ret && ret != -ENODEV) {
>>>> -            pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD 
>>>> region\n");
>>>> -            goto disable_exit;
>>>> -        }
>>>> -    }
>>>> -
>>>>       return 0;
>>>>     disable_exit:
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.h 
>>>> b/drivers/vfio/pci/vfio_pci_core.h
>>>> index 8989443c3086..31f3836e606e 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.h
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.h
>>>> @@ -204,20 +204,6 @@ static inline int vfio_pci_igd_init(struct 
>>>> vfio_pci_core_device *vdev)
>>>>       return -ENODEV;
>>>>   }
>>>>   #endif
>>>> -#ifdef CONFIG_VFIO_PCI_NVLINK2
>>>> -extern int vfio_pci_nvidia_v100_nvlink2_init(struct 
>>>> vfio_pci_core_device *vdev);
>>>> -extern int vfio_pci_ibm_npu2_init(struct vfio_pci_core_device *vdev);
>>>> -#else
>>>> -static inline int vfio_pci_nvidia_v100_nvlink2_init(struct 
>>>> vfio_pci_core_device *vdev)
>>>> -{
>>>> -    return -ENODEV;
>>>> -}
>>>> -
>>>> -static inline int vfio_pci_ibm_npu2_init(struct 
>>>> vfio_pci_core_device *vdev)
>>>> -{
>>>> -    return -ENODEV;
>>>> -}
>>>> -#endif
>>>>     #ifdef CONFIG_S390
>>>>   extern int vfio_pci_info_zdev_add_caps(struct 
>>>> vfio_pci_core_device *vdev,
>>>>
>>>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-10 19:40           ` Jason Gunthorpe
@ 2021-03-11  1:20             ` Alexey Kardashevskiy
  2021-03-11  1:34               ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Alexey Kardashevskiy @ 2021-03-11  1:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Max Gurtovoy, alex.williamson, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch



On 11/03/2021 06:40, Jason Gunthorpe wrote:
> On Thu, Mar 11, 2021 at 01:24:47AM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 11/03/2021 00:02, Jason Gunthorpe wrote:
>>> On Wed, Mar 10, 2021 at 02:57:57PM +0200, Max Gurtovoy wrote:
>>>
>>>>>> +    .err_handler        = &vfio_pci_core_err_handlers,
>>>>>> +};
>>>>>> +
>>>>>> +#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
>>>>>> +struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
>>>>>> +{
>>>>>> +    if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
>>>>>> +        return &nvlink2gpu_vfio_pci_driver;
>>>>>
>>>>>
>>>>> Why do we need matching PCI ids here instead of looking at the FDT which
>>>>> will work better?
>>>>
>>>> what is FDT ? any is it better to use it instead of match_id ?
>>>
>>> This is emulating the device_driver match for the pci_driver.
>>
>> No it is not, it is a device tree info which lets to skip the linux PCI
>> discovery part (the firmware does it anyway) but it tells nothing about
>> which drivers to bind.
> 
> I mean get_nvlink2gpu_vfio_pci_driver() is emulating the PCI match.
> 
> Max added a pci driver for NPU here:
> 
> +static struct pci_driver npu2_vfio_pci_driver = {
> +	.name			= "npu2-vfio-pci",
> +	.id_table		= npu2_vfio_pci_table,
> +	.probe			= npu2_vfio_pci_probe,
> 
> 
> new userspace should use driver_override with "npu-vfio-pci" as the
> string not "vfio-pci"
> 
> The point of the get_npu2_vfio_pci_driver() is only optional
> compatibility to redirect old userspace using "vfio-pci" in the
> driver_override to the now split driver code so userspace doesn't see
> any change in behavior.
> 
> If we don't do this then the vfio-pci driver override will disable the
> npu2 special stuff, since Max took it all out of vfio-pci's
> pci_driver.
> 
> It is supposed to match exactly the same match table as the pci_driver
> above. We *don't* want different behavior from what the standard PCI
> driver matcher will do.


This is not a standard PCI driver though, and the main vfio-pci won't
ever have a list to match. The IBM NPU PCI id is unlikely to ever change,
but NVIDIA keeps making new devices which work in those P9 boxes; are you
going to keep adding those ids to nvlink2gpu_vfio_pci_table? Btw, can the
id list have only vendor ids and not device ids?


> Since we don't have any way to mix in FDT discovery to the standard
> PCI driver match it will still attach the npu driver but not enable
> any special support. This seems OK.



-- 
Alexey

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-11  1:20             ` Alexey Kardashevskiy
@ 2021-03-11  1:34               ` Jason Gunthorpe
  2021-03-11  1:42                 ` Alexey Kardashevskiy
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-11  1:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Max Gurtovoy, alex.williamson, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch

On Thu, Mar 11, 2021 at 12:20:33PM +1100, Alexey Kardashevskiy wrote:

> > It is supposed to match exactly the same match table as the pci_driver
> > above. We *don't* want different behavior from what the standard PCI
> > driver matcher will do.
> 
> This is not a standard PCI driver though 

It is now, that is what this patch makes it into. This is why it now
has a struct pci_driver.

> and the main vfio-pci won't have a
> list to match ever.

?? vfio-pci uses driver_override or new_id to manage its match list

> IBM NPU PCI id is unlikely to change ever but NVIDIA keeps making
> new devices which work in those P9 boxes, are you going to keep
> adding those ids to nvlink2gpu_vfio_pci_table?

Certainly, as needed. PCI id list updates are normal for the kernel.

> btw can the id list have only vendor ids and not have device ids?

The PCI matcher is quite flexible; see the other patch from Max for
the igd.
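
For instance (an invented example, not from Max's series), a
pci_device_id table can match an entire vendor or a whole device class:

#include <linux/pci.h>

/* invented example: vendor-wide and class-based match entries */
static const struct pci_device_id example_vfio_pci_table[] = {
        /* any NVIDIA device, regardless of device id */
        { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID) },
        /* any 3D-controller-class device, from any vendor */
        { PCI_DEVICE_CLASS(PCI_CLASS_DISPLAY_3D << 8, 0xffff00) },
        { 0, }
};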

But best practice is to be as narrow as possible as I hope this will
eventually impact module autoloading and other details.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-11  1:34               ` Jason Gunthorpe
@ 2021-03-11  1:42                 ` Alexey Kardashevskiy
  2021-03-11  2:00                   ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Alexey Kardashevskiy @ 2021-03-11  1:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Max Gurtovoy, alex.williamson, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch



On 11/03/2021 12:34, Jason Gunthorpe wrote:
> On Thu, Mar 11, 2021 at 12:20:33PM +1100, Alexey Kardashevskiy wrote:
> 
>>> It is supposed to match exactly the same match table as the pci_driver
>>> above. We *don't* want different behavior from what the standard PCI
>>> driver matcher will do.
>>
>> This is not a standard PCI driver though
> 
> It is now, that is what this patch makes it into. This is why it now
> has a struct pci_driver.
> 
>> and the main vfio-pci won't have a
>> list to match ever.
> 
> ?? vfio-pci uses driver_override or new_id to manage its match list


Exactly, no list to update.


>> IBM NPU PCI id is unlikely to change ever but NVIDIA keeps making
>> new devices which work in those P9 boxes, are you going to keep
>> adding those ids to nvlink2gpu_vfio_pci_table?
> 
> Certainly, as needed. PCI id list updates are normal for the kernel.
> 
>> btw can the id list have only vendor ids and not have device ids?
> 
> The PCI matcher is quite flexible; see the other patch from Max for
> the igd.


ah cool, do this for NVIDIA GPUs then please, I just discovered another 
P9 system sold with NVIDIA T4s which is not in your list.


> But best practice is to be as narrow as possible as I hope this will
> eventually impact module autoloading and other details.

The amount of device-specific knowledge is too little to tie it to
device ids; it is a generic PCI driver with quirks. We do not have
separate drivers for the hardware which requires quirks.

And how do you hope this should impact autoloading?



-- 
Alexey

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-11  1:42                 ` Alexey Kardashevskiy
@ 2021-03-11  2:00                   ` Jason Gunthorpe
  2021-03-11  7:54                     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-11  2:00 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Max Gurtovoy, alex.williamson, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch

On Thu, Mar 11, 2021 at 12:42:56PM +1100, Alexey Kardashevskiy wrote:
> > > btw can the id list have only vendor ids and not have device ids?
> > 
> > The PCI matcher is quite flexible; see the other patch from Max for
> > the igd.
>  
> ah cool, do this for NVIDIA GPUs then please, I just discovered another P9
> system sold with NVIDIA T4s which is not in your list.

I think it will make things easier down the road if you maintain an
exact list <shrug>

> > But best practice is to be as narrow as possible as I hope this will
> > eventually impact module autoloading and other details.
> 
> The amount of device specific knowledge is too little to tie it up to device
> ids, it is a generic PCI driver with quirks. We do not have a separate
> drivers for the hardware which requires quirks.

It provides its own capability structure exposed to userspace, that is
absolutely not a "quirk"

> And how do you hope this should impact autoloading?

I would like to autoload the most specific vfio driver for the target
hardware.

If you someday need to support new GPU HW that needs a different VFIO
driver then you are really stuck because things become indeterminate
if there are two drivers claiming the ID. We don't have the concept of
"best match"; the driver core works on exact match.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-11  2:00                   ` Jason Gunthorpe
@ 2021-03-11  7:54                     ` Alexey Kardashevskiy
  2021-03-11  9:44                       ` Max Gurtovoy
  2021-03-11 17:01                       ` Jason Gunthorpe
  0 siblings, 2 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2021-03-11  7:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Max Gurtovoy, alex.williamson, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch



On 11/03/2021 13:00, Jason Gunthorpe wrote:
> On Thu, Mar 11, 2021 at 12:42:56PM +1100, Alexey Kardashevskiy wrote:
>>>> btw can the id list have only vendor ids and not have device ids?
>>>
>>> The PCI matcher is quite flexible; see the other patch from Max for
>>> the igd.
>>   
>> ah cool, do this for NVIDIA GPUs then please, I just discovered another P9
>> system sold with NVIDIA T4s which is not in your list.
> 
> I think it will make things easier down the road if you maintain an
> exact list <shrug>


Then why do you not do an exact list for Intel IGD? The commit log does
not explain this detail.


>>> But best practice is to be as narrow as possible as I hope this will
>>> eventually impact module autoloading and other details.
>>
>> The amount of device specific knowledge is too little to tie it up to device
>> ids, it is a generic PCI driver with quirks. We do not have a separate
>> drivers for the hardware which requires quirks.
> 
> It provides its own capability structure exposed to userspace, that is
> absolutely not a "quirk"
> 
>> And how do you hope this should impact autoloading?
> 
> I would like to autoload the most specific vfio driver for the target
> hardware.


Is there an idea how it is going to work? For example, the Intel IGD 
driver and vfio-pci-igd - how should the system pick one? If there is no 
MODULE_DEVICE_TABLE in vfio-pci-xxx, is the user supposed to try binding 
all vfio-pci-xxx drivers until some binds?


> If you someday need to support new GPU HW that needs a different VFIO
> driver then you are really stuck because things become indeterminate
> if there are two devices claiming the ID. We don't have the concept of
> "best match", driver core works on exact match.



-- 
Alexey

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-11  7:54                     ` Alexey Kardashevskiy
@ 2021-03-11  9:44                       ` Max Gurtovoy
  2021-03-11 16:51                         ` Jason Gunthorpe
  2021-03-11 17:01                       ` Jason Gunthorpe
  1 sibling, 1 reply; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-11  9:44 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Jason Gunthorpe
  Cc: alex.williamson, cohuck, kvm, linux-kernel, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	cjia, yishaih, mjrosato, hch


On 3/11/2021 9:54 AM, Alexey Kardashevskiy wrote:
>
>
> On 11/03/2021 13:00, Jason Gunthorpe wrote:
>> On Thu, Mar 11, 2021 at 12:42:56PM +1100, Alexey Kardashevskiy wrote:
>>>>> btw can the id list have only vendor ids and not have device ids?
>>>>
>>>> The PCI matcher is quite flexible; see the other patch from Max for
>>>> the igd.
>>>   ah cool, do this for NVIDIA GPUs then please, I just discovered 
>>> another P9
>>> system sold with NVIDIA T4s which is not in your list.
>>
>> I think it will make things easier down the road if you maintain an
>> exact list <shrug>
>
>
> Then why do not you do the exact list for Intel IGD? The commit log 
> does not explain this detail.

I expect the Intel team to review this series and give a more precise list.

I did the best I could in finding a proper configuration for igd.


>
>
>>>> But best practice is to be as narrow as possible as I hope this will
>>>> eventually impact module autoloading and other details.
>>>
>>> The amount of device specific knowledge is too little to tie it up 
>>> to device
>>> ids, it is a generic PCI driver with quirks. We do not have a separate
>>> drivers for the hardware which requires quirks.
>>
>> It provides its own capability structure exposed to userspace, that is
>> absolutely not a "quirk"
>>
>>> And how do you hope this should impact autoloading?
>>
>> I would like to autoload the most specific vfio driver for the target
>> hardware.
>
>
> Is there an idea how it is going to work? For example, the Intel IGD 
> driver and vfio-pci-igd - how should the system pick one? If there is 
> no MODULE_DEVICE_TABLE in vfio-pci-xxx, is the user supposed to try 
> binding all vfio-pci-xxx drivers until some binds?

For example, in my local setup I did a POC patch that converts some
drivers to be "manual binding only" drivers.

So the IGD driver will have priority; the user will unbind the device
from it, load igd-vfio-pci, and bind the device to it, which ends with
probing.

For now we separated out the driver core stuff until we all agree that
this series is the right way to go and we make sure it's backward
compatible.

>
>
>> If you someday need to support new GPU HW that needs a different VFIO
>> driver then you are really stuck because things become indeterminate
>> if there are two devices claiming the ID. We don't have the concept of
>> "best match", driver core works on exact match.
>
>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver
  2021-03-10 12:31     ` Jason Gunthorpe
@ 2021-03-11 11:37       ` Christoph Hellwig
  2021-03-11 12:09         ` Max Gurtovoy
  2021-03-11 15:43         ` Jason Gunthorpe
  0 siblings, 2 replies; 53+ messages in thread
From: Christoph Hellwig @ 2021-03-11 11:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Max Gurtovoy, alex.williamson, cohuck, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato,
	aik

On Wed, Mar 10, 2021 at 08:31:27AM -0400, Jason Gunthorpe wrote:
> Yes, that needs more refactoring. I'm viewing this series as a
> "statement of intent" and once we commit to doing this we can go
> through the bigger effort to split up vfio_pci_core and tidy its API.
> 
> Obviously this is a big project, given the past comments I don't want
> to send more effort here until we see a community consensus emerge
> that this is what we want to do. If we build a sub-driver instead the
> work is all in the trash bin.

So my viewpoint here is that this work doesn't seem very useful for
the existing subdrivers given how much compat pain there is.  It
definitively is the right way to go for a new driver.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver
  2021-03-11 11:37       ` Christoph Hellwig
@ 2021-03-11 12:09         ` Max Gurtovoy
  2021-03-11 15:43         ` Jason Gunthorpe
  1 sibling, 0 replies; 53+ messages in thread
From: Max Gurtovoy @ 2021-03-11 12:09 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe
  Cc: alex.williamson, cohuck, kvm, linux-kernel, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	cjia, yishaih, mjrosato, aik


On 3/11/2021 1:37 PM, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 08:31:27AM -0400, Jason Gunthorpe wrote:
>> Yes, that needs more refactoring. I'm viewing this series as a
>> "statement of intent" and once we commit to doing this we can go
>> through the bigger effort to split up vfio_pci_core and tidy its API.
>>
>> Obviously this is a big project, given the past comments I don't want
>> to send more effort here until we see a community consensus emerge
>> that this is what we want to do. If we build a sub-driver instead the
>> work is all in the trash bin.
> So my viewpoint here is that this work doesn't seem very useful for
> the existing subdrivers given how much compat pain there is.  It
> definitively is the right way to go for a new driver.

This bring us back to the first series that introduced mlx5_vfio_pci 
driver without the igd, nvlink2 drivers.

If we leave the subdrivers/extensions in vfio_pci_core, it won't be
logically right.

If we put it in vfio_pci, we'll need to maintain it and extend it as new
functionality or bugs are reported.

If we create new drivers for these devices, we'll use the compat layer,
and hopefully after a few years these users will be using only
my_driver_vfio_pci and we'll be able to remove the compat layer (which is
not so big).

We tried almost all the options, and now we need to make progress and
agree on the design.

The effort is big, and I hope we won't continue with experiments without
a clear view of what exactly should be done.

So we need a plan for how Jason's series and my series can live together
and how we can start merging them gradually.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver
  2021-03-11 11:37       ` Christoph Hellwig
  2021-03-11 12:09         ` Max Gurtovoy
@ 2021-03-11 15:43         ` Jason Gunthorpe
  1 sibling, 0 replies; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-11 15:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Max Gurtovoy, alex.williamson, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, aik

On Thu, Mar 11, 2021 at 12:37:06PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 08:31:27AM -0400, Jason Gunthorpe wrote:
> > Yes, that needs more refactoring. I'm viewing this series as a
> > "statement of intent" and once we commit to doing this we can go
> > through the bigger effort to split up vfio_pci_core and tidy its API.
> > 
> > Obviously this is a big project, given the past comments I don't want
> > to send more effort here until we see a community consensus emerge
> > that this is what we want to do. If we build a sub-driver instead the
> > work is all in the trash bin.
> 
> So my viewpoint here is that this work doesn't seem very useful for
> the existing subdrivers given how much compat pain there is.  It
> definitively is the right way to go for a new driver.

Right, I don't think the three little drivers get much benefit; what
they are doing is giving some guidance on how to structure the vfio
pci core module. The reflck duplication you pointed at, for instance,
will be in the future drivers too.

What do you see as the most compat pain? 

Max made them full proper drivers, but we could go halfway and just
split the code like Max has done but remove the pci_driver and related
compat. It would statically link in as it does today but still be
essentially structured like a "new driver".

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-11  9:44                       ` Max Gurtovoy
@ 2021-03-11 16:51                         ` Jason Gunthorpe
  0 siblings, 0 replies; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-11 16:51 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Alexey Kardashevskiy, alex.williamson, cohuck, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch

On Thu, Mar 11, 2021 at 11:44:38AM +0200, Max Gurtovoy wrote:
> 
> On 3/11/2021 9:54 AM, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 11/03/2021 13:00, Jason Gunthorpe wrote:
> > > On Thu, Mar 11, 2021 at 12:42:56PM +1100, Alexey Kardashevskiy wrote:
> > > > > > btw can the id list have only vendor ids and not have device ids?
> > > > > 
> > > > > The PCI matcher is quite flexible; see the other patch from Max for
> > > > > the igd.
> > > >   ah cool, do this for NVIDIA GPUs then please, I just
> > > > discovered another P9
> > > > system sold with NVIDIA T4s which is not in your list.
> > > 
> > > I think it will make things easier down the road if you maintain an
> > > exact list <shrug>
> > 
> > 
> > Then why do not you do the exact list for Intel IGD? The commit log does
> > not explain this detail.
> 
> I expect Intel team to review this series and give a more precise list.
> 
> I did the best I could in finding a proper configuration for igd.

Right. Doing this retroactively is really hard.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-11  7:54                     ` Alexey Kardashevskiy
  2021-03-11  9:44                       ` Max Gurtovoy
@ 2021-03-11 17:01                       ` Jason Gunthorpe
  1 sibling, 0 replies; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-11 17:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Max Gurtovoy, alex.williamson, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch

On Thu, Mar 11, 2021 at 06:54:09PM +1100, Alexey Kardashevskiy wrote:

> Is there an idea how it is going to work? For example, the Intel IGD driver
> and vfio-pci-igd - how should the system pick one? If there is no
> MODULE_DEVICE_TABLE in vfio-pci-xxx, is the user supposed to try binding all
> vfio-pci-xxx drivers until some binds?

We must expose some MODULE_DEVICE_TABLE like thing to userspace.

Compiling everything into one driver and using if statements was only
manageable with these tiny drivers - the things that are coming are big
and infeasible to link directly into vfio_pci.ko.

I'm feeling some general consensus around this approach (vs trying to
make a subdriver) so we will start looking at exactly what form that
could take soon.

The general idea would be to have a selection of extended VFIO drivers
for PCI devices that can be loaded as an alternative to vfio-pci and
they provide additional uapi and behaviors that only work on specific
hardware. nvlink is a good example because it does provide new API and
additional HW specific behavior.

A way for userspace to learn about the drivers automatically and sort
out how to load and bind them.

I was thinking about your earlier question about FDT - do you think we
could switch this to a platform_device and provide an of_match_table
that would select correctly? Did IBM enforce a useful compatible
string in the DT for these things?

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-10 12:57     ` Max Gurtovoy
  2021-03-10 13:02       ` Jason Gunthorpe
  2021-03-10 14:19       ` Alexey Kardashevskiy
@ 2021-03-19 15:23       ` Alex Williamson
  2021-03-19 16:17         ` Jason Gunthorpe
  2 siblings, 1 reply; 53+ messages in thread
From: Alex Williamson @ 2021-03-19 15:23 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Alexey Kardashevskiy, jgg, cohuck, kvm, linux-kernel, liranl,
	oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch

On Wed, 10 Mar 2021 14:57:57 +0200
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote:
> > On 09/03/2021 19:33, Max Gurtovoy wrote:  
> >> +static const struct pci_device_id nvlink2gpu_vfio_pci_table[] = {
> >> +    { PCI_VDEVICE(NVIDIA, 0x1DB1) }, /* GV100GL-A NVIDIA Tesla 
> >> V100-SXM2-16GB */
> >> +    { PCI_VDEVICE(NVIDIA, 0x1DB5) }, /* GV100GL-A NVIDIA Tesla 
> >> V100-SXM2-32GB */
> >> +    { PCI_VDEVICE(NVIDIA, 0x1DB8) }, /* GV100GL-A NVIDIA Tesla 
> >> V100-SXM3-32GB */
> >> +    { PCI_VDEVICE(NVIDIA, 0x1DF5) }, /* GV100GL-B NVIDIA Tesla 
> >> V100-SXM2-16GB */  
> >
> >
> > Where is this list from?
> >
> > Also, how is this supposed to work at the boot time? Will the kernel 
> > try binding let's say this one and nouveau? Which one is going to win?  
> 
> At boot time nouveau driver will win since the vfio drivers don't 
> declare MODULE_DEVICE_TABLE

This still seems troublesome, AIUI the MODULE_DEVICE_TABLE is
responsible for creating aliases so that kmod can figure out which
modules to load, but what happens if all these vfio-pci modules are
built into the kernel or the modules are already loaded?

In the former case, I think it boils down to link order while the
latter is generally considered even less deterministic since it depends
on module load order.  So if one of these vfio modules should get
loaded before the native driver, I think devices could bind here first.

Are there tricks/extensions we could use in driver overrides, for
example maybe a compatibility alias such that one of these vfio-pci
variants could match "vfio-pci"?  Perhaps that, along with some sort of
priority scheme to probe variants ahead of the base driver, though I'm
not sure how we'd get these variants loaded without something like
module aliases.  I know we're trying to avoid creating another level of
driver matching, but that's essentially what we have in the compat
option enabled here, and I'm not sure I see how userspace makes the
leap to understand what driver to use for a given device.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 15:23       ` Alex Williamson
@ 2021-03-19 16:17         ` Jason Gunthorpe
  2021-03-19 16:20           ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 16:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Max Gurtovoy, Alexey Kardashevskiy, cohuck, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato, hch

On Fri, Mar 19, 2021 at 09:23:41AM -0600, Alex Williamson wrote:
> On Wed, 10 Mar 2021 14:57:57 +0200
> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote:
> > > On 09/03/2021 19:33, Max Gurtovoy wrote:  
> > >> +static const struct pci_device_id nvlink2gpu_vfio_pci_table[] = {
> > >> +    { PCI_VDEVICE(NVIDIA, 0x1DB1) }, /* GV100GL-A NVIDIA Tesla 
> > >> V100-SXM2-16GB */
> > >> +    { PCI_VDEVICE(NVIDIA, 0x1DB5) }, /* GV100GL-A NVIDIA Tesla 
> > >> V100-SXM2-32GB */
> > >> +    { PCI_VDEVICE(NVIDIA, 0x1DB8) }, /* GV100GL-A NVIDIA Tesla 
> > >> V100-SXM3-32GB */
> > >> +    { PCI_VDEVICE(NVIDIA, 0x1DF5) }, /* GV100GL-B NVIDIA Tesla 
> > >> V100-SXM2-16GB */  
> > >
> > >
> > > Where is this list from?
> > >
> > > Also, how is this supposed to work at the boot time? Will the kernel 
> > > try binding let's say this one and nouveau? Which one is going to win?  
> > 
> > At boot time nouveau driver will win since the vfio drivers don't 
> > declare MODULE_DEVICE_TABLE
> 
> This still seems troublesome, AIUI the MODULE_DEVICE_TABLE is
> responsible for creating aliases so that kmod can figure out which
> modules to load, but what happens if all these vfio-pci modules are
> built into the kernel or the modules are already loaded?

I think we talked about this.. We still need a better way to control
binding of VFIO modules - now that we have device-specific modules we
must have these match tables to control what devices they connect
to.

Previously things used the binding of vfio_pci as the "switch" and
hardcoded all the matches inside it.

I'm still keen to try the "driver flavour" idea I outlined earlier,
but it is hard to say what will resonate with Greg.

> In the former case, I think it boils down to link order while the
> latter is generally considered even less deterministic since it depends
> on module load order.  So if one of these vfio modules should get
> loaded before the native driver, I think devices could bind here first.

At this point - "don't link these statically", we could have a kconfig
to prevent it.

> Are there tricks/extensions we could use in driver overrides, for
> example maybe a compatibility alias such that one of these vfio-pci
> variants could match "vfio-pci"?

driver override is not really useful as soon as you have a match table
as its operation is to defeat the match table entirely. :(

Again, this is still more of a outline how things will look as we must
get through this before we can attempt to do something in the driver
core with Greg.

We could revise this series to not register drivers at all and keep
the uAPI view exactly as is today. This would allow enough code to
show Greg how some driver flavour thing would work.

If something can't be done in the driver core, I'd propose to keep the
same basic outline Max has here, but make registering the "compat"
dynamic - it is basically a sub-driver design at that point and we
give up on achieving module autoloading.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 16:17         ` Jason Gunthorpe
@ 2021-03-19 16:20           ` Christoph Hellwig
  2021-03-19 16:28             ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2021-03-19 16:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Max Gurtovoy, Alexey Kardashevskiy, cohuck, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato,
	hch

On Fri, Mar 19, 2021 at 01:17:22PM -0300, Jason Gunthorpe wrote:
> I think we talked about this.. We still need a better way to control
> binding of VFIO modules - now that we have device-specific modules we
> must have these match tables to control what devices they connect
> to.
> 
> Previously things used the binding of vfio_pci as the "switch" and
> hardcoded all the matches inside it.
> 
> I'm still keen to try the "driver flavour" idea I outlined earlier,
> but it is hard to say what will resonate with Greg.

IMHO the only model that really works and makes sense is to turn the
whole model around and make vfio a library called by the actual driver
for the device.  That is, any device that needs device-specific
functionality simply needs a proper in-kernel driver, which then can be
switched to a vfio mode where all the normal subsystems are unbound
from the device and VFIO functionality is bound to it, all while _the_
driver that controls the PCI ID is still in charge of it.

vfio_pci remains a separate driver not binding to any ID by default
and not having any device specific functionality.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 16:20           ` Christoph Hellwig
@ 2021-03-19 16:28             ` Jason Gunthorpe
  2021-03-19 16:34               ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 16:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Max Gurtovoy, Alexey Kardashevskiy, cohuck, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Fri, Mar 19, 2021 at 05:20:33PM +0100, Christoph Hellwig wrote:
> On Fri, Mar 19, 2021 at 01:17:22PM -0300, Jason Gunthorpe wrote:
> > I think we talked about this.. We still need a better way to control
> > binding of VFIO modules - now that we have device-specific modules we
> > must have these match tables to control what devices they connect
> > to.
> > 
> > Previously things used the binding of vfio_pci as the "switch" and
> > hardcoded all the matches inside it.
> > 
> > I'm still keen to try the "driver flavour" idea I outlined earlier,
> > but it is hard to say what will resonate with Greg.
> 
> IMHO the only model that really works and makes sense is to turn the
> whole model around and make vfio a library called by the actual driver
> for the device.  That is any device that needs device specific
> functionality simply needs a proper in-kernel driver, which then can be
> switched to a vfio mode where all the normal subsystems are unbound
> from the device and VFIO functionality is bound to it, all while _the_
> driver that controls the PCI ID is still in charge of it.

Yes, this is what I want to strive for with Greg.

It would also resolve alot of the uncomfortable code I see in VFIO
using the driver core. For instance, when a device is moved to 'vfio
mode' it can go through and *lock* the entire group of devices to
'vfio mode' or completely fail.

This would replace all the protective code that is all about ensuring
the admin doesn't improperly mix & match in-kernel and vfio drivers
within a security domain.

The wrinkle I don't yet have an easy answer to is how to load vfio_pci
as a universal "default" within the driver core lazy bind scheme and
still have working module autoloading... I'm hoping to get some
research into this..

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 16:28             ` Jason Gunthorpe
@ 2021-03-19 16:34               ` Christoph Hellwig
  2021-03-19 17:36                 ` Alex Williamson
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2021-03-19 16:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Alex Williamson, Max Gurtovoy,
	Alexey Kardashevskiy, cohuck, kvm, linux-kernel, liranl, oren,
	tzahio, leonro, yarong, aviadye, shahafs, artemp, kwankhede,
	ACurrid, cjia, yishaih, mjrosato

On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:
> The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> as a universal "default" within the driver core lazy bind scheme and
> still have working module autoloading... I'm hoping to get some
> research into this..

Should we even load it by default?  One answer would be that the sysfs
file to switch to vfio mode goes into the core PCI layer, and that core
PCI code would contain a hack^H^H^H^Hhook to first load and bind vfio_pci
for that device.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 16:34               ` Christoph Hellwig
@ 2021-03-19 17:36                 ` Alex Williamson
  2021-03-19 20:07                   ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Alex Williamson @ 2021-03-19 17:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Max Gurtovoy, Alexey Kardashevskiy, cohuck, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Fri, 19 Mar 2021 17:34:49 +0100
Christoph Hellwig <hch@lst.de> wrote:

> On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:
> > The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> > as a universal "default" within the driver core lazy bind scheme and
> > still have working module autoloading... I'm hoping to get some
> > research into this..  

What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
driver, which would load all the known variants in order to influence
the match, and therefore probe ordering?
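
Something like the following is what I have in mind; the variant module
names here are made up, just to show the mechanism:

#include <linux/module.h>

/* made-up variant names: ask modprobe to load them before vfio-pci
 * itself, so their pci_drivers register first and get the first chance
 * to match and probe a device */
MODULE_SOFTDEP("pre: vfio-pci-igd vfio-pci-nvlink2gpu vfio-pci-npu2");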

If we coupled that with wildcard support in driver_override, ex.
"vfio_pci*", and used consistent module naming, I think we'd only need
to teach userspace about this wildcard and binding to a specific module
would come for free.  This assumes we drop the per-variant id_table and
use the probe function to skip devices without the necessary
requirements, either wrong device or missing the tables we expect to
expose.

> Should we even load it by default?  One answer would be that the sysfs
> file to switch to vfio mode goes into the core PCI layer, and that core
> PCI code would contain a hack^H^H^H^Hhook to first load and bind vfio_pci
> for that device.

Generally we don't want to be the default driver for anything (I think
mdev devices are the exception).  Assignment to userspace or VM is a
niche use case.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 17:36                 ` Alex Williamson
@ 2021-03-19 20:07                   ` Jason Gunthorpe
  2021-03-19 21:08                     ` Alex Williamson
  2021-03-22 15:11                     ` Christoph Hellwig
  0 siblings, 2 replies; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 20:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Fri, Mar 19, 2021 at 11:36:42AM -0600, Alex Williamson wrote:
> On Fri, 19 Mar 2021 17:34:49 +0100
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:
> > > The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> > > as a universal "default" within the driver core lazy bind scheme and
> > > still have working module autoloading... I'm hoping to get some
> > > research into this..  
> 
> What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
> driver, which would load all the known variants in order to influence
> the match, and therefore probe ordering?

The way the driver core works is to first match against the already
loaded driver list, then trigger an event for module loading and when
new drivers are registered they bind to unbound devices.

So, the trouble is the event through userspace because the kernel
can't just go on to use vfio_pci until it knows userspace has failed
to satisfy the load request.

One answer is to have userspace udev have the "hook" here and when a
vfio flavour mod alias is requested on a PCI device it swaps in
vfio_pci if it can't find an alternative.

The dream would be a system with no vfio modules loaded could do some

 echo "vfio" > /sys/bus/pci/xxx/driver_flavour

And a module would be loaded and a struct vfio_device is created for
that device. Very easy for the user.

> If we coupled that with wildcard support in driver_override, ex.
> "vfio_pci*", and used consistent module naming, I think we'd only need
> to teach userspace about this wildcard and binding to a specific module
> would come for free.

What would the wildcard do?

> This assumes we drop the per-variant id_table and use the probe
> function to skip devices without the necessary requirements, either
> wrong device or missing the tables we expect to expose.

Without a module table how do we know which driver is which? 

Open coding a match table in probe() and returning failure feels hacky
to me.

> > Should we even load it by default?  One answer would be that the sysfs
> > file to switch to vfio mode goes into the core PCI layer, and that core
> > PCI code would contain a hack^H^H^H^Hhook to first load and bind vfio_pci
> > for that device.
> 
> Generally we don't want to be the default driver for anything (I think
> mdev devices are the exception).  Assignment to userspace or VM is a
> niche use case.  Thanks,

By "default" I mean if the user says device A is in "vfio" mode then
the kernel should
 - Search for a specific driver for this device and autoload it
 - If no specific driver is found then attach a default "universal"
   driver for it. vfio_pci is a universal driver.

vfio_platform is also a "universal" driver when in ACPI mode, in some
cases.

For the OF cases, vfio_platform builds its own little subsystem complete
with autoloading:

                request_module("vfio-reset:%s", vdev->compat);
                vdev->of_reset = vfio_platform_lookup_reset(vdev->compat,
                                                        &vdev->reset_module);

And it is a good example of why I don't like this subsystem design,
because vfio_platform doesn't do the driver loading for OF entirely
right: vdev->compat is a single string derived from the compatible
property:

        ret = device_property_read_string(dev, "compatible",
                                          &vdev->compat);
        if (ret)
                dev_err(dev, "Cannot retrieve compat for %s\n", vdev->name);

Unfortunately OF requires that compatible is a *list* of strings and a
correct driver is supposed to evaluate all of them. The driver core
does this all correctly, and this was lost when it was open coded
here.

We should NOT be avoiding the standard infrastructure for matching
drivers to devices by re-implementing it poorly.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 20:07                   ` Jason Gunthorpe
@ 2021-03-19 21:08                     ` Alex Williamson
  2021-03-19 22:59                       ` Jason Gunthorpe
  2021-03-22 15:11                     ` Christoph Hellwig
  1 sibling, 1 reply; 53+ messages in thread
From: Alex Williamson @ 2021-03-19 21:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Fri, 19 Mar 2021 17:07:49 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 19, 2021 at 11:36:42AM -0600, Alex Williamson wrote:
> > On Fri, 19 Mar 2021 17:34:49 +0100
> > Christoph Hellwig <hch@lst.de> wrote:
> >   
> > > On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:  
> > > > The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> > > > as a universal "default" within the driver core lazy bind scheme and
> > > > still have working module autoloading... I'm hoping to get some
> > > > research into this..    
> > 
> > What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
> > driver, which would load all the known variants in order to influence
> > the match, and therefore probe ordering?  
> 
> The way the driver core works is to first match against the already
> loaded driver list, then trigger an event for module loading and when
> new drivers are registered they bind to unbound devices.

The former is based on id_tables, the latter on MODULE_DEVICE_TABLE; we
don't have either of those.  As noted to Christoph, the cases where we
want a vfio driver to bind to anything automatically are the exception.
 
> So, the trouble is the event through userspace because the kernel
> can't just go on to use vfio_pci until it knows userspace has failed
> to satisfy the load request.

Given that we don't use MODULE_DEVICE_TABLE, vfio-pci doesn't autoload.
AFAIK, all tools like libvirt and driverctl that typically bind devices
to vfio-pci will manually load vfio-pci.  I think we can take advantage
of that.

> One answer is to have userspace udev have the "hook" here and when a
> vfio flavour mod alias is requested on a PCI device it swaps in
> vfio_pci if it can't find an alternative.
> 
> The dream would be a system with no vfio modules loaded could do some
> 
>  echo "vfio" > /sys/bus/pci/xxx/driver_flavour
> 
> And a module would be loaded and a struct vfio_device is created for
> that device. Very easy for the user.

This is like switching a device to a parallel universe where we do
want vfio drivers to bind automatically to devices.
 
> > If we coupled that with wildcard support in driver_override, ex.
> > "vfio_pci*", and used consistent module naming, I think we'd only need
> > to teach userspace about this wildcard and binding to a specific module
> > would come for free.  
> 
> What would the wildcard do?

It allows a driver_override to match more than one driver, not too
dissimilar to your driver_flavor above.  In this case it would match
all driver names starting with "vfio_pci".  For example if we had:

softdep vfio-pci pre: vfio-pci-foo vfio-pci-bar

Then we'd pre-seed the condition that drivers foo and bar precede the
base vfio-pci driver, each will match the device to the driver and have
an opportunity in their probe function to either claim or skip the
device.  Userspace could also set and exact driver_override, for
example if they want to force using the base vfio-pci driver or go
directly to a specific variant.
 
> > This assumes we drop the per-variant id_table and use the probe
> > function to skip devices without the necessary requirements, either
> > wrong device or missing the tables we expect to expose.  
> 
> Without a module table how do we know which driver is which? 
> 
> Open coding a match table in probe() and returning failure feels hacky
> to me.

How's it any different than Max's get_foo_vfio_pci_driver() that calls
pci_match_id() with an internal match table?  It seems a better fit for
the existing use cases, for example the IGD variant can use a single
line table to exclude all except Intel VGA class devices in its probe
callback, then test availability of the extra regions we'd expose,
otherwise return -ENODEV.  The NVLink variant can use pci_match_id() in
the probe callback to filter out anything other than NVIDIA VGA or 3D
accelerator class devices, then check for associated FDT table, or
return -ENODEV.  We already use the vfio_pci probe function to exclude
devices in the deny-list and non-endpoint devices.  Many drivers
clearly place implicit trust in their id_table, others don't.  In the
case of meta drivers, I think it's fair to make use of the latter
approach.
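
Roughly, the probe-time filtering I'm describing could look like this
(only a sketch with illustrative names and checks, not code from this
series):

#include <linux/pci.h>

static int igd_vfio_pci_probe(struct pci_dev *pdev,
                              const struct pci_device_id *id)
{
        /* only Intel VGA-class devices qualify for the IGD extras;
         * decline everything else with -ENODEV so the device falls
         * through to the next matching, less specific driver */
        if (pdev->vendor != PCI_VENDOR_ID_INTEL ||
            (pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
                return -ENODEV;

        /* ... also verify the OpRegion/host bridge config we want to
         * expose is present, then do the normal vfio-pci-core setup */
        return 0;
}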

> > > Should we even load it by default?  One answer would be that the sysfs
> > > file to switch to vfio mode goes into the core PCI layer, and that core
> > > PCI code would contain a hack^H^H^H^Hhook to first load and bind vfio_pci
> > > for that device.  
> > 
> > Generally we don't want to be the default driver for anything (I think
> > mdev devices are the exception).  Assignment to userspace or VM is a
> > niche use case.  Thanks,  
> 
> By "default" I mean if the user says device A is in "vfio" mode then
> the kernel should
>  - Search for a specific driver for this device and autoload it
>  - If no specific driver is found then attach a default "universal"
>    driver for it. vfio_pci is a universal driver.
> 
> vfio_platform is also a "universal" driver when in ACPI mode, in some
> cases.
> 
> For OF cases platform it builds its own little subsystem complete with
> autoloading:
> 
>                 request_module("vfio-reset:%s", vdev->compat);
>                 vdev->of_reset = vfio_platform_lookup_reset(vdev->compat,
>                                                         &vdev->reset_module);
> 
> And it is a good example of why I don't like this subsystem design
> because vfio_platform doesn't do the driver loading for OF entirely
> right, vdev->compat is a single string derived from the compatible
> property:
> 
>         ret = device_property_read_string(dev, "compatible",
>                                           &vdev->compat);
>         if (ret)
>                 dev_err(dev, "Cannot retrieve compat for %s\n", vdev->name);
> 
> Unfortunately OF requires that compatible is a *list* of strings and a
> correct driver is supposed to evaluate all of them. The driver core
> does this all correctly, and this was lost when it was open coded
> here.
> 
> We should NOT be avoiding the standard infrastructure for matching
> drivers to devices by re-implementing it poorly.

I take some blame for the request_module() behavior of vfio-platform,
but I think we're on the same page that we don't want to turn vfio-pci
into a nexus for loading variant drivers.  Whatever solution we use for
vfio-pci might translate to replacing that vfio-platform behavior.  As
above, I think it's possible to create that alternate universe of
driver matching with a simple wildcard and load ordering approach,
performing the more specific filtering in the probe callback with fall
through to the next matching driver.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 21:08                     ` Alex Williamson
@ 2021-03-19 22:59                       ` Jason Gunthorpe
  2021-03-20  4:40                         ` Alex Williamson
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-19 22:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Fri, Mar 19, 2021 at 03:08:09PM -0600, Alex Williamson wrote:
> On Fri, 19 Mar 2021 17:07:49 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Mar 19, 2021 at 11:36:42AM -0600, Alex Williamson wrote:
> > > On Fri, 19 Mar 2021 17:34:49 +0100
> > > Christoph Hellwig <hch@lst.de> wrote:
> > >   
> > > > On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:  
> > > > > The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> > > > > as a universal "default" within the driver core lazy bind scheme and
> > > > > still have working module autoloading... I'm hoping to get some
> > > > > research into this..    
> > > 
> > > What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
> > > driver, which would load all the known variants in order to influence
> > > the match, and therefore probe ordering?  
> > 
> > The way the driver core works is to first match against the already
> > loaded driver list, then trigger an event for module loading and when
> > new drivers are registered they bind to unbound devices.
> 
> The former is based on id_tables, the latter on MODULE_DEVICE_TABLE, we
> don't have either of those.

Well, today we don't, but Max here adds id_table's to the special
devices and a MODULE_DEVICE_TABLE would come too if we do the flavours
thing below.

My starting thinking is that everything should have these tables and
they should work properly..

> As noted to Christoph, the cases where we want a vfio driver to
> bind to anything automatically is the exception.

I agree vfio should not automatically claim devices, but once vfio is
told to claim a device everything thereafter should be
automatic.

> > One answer is to have userspace udev have the "hook" here and when a
> > vfio flavour mod alias is requested on a PCI device it swaps in
> > vfio_pci if it can't find an alternative.
> > 
> > The dream would be a system with no vfio modules loaded could do some
> > 
> >  echo "vfio" > /sys/bus/pci/xxx/driver_flavour
> > 
> > And a module would be loaded and a struct vfio_device is created for
> > that device. Very easy for the user.
> 
> This is like switching a device to a parallel universe where we do
> want vfio drivers to bind automatically to devices.

Yes.

If we do this I'd probably suggest that driver_override be bumped down
to some user compat and 'vfio > driver_override' would just set the
flavour.

As-is driver_override seems dangerous, as overriding the matching table
could surely allow root userspace to crash the machine. In situations
with trusted boot/signed modules this shouldn't be possible.

> > > If we coupled that with wildcard support in driver_override, ex.
> > > "vfio_pci*", and used consistent module naming, I think we'd only need
> > > to teach userspace about this wildcard and binding to a specific module
> > > would come for free.  
> > 
> > What would the wildcard do?
> 
> It allows a driver_override to match more than one driver, not too
> dissimilar to your driver_flavor above.  In this case it would match
> all driver names starting with "vfio_pci".  For example if we had:
> 
> softdep vfio-pci pre: vfio-pci-foo vfio-pci-bar
>
> Then we'd pre-seed the condition that drivers foo and bar precede the
> base vfio-pci driver, each will match the device to the driver and have
> an opportunity in their probe function to either claim or skip the
> device.  Userspace could also set and exact driver_override, for
> example if they want to force using the base vfio-pci driver or go
> directly to a specific variant.

Okay, I see. The problem is that this makes 'vfio-pci' monolithic; in
normal situations it will load *everything*.

While that might not seem too bad with these simple drivers, at least
the mlx5 migration driver will have a large dependency tree and pull
in lots of other modules. Even Max's sample from v1 pulls in mlx5_core.ko
and a bunch of other stuff in its orbit.

This is why I want to try for fine grained autoloading first. It
really is the elegant solution if we can work it out.

> > Open coding a match table in probe() and returning failure feels hacky
> > to me.
> 
> How's it any different than Max's get_foo_vfio_pci_driver() that calls
> pci_match_id() with an internal match table?  

Well, I think that is hacky too - but it is hacky only to service
userspace compatibility, so let's put that aside.

> It seems a better fit for the existing use cases, for example the
> IGD variant can use a single line table to exclude all except Intel
> VGA class devices in its probe callback, then test availability of
> the extra regions we'd expose, otherwise return -ENODEV.

I don't think we should over-focus on these two firmware-triggered
examples. I looked at the Intel GPU driver and it already only reads
the firmware thing for certain PCI IDs; we can absolutely generate a
narrow match table for it. The same is true for the NVIDIA GPU.

The fact this is hard or whatever is beside the point - future drivers
in this scheme should have exact match tables. 

The mlx5 sample is a good example, as it matches a very narrow NVMe
device that is properly labeled with a subvendor ID. It does not match
every NVMe device and then run code to figure it out. I think this is
the right thing to do as it is the only thing that would give us fine
grained module loading.
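
For illustration, that sort of narrow match with subsystem IDs would be
something like the following (the device and subsystem values here are
invented, not the actual sample's):

#include <linux/module.h>
#include <linux/pci.h>

/* invented IDs: match one exact device, further qualified by the
 * subsystem vendor/device so only the specially labeled variant binds
 * to this driver */
static const struct pci_device_id example_mlx5_vfio_pci_table[] = {
        { PCI_DEVICE_SUB(PCI_VENDOR_ID_MELLANOX, 0x6001,
                         PCI_VENDOR_ID_MELLANOX, 0x0001) },
        { 0, }
};
MODULE_DEVICE_TABLE(pci, example_mlx5_vfio_pci_table);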

Even so, I'm not *so* worried about "over matching" - if IGD or the
nvidia stuff load on a wide set of devices then they can just not
enable their extended stuff. It wastes some kernel memory, but it is
OK.

And if some driver *really* gets stuck here the true answer is to
improve the driver core match capability.

> devices in the deny-list and non-endpoint devices.  Many drivers
> clearly place implicit trust in their id_table, others don't.  In the
> case of meta drivers, I think it's fair to make use of the latter
> approach.

Well, AFAIK, the driver core doesn't have a 'try probe, if it fails
then try another driver' approach. One device, one driver. Am I
missing something?

I would prefer not to propose to Greg such a radical change to how
driver loading works..

I also think the softdep/implicit loading/ordering will not be
welcomed; it feels weird to me.

> > We should NOT be avoiding the standard infrastructure for matching
> > drivers to devices by re-implementing it poorly.
> 
> I take some blame for the request_module() behavior of vfio-platform,
> but I think we're on the same page that we don't want to turn
> vfio-pci

Okay, that's good, we can explore the driver core side and see what
could work to decide if we can do fine-grained loading or not.

The question now is how to stage all this work. It is too big to do
all in one shot - can we reform Max's series into something mergeable
without the driver core part? For instance, removing the
pci_driver and dedicated modules and only doing the compat
path? It would look the same from userspace, but the internals would
be split into a library mode.

The patch to add in the device driver would then be small enough to go
along with future driver core changes, and if that fails we still have
several fallback plans to use the librarized version.

> into a nexus for loading variant drivers.  Whatever solution we use for
> vfio-pci might translate to replacing that vfio-platform behavior.  

Yes, I'd want to see this too. It shows it is a general enough
idea. Greg has been big on asking for lots of users for driver core
changes, so it is good we have more things.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 22:59                       ` Jason Gunthorpe
@ 2021-03-20  4:40                         ` Alex Williamson
  2021-03-21 12:58                           ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Alex Williamson @ 2021-03-20  4:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Fri, 19 Mar 2021 19:59:43 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 19, 2021 at 03:08:09PM -0600, Alex Williamson wrote:
> > On Fri, 19 Mar 2021 17:07:49 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Fri, Mar 19, 2021 at 11:36:42AM -0600, Alex Williamson wrote:  
> > > > On Fri, 19 Mar 2021 17:34:49 +0100
> > > > Christoph Hellwig <hch@lst.de> wrote:
> > > >     
> > > > > On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:    
> > > > > > The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> > > > > > as a universal "default" within the driver core lazy bind scheme and
> > > > > > still have working module autoloading... I'm hoping to get some
> > > > > > research into this..      
> > > > 
> > > > What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
> > > > driver, which would load all the known variants in order to influence
> > > > the match, and therefore probe ordering?    
> > > 
> > > The way the driver core works is to first match against the already
> > > loaded driver list, then trigger an event for module loading and when
> > > new drivers are registered they bind to unbound devices.  
> > 
> > The former is based on id_tables, the latter on MODULE_DEVICE_TABLE, we
> > don't have either of those.  
> 
> Well, today we don't, but Max here adds id_table's to the special
> devices and a MODULE_DEVICE_TABLE would come too if we do the flavours
> thing below.

I think the id_tables are the wrong approach for IGD and NVLink
variants.
 
> My starting thinking is that everything should have these tables and
> they should work properly..

id_tables require ongoing maintenance whereas the existing variants
require only vendor + device class and some platform feature, like a
firmware or fdt table.  They're meant to only add extra regions to
vfio-pci base support, not extensively modify the device interface.
 
> > As noted to Christoph, the cases where we want a vfio driver to
> > bind to anything automatically is the exception.  
> 
> I agree vfio should not automatically claim devices, but once vfio is
> told to claim a device everything from there after should be
> automatic.
> 
> > > One answer is to have userspace udev have the "hook" here and when a
> > > vfio flavour mod alias is requested on a PCI device it swaps in
> > > vfio_pci if it can't find an alternative.
> > > 
> > > The dream would be a system with no vfio modules loaded could do some
> > > 
> > >  echo "vfio" > /sys/bus/pci/xxx/driver_flavour
> > > 
> > > And a module would be loaded and a struct vfio_device is created for
> > > that device. Very easy for the user.  
> > 
> > This is like switching a device to a parallel universe where we do
> > want vfio drivers to bind automatically to devices.  
> 
> Yes.
> 
> If we do this I'd probably suggest that driver_override be bumped down
> to some user compat and 'vfio > driver_override' would just set the
> flavour.
> 
> As-is driver_override seems dangerous as overriding the matching table
> could surely allow root userspace to crash the machine. In situations
> with trusted boot/signed modules this shouldn't be.

When we're dealing with meta-drivers that can bind to anything, we
shouldn't rely on the match, but should instead verify the driver is
appropriate in the probe callback.  Even without driver_override,
there's the new_id mechanism.  Either method allows the root user to
break driver binding.  Greg has previously stated something to the
effect that users get to keep all the pieces when they break something
by manipulating driver binding.
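
To make that concrete, a minimal sketch of such probe-time verification
for an IGD-style variant driver could look like the following (the
function name and checks are illustrative only, not code from this
series):

static int igd_vfio_pci_probe(struct pci_dev *pdev,
                              const struct pci_device_id *id)
{
        /*
         * Illustrative only: even with a wide match, the variant driver
         * verifies in probe that this is a device it actually extends
         * and otherwise bows out so another driver (e.g. the base
         * vfio-pci) can bind instead.
         */
        if (pdev->vendor != PCI_VENDOR_ID_INTEL ||
            (pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
                return -ENODEV;

        /* ... check the OpRegion/extra regions, then register with vfio ... */
        return 0;
}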

> > > > If we coupled that with wildcard support in driver_override, ex.
> > > > "vfio_pci*", and used consistent module naming, I think we'd only need
> > > > to teach userspace about this wildcard and binding to a specific module
> > > > would come for free.    
> > > 
> > > What would the wildcard do?  
> > 
> > It allows a driver_override to match more than one driver, not too
> > dissimilar to your driver_flavor above.  In this case it would match
> > all driver names starting with "vfio_pci".  For example if we had:
> > 
> > softdep vfio-pci pre: vfio-pci-foo vfio-pci-bar
> >
> > Then we'd pre-seed the condition that drivers foo and bar precede the
> > base vfio-pci driver, each will match the device to the driver and have
> > an opportunity in their probe function to either claim or skip the
> > device.  Userspace could also set and exact driver_override, for
> > example if they want to force using the base vfio-pci driver or go
> > directly to a specific variant.  
> 
> Okay, I see. The problem is that this makes 'vfio-pci' monolithic, in
> normal situations it will load *everything*.
> 
> While that might not seem too bad with these simple drivers, at least
> the mlx5 migration driver will have a large dependency tree and pull
> in lots of other modules. Even Max's sample from v1 pulls in mlx5_core.ko
> and a bunch of other stuff in its orbit.

Luckily the mlx5 driver doesn't need to be covered by compatibility
support, so we don't need to set a softdep for it and the module could
be named such that a wildcard driver_override of vfio_pci* shouldn't
logically include that driver.  Users can manually create their own
modprobe.d softdep entry if they'd like to include it.  Otherwise
userspace would need to know to bind to it specifically.
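
For example, a user-supplied opt-in entry of that sort might look like
this (the module name is hypothetical):

  # /etc/modprobe.d/vfio-pci-variants.conf
  softdep vfio-pci pre: mlx5-vfio-pci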
 
> This is why I want to try for fine grained autoloading first. It
> really is the elegant solution if we can work it out.

I just don't see how we create a manageable change to userspace.

> > > Open coding a match table in probe() and returning failure feels hacky
> > > to me.  
> > 
> > How's it any different than Max's get_foo_vfio_pci_driver() that calls
> > pci_match_id() with an internal match table?    
> 
> Well, I think that is hacky too - but it is hacky only to service user
> space compatibility so let's put that aside

I don't see that dropping incompatible devices in the probe function
rather than the match via id_table is necessarily a hack.  I think
driver-core explicitly supports this (see below).

> > It seems a better fit for the existing use cases, for example the
> > IGD variant can use a single line table to exclude all except Intel
> > VGA class devices in its probe callback, then test availability of
> > the extra regions we'd expose, otherwise return -ENODEV.  
> 
> I don't think we should over-focus on these two firmware triggered
> examples. I looked at the Intel GPU driver and it already only reads
> the firmware thing for certain PCI ID's, we can absolutely generate a
> narrow match table for it. Same is true for the NVIDIA GPU.

I'm not sure we can make this assertion, both only care about the type
of device and existence of associated firmware tables.  No PCI IDs are
currently involved.

> The fact this is hard or whatever is beside the point - future drivers
> in this scheme should have exact match tables. 
> 
> The mlx5 sample is a good example, as it matches a very narrow NVMe
> device that is properly labeled with a subvendor ID. It does not match
> every NVMe device and then run code to figure it out. I think this is
> the right thing to do as it is the only thing that would give us fine
> grained module loading.

Sounds like the right thing to do for that device, if it's only designed
to run in this framework.  That's more like the mdev device model.

> Even so, I'm not *so* worried about "over matching" - if IGD or the
> nvidia stuff load on a wide set of devices then they can just not
> enable their extended stuff. It wastes some kernel memory, but it is
> OK.

I'd rather they bind to the base vfio-pci driver if their extended
features are not available.

> And if some driver *really* gets stuck here the true answer is to
> improve the driver core match capability.
> 
> > devices in the deny-list and non-endpoint devices.  Many drivers
> > clearly place implicit trust in their id_table, others don't.  In the
> > case of meta drivers, I think it's fair to make use of the latter
> > approach.  
> 
> Well, AFAIK, the driver core doesn't have a 'try probe, if it fails
> then try another driver' approach. One device, one driver. Am I
> missing something?

If the driver probe callback fails, really_probe() returns 0 with the
comment:

        /*
         * Ignore errors returned by ->probe so that the next driver can try
         * its luck.
         */
        ret = 0;

That allows bus_for_each_drv() to continue to iterate.

> 
> I would prefer not to propose to Greg such a radical change to how
> driver loading works..

Seems to be how it works already.
 
> I also think the softdep/implicit loading/ordering will not be
> welcomed, it feels weird to me.

AFAICT, it works within the existing driver-core, it's largely an
extension to pci-core driver_override support to enable wildcard
matching, ideally along with adding the same for all buses that support
driver_override.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-20  4:40                         ` Alex Williamson
@ 2021-03-21 12:58                           ` Jason Gunthorpe
  2021-03-22 16:40                             ` Alex Williamson
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-21 12:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Fri, Mar 19, 2021 at 10:40:28PM -0600, Alex Williamson wrote:

> > Well, today we don't, but Max here adds id_table's to the special
> > devices and a MODULE_DEVICE_TABLE would come too if we do the flavours
> > thing below.
> 
> I think the id_tables are the wrong approach for IGD and NVLink
> variants.

I really disagree with this. Checking for some random bits in firmware
and assuming that every device made forever into the future works with
this check is not a good way to do compatibility. Christoph made the
same point.

We have good processes to maintain id tables, I don't see this as a
problem.

> > As-is driver_override seems dangerous as overriding the matching table
> > could surely allow root userspace to crash the machine. In situations
> > with trusted boot/signed modules this shouldn't be.
> 
> When we're dealing with meta-drivers that can bind to anything, we
> shouldn't rely on the match, but should instead verify the driver is
> appropriate in the probe callback.  Even without driver_override,
> there's the new_id mechanism.  Either method allows the root user to
> break driver binding.  Greg has previously stated something to the
> effect that users get to keep all the pieces when they break something
> by manipulating driver binding.

Yes, but that is a view where root is allowed to break the kernel, we
now have this optional other world where that is not allowed and root
access to lots of dangerous things is now disabled.

new_id and driver_override should probably be in that disable list
too..

> > While that might not seem too bad with these simple drivers, at least
> > the mlx5 migration driver will have a large dependency tree and pull
> > in lots of other modules. Even Max's sample from v1 pulls in mlx5_core.ko
> > and a bunch of other stuff in its orbit.
> 
> Luckily the mlx5 driver doesn't need to be covered by compatibility
> support, so we don't need to set a softdep for it and the module could
> be named such that a wildcard driver_override of vfio_pci* shouldn't
> logically include that driver.  Users can manually create their own
> modprobe.d softdep entry if they'd like to include it.  Otherwise
> userspace would need to know to bind to it specifically.

But now you are giving up on the whole point, which was to
automatically load the correct specific module without special admin
involvement!

> > This is why I want to try for fine grained autoloading first. It
> > really is the elegant solution if we can work it out.
> 
> I just don't see how we create a manageable change to userspace.

I'm not sure I understand. Even if we add a new sysfs to set some
flavour then that is a pretty trivial change for userspace to move
from driver_override?

> > I don't think we should over-focus on these two firmware triggered
> > examples. I looked at the Intel GPU driver and it already only reads
> > the firmware thing for certain PCI ID's, we can absolutely generate a
> > narrow match table for it. Same is true for the NVIDIA GPU.
> 
> I'm not sure we can make this assertion, both only care about the type
> of device and existence of associated firmware tables.  

Well, I read through the Intel GPU driver and this is how I felt it
works. It doesn't even check the firmware bit unless certain PCI IDs
are matched first.

For NVIDIA GPU Max checked internally and we saw it looks very much
like how Intel GPU works. Only some PCI IDs trigger checking on the
feature the firmware thing is linked to.

My point is: the actual *drivers* consuming these firmware features do
*not* blindly match every PCI device and check for the firmware
bit. They all have narrow matches and further only try to use the
firmware thing for some subset of PCI IDs that the entire driver
supports.

Given that the actual drivers work this way there is no technical
reason vfio-pci can't do this as well.

We don't have to change them of course, they can stay as is if people
feel really strongly.

> > Even so, I'm not *so* worried about "over matching" - if IGD or the
> > nvidia stuff load on a wide set of devices then they can just not
> > enable their extended stuff. It wastes some kernel memory, but it is
> > OK.
> 
> I'd rather they bind to the base vfio-pci driver if their extended
> features are not available.

Sure it would be nice, but functionally it is no different.

> > And if some driver *really* gets stuck here the true answer is to
> > improve the driver core match capability.
> > 
> > > devices in the deny-list and non-endpoint devices.  Many drivers
> > > clearly place implicit trust in their id_table, others don't.  In the
> > > case of meta drivers, I think it's fair to make use of the latter
> > > approach.  
> > 
> > Well, AFAIK, the driver core doesn't have a 'try probe, if it fails
> > then try another driver' approach. One device, one driver. Am I
> > missing something?
> 
> If the driver probe callback fails, really_probe() returns 0 with the
> comment:
> 
>         /*
>          * Ignore errors returned by ->probe so that the next driver can try
>          * its luck.
>          */
>         ret = 0;
> 
> That allows bus_for_each_drv() to continue to iterate.

Er, but we have no reliable way to order drivers in the list so this
still assumes the system has exactly one driver match (even if some of
the match is now in code).

It won't work with a "universal" driver without more changes.

(and I couldn't find out why Cornelia added this long ago, or how or
even if it actually ended up being used)

> > I also think the softdep/implicit loading/ordering will not be
> > welcomed, it feels weird to me.
> 
> AFAICT, it works within the existing driver-core, it's largely an
> extension to pci-core driver_override support to enable wildcard
> matching, ideally along with adding the same for all buses that support
> driver_override.  Thanks,

It is the implicit ordering of module loading that is trouble.

Regards,
Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-19 20:07                   ` Jason Gunthorpe
  2021-03-19 21:08                     ` Alex Williamson
@ 2021-03-22 15:11                     ` Christoph Hellwig
  2021-03-22 16:44                       ` Jason Gunthorpe
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2021-03-22 15:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Christoph Hellwig, Max Gurtovoy,
	Alexey Kardashevskiy, cohuck, kvm, linux-kernel, liranl, oren,
	tzahio, leonro, yarong, aviadye, shahafs, artemp, kwankhede,
	ACurrid, cjia, yishaih, mjrosato

On Fri, Mar 19, 2021 at 05:07:49PM -0300, Jason Gunthorpe wrote:
> The way the driver core works is to first match against the already
> loaded driver list, then trigger an event for module loading and when
> new drivers are registered they bind to unbound devices.
> 
> So, the trouble is the event through userspace because the kernel
> can't just go on to use vfio_pci until it knows userspace has failed
> to satisfy the load request.
> 
> One answer is to have userspace udev have the "hook" here and when a
> vfio flavour mod alias is requested on a PCI device it swaps in
> vfio_pci if it can't find an alternative.
> 
> The dream would be a system with no vfio modules loaded could do some
> 
>  echo "vfio" > /sys/bus/pci/xxx/driver_flavour
> 
> And a module would be loaded and a struct vfio_device is created for
> that device. Very easy for the user.

Maybe I did not communicate my suggestion last week very well.  My
idea is that there are no different pci_drivers vs vfio or not,
but different personalities of the same driver.

So the interface would still look somewhat like your suggestion above,
although I'd prefer something like:

   echo 1 > /sys/bus/pci/xxx/use_vfio

What would the flow look like for the various cases?

 a) if a driver is bound and it supports the enable_vfio method, that
    method is called and everything is controlled by the driver, which
    uses symbols exported from vfio/vfio_pci to implement the functionality
 b) if a driver is bound but does not support the enable_vfio method,
    it is unbound and vfio_pci is bound instead; continue at c)
 c) use the normal current vfio flow

do the reverse on a

echo 0 > /sys/bus/pci/xxx/use_vfio
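
As a very rough sketch of how the store side of such an attribute could
tie the three cases together (enable_vfio is a proposed callback that
does not exist in struct pci_driver today; everything below is
hypothetical):

static ssize_t use_vfio_store(struct device *dev,
                              struct device_attribute *attr,
                              const char *buf, size_t count)
{
        struct pci_dev *pdev = to_pci_dev(dev);
        struct pci_driver *drv = pdev->driver;
        bool enable;

        if (kstrtobool(buf, &enable))
                return -EINVAL;

        if (!enable)                    /* "echo 0": reverse the flow */
                return count;

        if (drv && drv->enable_vfio)    /* a) bound driver owns vfio mode */
                return drv->enable_vfio(pdev) ? : count;

        if (drv)                        /* b) unbind so vfio_pci can bind */
                device_release_driver(dev);

        /* c) the normal current vfio flow on the now-unbound device */
        return count;
}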

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-21 12:58                           ` Jason Gunthorpe
@ 2021-03-22 16:40                             ` Alex Williamson
  2021-03-23 19:32                               ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Alex Williamson @ 2021-03-22 16:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Sun, 21 Mar 2021 09:58:18 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 19, 2021 at 10:40:28PM -0600, Alex Williamson wrote:
> 
> > > Well, today we don't, but Max here adds id_table's to the special
> > > devices and a MODULE_DEVICE_TABLE would come too if we do the flavours
> > > thing below.  
> > 
> > I think the id_tables are the wrong approach for IGD and NVLink
> > variants.  
> 
> I really disagree with this. Checking for some random bits in firmware
> and assuming that every device made forever into the future works with
> this check is not a good way to do compatibility. Christoph made the
> same point.
> 
> We have good processes to maintain id tables, I don't see this as a
> problem.

The base driver we're discussing here is a meta-driver that binds to
any PCI endpoint as directed by the user.  There is no id_table.  There
can't be any id_table unless you're expecting every device vendor to
submit the exact subset of devices they have tested and condone usage
with this interface.  The IGD extensions here only extend that
interface by providing userspace read-only access to a few additional
pieces of information that we've found to be necessary for certain
userspace drivers.  The actual device interface is unchanged.  In the
case of the NVLink extensions, AIUI these are mostly extensions of a
firmware defined interface for managing aspects of the interconnect to
the device.  It is actually the "random bits in firmware" that we want
to expose, the ID of the device is somewhat tangential, we just only
look for those firmware extensions in association to certain vendor
devices.

Of course if you start looking at features like migration support,
that's more than likely not simply an additional region with optional
information, it would need to interact with the actual state of the
device.  For those, I would very much support use of a specific
id_table.  That's not these.

> > > As-is driver_override seems dangerous as overriding the matching table
> > > could surely allow root userspace to crash the machine. In situations
> > > with trusted boot/signed modules this shouldn't be.  
> > 
> > When we're dealing with meta-drivers that can bind to anything, we
> > shouldn't rely on the match, but should instead verify the driver is
> > appropriate in the probe callback.  Even without driver_override,
> > there's the new_id mechanism.  Either method allows the root user to
> > break driver binding.  Greg has previously stated something to the
> > effect that users get to keep all the pieces when they break something
> > by manipulating driver binding.  
> 
> Yes, but that is a view where root is allowed to break the kernel, we
> now have this optional other world where that is not allowed and root
> access to lots of dangerous things is now disabled.
> 
> new_id and driver_override should probably be in that disable list
> too..

We don't have this other world yet, nor is it clear that we will have
it.  What sort of id_table is the base vfio-pci driver expected to use?
There's always a risk that hardware doesn't adhere to the spec or that
platform firmware might escalate an error that we'd otherwise consider
mundane from a userspace driver.

> > > While that might not seem too bad with these simple drivers, at least
> > > the mlx5 migration driver will have a large dependency tree and pull
> > > in lots of other modules. Even Max's sample from v1 pulls in mlx5_core.ko
> > > and a bunch of other stuff in its orbit.  
> > 
> > Luckily the mlx5 driver doesn't need to be covered by compatibility
> > support, so we don't need to set a softdep for it and the module could
> > be named such that a wildcard driver_override of vfio_pci* shouldn't
> > logically include that driver.  Users can manually create their own
> > modprobe.d softdep entry if they'd like to include it.  Otherwise
> > userspace would need to know to bind to it specifically.  
> 
> But now you are giving up on the whole point, which was to
> automatically load the correct specific module without special admin
> involvement!

This series only exposed a temporary compatibility interface to provide
that anyway.  As I understood it, the long term solution was that
userspace would somehow learn which driver to use for which device.
That "somehow" isn't clear to me.

> > > This is why I want to try for fine grained autoloading first. It
> > > really is the elegant solution if we can work it out.  
> > 
> > I just don't see how we create a manageable change to userspace.  
> 
> I'm not sure I understand. Even if we add a new sysfs to set some
> flavour then that is a pretty trivial change for userspace to move
> from driver_override?

Perhaps for some definition of trivial that I'm not familiar with.
We're talking about changing libvirt and driverctl and every distro and
user that's created a custom script outside of those.  Even changing
from "vfio-pci" to "vfio-pci*" is a hurdle.

> > > I don't think we should over-focus on these two firmware triggered
> > > examples. I looked at the Intel GPU driver and it already only reads
> > > the firmware thing for certain PCI ID's, we can absolutely generate a
> > > narrow match table for it. Same is true for the NVIDIA GPU.  
> > 
> > I'm not sure we can make this assertion, both only care about the type
> > of device and existence of associated firmware tables.    
> 
> Well, I read through the Intel GPU driver and this is how I felt it
> works. It doesn't even check the firmware bit unless certain PCI IDs
> are matched first.

The IDs being only the PCI vendor ID and class code.  The entire IGD
extension is only meant to expose a vendor specific, graphics related
firmware table and collateral config space, so of course we'd restrict
it to that vendor for a graphics class device in approximately the
right location in the system.  There's a big difference between that
and a fixed id_table.
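
Expressed as a pci_device_id entry, that vendor-plus-class check would
look roughly like this (a sketch of the existing behaviour, not a
proposal for a fixed table):

static const struct pci_device_id igd_class_match[] = {
        { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_ANY_ID),
          .class = PCI_CLASS_DISPLAY_VGA << 8,
          .class_mask = 0xffff00 },             /* any prog-if */
        { }
};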
 
> For NVIDIA GPU Max checked internally and we saw it looks very much
> like how Intel GPU works. Only some PCI IDs trigger checking on the
> feature the firmware thing is linked to.

And as Alexey noted, the table came up incomplete.  But also those same
devices exist on platforms where this extension is completely
irrelevant.

> My point is: the actual *drivers* consuming these firmware features do
> *not* blindly match every PCI device and check for the firmware
> bit. They all have narrow matches and further only try to use the
> firmware thing for some subset of PCI IDs that the entire driver
> supports.

So because we don't check for an Intel specific graphics firmware table
when binding to Realtek NIC, we can leap to the conclusion that there
must be a concise id_table we can create for IGD support?

> Given that the actual drivers work this way there is no technical
> reason vfio-pci can't do this as well.

There's a giant assumption above that I'm missing.  Are you expecting
that vendors are actually going to keep up with submitting device IDs
that they claim to have tested and support with vfio-pci and all other
devices won't be allowed to bind?  That would single handedly destroy
any non-enterprise use cases of vfio-pci.

> We don't have to change them of course, they can stay as is if people
> feel really strongly.
> 
> > > Even so, I'm not *so* worried about "over matching" - if IGD or the
> > > nvidia stuff load on a wide set of devices then they can just not
> > > enable their extended stuff. It wastes some kernel memory, but it is
> > > OK.  
> > 
> > I'd rather they bind to the base vfio-pci driver if their extended
> > features are not available.  
> 
> Sure it would be nice, but functionally it is no different.

Exactly, the device interface is not changed, so why is it such a
heinous misstep that we should test for the feature we're trying to
expose rather than a specific ID and fall through if we don't find it?

> > > And if some driver *really* gets stuck here the true answer is to
> > > improve the driver core match capability.
> > >   
> > > > devices in the deny-list and non-endpoint devices.  Many drivers
> > > > clearly place implicit trust in their id_table, others don't.  In the
> > > > case of meta drivers, I think it's fair to make use of the latter
> > > > approach.    
> > > 
> > > Well, AFAIK, the driver core doesn't have a 'try probe, if it fails
> > > then try another driver' approach. One device, one driver. Am I
> > > missing something?  
> > 
> > If the driver probe callback fails, really_probe() returns 0 with the
> > comment:
> > 
> >         /*
> >          * Ignore errors returned by ->probe so that the next driver can try
> >          * its luck.
> >          */
> >         ret = 0;
> > 
> > That allows bus_for_each_drv() to continue to iterate.  
> 
> Er, but we have no reliable way to order drivers in the list so this
> still assumes the system has exactly one driver match (even if some of
> the match is now in code).
> 
> It won't work with a "universal" driver without more changes.
> 
> (and I couldn't find out why Cornelia added this long ago, or how or
> even if it actually ended up being used)

You'd need to go further back than Conny touching it, the original
import into git had:

void driver_attach(struct device_driver * drv)
{
        struct bus_type * bus = drv->bus;
        struct list_head * entry;
        int error;

        if (!bus->match)
                return;

        list_for_each(entry, &bus->devices.list) {
                struct device * dev = container_of(entry, struct device, bus_list);
                if (!dev->driver) {
                        error = driver_probe_device(drv, dev);
                        if (error && (error != -ENODEV))
                                /* driver matched but the probe failed */
                                printk(KERN_WARNING
                                    "%s: probe of %s failed with error %d\n",
                                    drv->name, dev->bus_id, error);
                }
        }
}

So unless you want to do some bitkeeper archaeology, we've always
allowed driver probes to fail and fall through to the next one, not
even complaining with -ENODEV.  In practice it hasn't been an issue
because how many drivers do you expect to have that would even try to
claim a device.  Ordering is only important when there's a catch-all so
we need to figure out how to make that last among a class of drivers
that will attempt to claim a device.  The softdep is a bit of a hack to
do that, I'll admit, but I don't see how the alternate driver flavor
universe solves having a catch-all either.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-22 15:11                     ` Christoph Hellwig
@ 2021-03-22 16:44                       ` Jason Gunthorpe
  2021-03-23 13:17                         ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-22 16:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Max Gurtovoy, Alexey Kardashevskiy, cohuck, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Mon, Mar 22, 2021 at 04:11:25PM +0100, Christoph Hellwig wrote:
> On Fri, Mar 19, 2021 at 05:07:49PM -0300, Jason Gunthorpe wrote:
> > The way the driver core works is to first match against the already
> > loaded driver list, then trigger an event for module loading and when
> > new drivers are registered they bind to unbound devices.
> > 
> > So, the trouble is the event through userspace because the kernel
> > can't just go on to use vfio_pci until it knows userspace has failed
> > to satisfy the load request.
> > 
> > One answer is to have userspace udev have the "hook" here and when a
> > vfio flavour mod alias is requested on a PCI device it swaps in
> > vfio_pci if it can't find an alternative.
> > 
> > The dream would be a system with no vfio modules loaded could do some
> > 
> >  echo "vfio" > /sys/bus/pci/xxx/driver_flavour
> > 
> > And a module would be loaded and a struct vfio_device is created for
> > that device. Very easy for the user.
> 
> Maybe I did not communicate my suggestion last week very well.  My
> idea is that there are no different pci_drivers vs vfio or not,
> but different personalities of the same driver.

This isn't quite the scenario that needs solving. Lets go back to
Max's V1 posting:

The mlx5_vfio_pci.c pci_driver matches this:

+	{ PCI_DEVICE_SUB(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042,
+			 PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID) }, /* Virtio SNAP controllers */

This overlaps with the match table in
drivers/virtio/virtio_pci_common.c:

        { PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },

So, if we do as you propose we have to add something mellanox specific
to virtio_pci_common which seems to me to just repeating this whole
problem except in more drivers.

The general thing that is happening is people are adding VM
migration capability to existing standard PCI interfaces like VFIO,
NVMe, etc

At least in this mlx5 situation the PF driver provides the HW access
to do the migration and the vfio mlx5 driver provides all the protocol
and machinery specific to the PCI standard being migrated. They are
all a little different.

But you could imagine some other implementation where the VF might
have an extra non-standard BAR that is the migration control.

This is why I like having a full stand alone pci_driver as everyone
implementing this can provide the vfio_device that is appropriate for
the HW.
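
Shaped as code, such a stand-alone variant driver is just a normal
pci_driver with a narrow match (reusing the entry quoted above); the
probe body is a placeholder, not the actual API of this series:

static const struct pci_device_id mlx5_vfio_pci_table[] = {
        { PCI_DEVICE_SUB(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042,
                         PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID) },
        { }
};
MODULE_DEVICE_TABLE(pci, mlx5_vfio_pci_table);

static int mlx5_vfio_pci_probe(struct pci_dev *pdev,
                               const struct pci_device_id *id)
{
        /*
         * Placeholder: allocate the vendor-specific vfio_device, wire up
         * the migration machinery, then hand it to the vfio-pci core
         * library to register with VFIO.
         */
        return 0;
}

static struct pci_driver mlx5_vfio_pci_driver = {
        .name     = "mlx5-vfio-pci",
        .id_table = mlx5_vfio_pci_table,
        .probe    = mlx5_vfio_pci_probe,
};
module_pci_driver(mlx5_vfio_pci_driver);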

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-22 16:44                       ` Jason Gunthorpe
@ 2021-03-23 13:17                         ` Christoph Hellwig
  2021-03-23 13:42                           ` Jason Gunthorpe
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2021-03-23 13:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Alex Williamson, Max Gurtovoy,
	Alexey Kardashevskiy, cohuck, kvm, linux-kernel, liranl, oren,
	tzahio, leonro, yarong, aviadye, shahafs, artemp, kwankhede,
	ACurrid, cjia, yishaih, mjrosato

On Mon, Mar 22, 2021 at 01:44:11PM -0300, Jason Gunthorpe wrote:
> This isn't quite the scenario that needs solving. Lets go back to
> Max's V1 posting:
> 
> The mlx5_vfio_pci.c pci_driver matches this:
> 
> +	{ PCI_DEVICE_SUB(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042,
> +			 PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID) }, /* Virtio SNAP controllers */
> 
> This overlaps with the match table in
> drivers/virtio/virtio_pci_common.c:
> 
>         { PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> 
> So, if we do as you propose we have to add something mellanox specific
> to virtio_pci_common which seems to me to just repeating this whole
> problem except in more drivers.

Oh, yikes.  

> The general thing that is happening is people are adding VM
> migration capability to existing standard PCI interfaces like VFIO,
> NVMe, etc

Well, if a migration capability is added to virtio (or NVMe) it should
be standardized and not vendor specific.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-23 13:17                         ` Christoph Hellwig
@ 2021-03-23 13:42                           ` Jason Gunthorpe
  0 siblings, 0 replies; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 13:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Max Gurtovoy, Alexey Kardashevskiy, cohuck, kvm,
	linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Tue, Mar 23, 2021 at 02:17:09PM +0100, Christoph Hellwig wrote:
> On Mon, Mar 22, 2021 at 01:44:11PM -0300, Jason Gunthorpe wrote:
> > This isn't quite the scenario that needs solving. Lets go back to
> > Max's V1 posting:
> > 
> > The mlx5_vfio_pci.c pci_driver matches this:
> > 
> > +	{ PCI_DEVICE_SUB(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042,
> > +			 PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID) }, /* Virtio SNAP controllers */
> > 
> > This overlaps with the match table in
> > drivers/virtio/virtio_pci_common.c:
> > 
> >         { PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
> > 
> > So, if we do as you propose we have to add something mellanox specific
> > to virtio_pci_common which seems to me to just repeating this whole
> > problem except in more drivers.
> 
> Oh, yikes.  

This is why I keep saying it is a VFIO driver - it has no relation to
the normal kernel drivers on the hypervisor. Even loading a normal
kernel driver and switching to a VFIO mode would be unacceptably
slow/disruptive.

The goal is to go directly to a VFIO mode driver with PCI driver auto
probing disabled to avoid attaching a regular driver. Big servers will
have 1000's of these things.

> > The general thing that is happening is people are adding VM
> > migration capability to existing standard PCI interfaces like VFIO,
> > NVMe, etc
> 
> Well, if a migration capability is added to virtio (or NVMe) it should
> be standardized and not vendor specific.

It would be nice, but it would be a challenging standard to write.

I think the industry is still in the pre-standards mode of trying to
even figure out how this stuff should work.

IMHO PCI sig needs to tackle a big part of this as we can't embed any
migration controls in the VF itself, it has to be secure for only
hypervisor use.

What we've got now is a Linux standard in VFIO where the uAPI to
manage migration is multi-vendor and we want to plug drivers into
that.

If in a few years the industry also develops HW standards then I
imagine using the same mechanism to plug in these standards based
implementation.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-22 16:40                             ` Alex Williamson
@ 2021-03-23 19:32                               ` Jason Gunthorpe
  2021-03-24  2:39                                 ` Alexey Kardashevskiy
  2021-03-29 23:10                                 ` Alex Williamson
  0 siblings, 2 replies; 53+ messages in thread
From: Jason Gunthorpe @ 2021-03-23 19:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Mon, Mar 22, 2021 at 10:40:16AM -0600, Alex Williamson wrote:

> Of course if you start looking at features like migration support,
> that's more than likely not simply an additional region with optional
> information, it would need to interact with the actual state of the
> device.  For those, I would very much support use of a specific
> id_table.  That's not these.

What I don't understand is why do we need two different ways of
inserting vendor code?

> > new_id and driver_override should probably be in that disable list
> > too..
> 
> We don't have this other world yet, nor is it clear that we will have
> it.

We do today, it is obscure, but there is a whole set of config options
designed to disable the unsafe kernel features. Kernels booted with
secure boot and signed modules tend to enable a lot of them, for
instance. The people working on the IMA stuff tend to enable a lot
more as you can defeat the purpose of IMA if you can hijack the
kernel.

> What sort of id_table is the base vfio-pci driver expected to use?

If it has a match table it would be all match, this is why I called it
a "universal driver"

If we have a flavour then the flavour controls the activation of
VFIO, not new_id or driver_override, and in vfio flavour mode we can
have an all match table, if we can resolve how to choose between two
drivers with overlapping matches.

> > > > This is why I want to try for fine grained autoloading first. It
> > > > really is the elegant solution if we can work it out.  
> > > 
> > > I just don't see how we create a manageable change to userspace.  
> > 
> > I'm not sure I understand. Even if we add a new sysfs to set some
> > flavour then that is a pretty trivial change for userspace to move
> > from driver_override?
> 
> Perhaps for some definition of trivial that I'm not familiar with.
> We're talking about changing libvirt and driverctl and every distro and
> user that's created a custom script outside of those.  Even changing
> from "vfio-pci" to "vfio-pci*" is a hurdle.

Sure, but it isn't like a major architectural shift, nor is it
mandatory unless you start using this new hardware class.

Userspace changes when we add kernel functionality.. The kernel just
has to keep working the way it used to for old functionality.

> > Well, I read through the Intel GPU driver and this is how I felt it
> > works. It doesn't even check the firmware bit unless certain PCI IDs
> > are matched first.
> 
> The IDs being only the PCI vendor ID and class code.  

I don't mean how vfio works, I mean how the Intel GPU driver works.

eg:

psb_pci_probe()
 psb_driver_load()
  psb_intel_opregion_setup()
           if (memcmp(base, OPREGION_SIGNATURE, 16)) {

i915_pci_probe()
 i915_driver_probe()
  i915_driver_hw_probe()
   intel_opregion_setup()
	if (memcmp(buf, OPREGION_SIGNATURE, 16)) {

All of these memcmp's are protected by exact id_tables hung off the
pci_driver's id_table.

VFIO is the different case. In this case the ID match confirms that
the config space has the ASLS dword at the fixed offset. If the ID
doesn't match nothing should read the ASLS offset.

> > For NVIDIA GPU Max checked internally and we saw it looks very much
> > like how Intel GPU works. Only some PCI IDs trigger checking on the
> > feature the firmware thing is linked to.
> 
> And as Alexey noted, the table came up incomplete.  But also those same
> devices exist on platforms where this extension is completely
> irrelevant.

I understood he meant that NVIDIA GPUs *without* NVLINK can exist, but
the ID table we have here is supposed to be the NVLINK compatible
ID's.

> So because we don't check for an Intel specific graphics firmware table
> when binding to Realtek NIC, we can leap to the conclusion that there
> must be a concise id_table we can create for IGD support?

Concise? No, but we can see *today* what the ID table is supposed to
be by just looking at the three probe functions that touch
OPREGION_SIGNATURE.

> There's a giant assumption above that I'm missing.  Are you expecting
> that vendors are actually going to keep up with submitting device IDs
> that they claim to have tested and support with vfio-pci and all other
> devices won't be allowed to bind?  That would single handedly destroy
> any non-enterprise use cases of vfio-pci.

Why not? They do it for the in-tree GPU drivers today! The ID table
for Intel GPU is even in a *header file* and we can just #include it
into vfio igd as well.

> So unless you want to do some bitkeeper archaeology, we've always
> allowed driver probes to fail and fall through to the next one, not
> even complaining with -ENODEV.  In practice it hasn't been an issue
> because how many drivers do you expect to have that would even try to
> claim a device.  

Do you know of anything using this ability? It might be helpful

> Ordering is only important when there's a catch-all so we need to
> figure out how to make that last among a class of drivers that will
> attempt to claim a device.  The softdep is a bit of a hack to do
> that, I'll admit, but I don't see how the alternate driver flavor
> universe solves having a catch-all either.

Haven't entirely got there yet, but I think the catch all probably has
to be handled by userspace udev/kmod in some way, as it is the only
thing that knows if there is a more specific module to load. This is
the biggest problem..

And again, I feel this is all a big tangent, especially now that HCH
wants to delete the nvlink stuff we should just leave igd alone.

Jason

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-23 19:32                               ` Jason Gunthorpe
@ 2021-03-24  2:39                                 ` Alexey Kardashevskiy
  2021-03-29 23:10                                 ` Alex Williamson
  1 sibling, 0 replies; 53+ messages in thread
From: Alexey Kardashevskiy @ 2021-03-24  2:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Christoph Hellwig, Max Gurtovoy, cohuck, kvm, linux-kernel,
	liranl, oren, tzahio, leonro, yarong, aviadye, shahafs, artemp,
	kwankhede, ACurrid, cjia, yishaih, mjrosato



On 24/03/2021 06:32, Jason Gunthorpe wrote:

>>> For NVIDIA GPU Max checked internally and we saw it looks very much
>>> like how Intel GPU works. Only some PCI IDs trigger checking on the
>>> feature the firmware thing is linked to.
>>
>> And as Alexey noted, the table came up incomplete.  But also those same
>> devices exist on platforms where this extension is completely
>> irrelevant.
> 
> I understood he meant that NVIDIA GPUs *without* NVLINK can exist, but
> the ID table we have here is supposed to be the NVLINK compatible
> ID's.


I also meant there are more (than in the proposed list)  GPUs with 
NVLink which will work on P9.


-- 
Alexey

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-23 19:32                               ` Jason Gunthorpe
  2021-03-24  2:39                                 ` Alexey Kardashevskiy
@ 2021-03-29 23:10                                 ` Alex Williamson
  2021-04-01 13:04                                   ` Cornelia Huck
  2021-04-01 13:12                                   ` Jason Gunthorpe
  1 sibling, 2 replies; 53+ messages in thread
From: Alex Williamson @ 2021-03-29 23:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Tue, 23 Mar 2021 16:32:13 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Mar 22, 2021 at 10:40:16AM -0600, Alex Williamson wrote:
> 
> > Of course if you start looking at features like migration support,
> > that's more than likely not simply an additional region with optional
> > information, it would need to interact with the actual state of the
> > device.  For those, I would very much support use of a specific
> > id_table.  That's not these.  
> 
> What I don't understand is why do we need two different ways of
> inserting vendor code?

Because a PCI id table only identifies the device, these drivers are
looking for a device in the context of firmware dependencies.

> > > new_id and driver_override should probably be in that disable list
> > > too..  
> > 
> > We don't have this other world yet, nor is it clear that we will have
> > it.  
> 
> We do today, it is obscure, but there is a whole set of config options
> designed to disable the unsafe kernel features. Kernels booted with
> secure boot and signed modules tend to enable a lot of them, for
> instance. The people working on the IMA stuff tend to enable a lot
> more as you can defeat the purpose of IMA if you can hijack the
> kernel.
> 
> > What sort of id_table is the base vfio-pci driver expected to use?  
> 
> If it has a match table it would be all match, this is why I called it
> a "universal driver"
> 
> If we have a flavour then the flavour controls the activation of
> VFIO, not new_id or driver_override, and in vfio flavour mode we can
> have an all match table, if we can resolve how to choose between two
> drivers with overlapping matches.
> 
> > > > > This is why I want to try for fine grained autoloading first. It
> > > > > really is the elegant solution if we can work it out.    
> > > > 
> > > > I just don't see how we create a manageable change to userspace.    
> > > 
> > > I'm not sure I understand. Even if we add a new sysfs to set some
> > > flavour then that is a pretty trivial change for userspace to move
> > > from driver_override?  
> > 
> > Perhaps for some definition of trivial that I'm not familiar with.
> > We're talking about changing libvirt and driverctl and every distro and
> > user that's created a custom script outside of those.  Even changing
> > from "vfio-pci" to "vfio-pci*" is a hurdle.  
> 
> Sure, but it isn't like a major architectural shift, nor is it
> mandatory unless you start using this new hardware class.
> 
> Userspace changes when we add kernel functionality.. The kernel just
> has to keep working the way it used to for old functionality.

Seems like we're bound to keep igd in the core as you propose below.

> > > Well, I read through the Intel GPU driver and this is how I felt it
> > > works. It doesn't even check the firmware bit unless certain PCI IDs
> > > are matched first.  
> > 
> > The IDs being only the PCI vendor ID and class code.    
> 
> I don't mean how vfio works, I mean how the Intel GPU driver works.
> 
> eg:
> 
> psb_pci_probe()
>  psb_driver_load()
>   psb_intel_opregion_setup()
>            if (memcmp(base, OPREGION_SIGNATURE, 16)) {
> 
> i915_pci_probe()
>  i915_driver_probe()
>   i915_driver_hw_probe()
>    intel_opregion_setup()
> 	if (memcmp(buf, OPREGION_SIGNATURE, 16)) {
> 
> All of these memcmp's are protected by exact id_tables hung off the
> pci_driver's id_table.
> 
> VFIO is the different case. In this case the ID match confirms that
> the config space has the ASLS dword at the fixed offset. If the ID
> doesn't match nothing should read the ASLS offset.
> 
> > > For NVIDIA GPU Max checked internally and we saw it looks very much
> > > like how Intel GPU works. Only some PCI IDs trigger checking on the
> > > feature the firmware thing is linked to.  
> > 
> > And as Alexey noted, the table came up incomplete.  But also those same
> > devices exist on platforms where this extension is completely
> > irrelevant.  
> 
> I understood he meant that NVIDIA GPUs *without* NVLINK can exist, but
> the ID table we have here is supposed to be the NVLINK compatible
> ID's.

Those IDs are just for the SXM2 variants of the device that can
exist on a variety of platforms, only one of which includes the
firmware tables to activate the vfio support.

> > So because we don't check for an Intel specific graphics firmware table
> > when binding to Realtek NIC, we can leap to the conclusion that there
> > must be a concise id_table we can create for IGD support?  
> 
> Concise? No, but we can see *today* what the ID table is supposed to
> be by just looking at the three probe functions that touch
> OPREGION_SIGNATURE.
> 
> > There's a giant assumption above that I'm missing.  Are you expecting
> > that vendors are actually going to keep up with submitting device IDs
> > that they claim to have tested and support with vfio-pci and all other
> > devices won't be allowed to bind?  That would single handedly destroy
> > any non-enterprise use cases of vfio-pci.  
> 
> Why not? They do it for the in-tree GPU drivers today! The ID table
> for Intel GPU is even in a *header file* and we can just #include it
> into vfio igd as well.

Are you volunteering to maintain the vfio-pci-igd id_table, complete
with the implicit expectation that those devices are known to work?
Part of the disconnect we have here might be the intended level of
support.  There's a Kconfig option around vfio igd support for more
than one reason.

I think you're looking for a significant inflection in vendor's stated
support for vfio use cases, beyond the "best-effort, give it a try",
that we currently have.  In some ways I look forward to that, so long
as users can also use it as they do today (maybe not enterprise users).
I sort of see imposing an id_table on igd support as trying to impose
that "vendor condoned" use case before we actually have a vendor
condoning it (or signing up to maintain an id table).

> > So unless you want to do some bitkeeper archaeology, we've always
> > allowed driver probes to fail and fall through to the next one, not
> > even complaining with -ENODEV.  In practice it hasn't been an issue
> > because how many drivers do you expect to have that would even try to
> > claim a device.    
> 
> Do you know of anything using this ability? It might be helpful

I don't.

> > Ordering is only important when there's a catch-all so we need to
> > figure out how to make that last among a class of drivers that will
> > attempt to claim a device.  The softdep is a bit of a hack to do
> > that, I'll admit, but I don't see how the alternate driver flavor
> > universe solves having a catch-all either.  
> 
> Haven't entirely got there yet, but I think the catch all probably has
> to be handled by userspace udev/kmod in some way, as it is the only
> thing that knows if there is a more specific module to load. This is
> the biggest problem..
> 
> And again, I feel this is all a big tangent, especially now that HCH
> wants to delete the nvlink stuff we should just leave igd alone.

Determining which things stay in vfio-pci-core and which things are
split to variant drivers and how those variant drivers can match the
devices they intend to support seems very inline with this series.  If
igd stays as part of vfio-pci-core then I think we're drawing a
parallel to z-pci support, where a significant part of that support is
a set of extra data structures exposed through capabilities to support
userspace use of the device.  Therefore extra regions or data
structures through capabilities, where we're not changing device
access, except as required for the platform (not the device) seem to be
things that fit within the core, right?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-29 23:10                                 ` Alex Williamson
@ 2021-04-01 13:04                                   ` Cornelia Huck
  2021-04-01 13:12                                   ` Jason Gunthorpe
  1 sibling, 0 replies; 53+ messages in thread
From: Cornelia Huck @ 2021-04-01 13:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Christoph Hellwig, Max Gurtovoy,
	Alexey Kardashevskiy, kvm, linux-kernel, liranl, oren, tzahio,
	leonro, yarong, aviadye, shahafs, artemp, kwankhede, ACurrid,
	cjia, yishaih, mjrosato

On Mon, 29 Mar 2021 17:10:53 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 23 Mar 2021 16:32:13 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Mar 22, 2021 at 10:40:16AM -0600, Alex Williamson wrote:

> > > So unless you want to do some bitkeeper archaeology, we've always
> > > allowed driver probes to fail and fall through to the next one, not
> > > even complaining with -ENODEV.  In practice it hasn't been an issue
> > > because how many drivers do you expect to have that would even try to
> > > claim a device.      
> > 
> > Do you know of anything using this ability? It might be helpful  
> 
> I don't.

I've been trying to remember why I added that patch to ignore all
errors (rather than only -ENODEV), but I suspect it might have been
related to the concurrent probing stuff I tried to implement back then.
The one instance of drivers matching to the same id I recall (s390
ctc/lcs) is actually not handled on the individual device level, but in
the meta ccwgroup driver; I don't remember anything else in the s390
case.

> 
> > > Ordering is only important when there's a catch-all so we need to
> > > figure out how to make that last among a class of drivers that will
> > > attempt to claim a device.  The softdep is a bit of a hack to do
> > > that, I'll admit, but I don't see how the alternate driver flavor
> > > universe solves having a catch-all either.    
> > 
> > Haven't entirely got there yet, but I think the catch all probably has
> > to be handled by userspace udev/kmod in some way, as it is the only
> > thing that knows if there is a more specific module to load. This is
> > the biggest problem..
> > 
> > And again, I feel this is all a big tangent, especially now that HCH
> > wants to delete the nvlink stuff we should just leave igd alone.  
> 
> Determining which things stay in vfio-pci-core and which things are
> split to variant drivers and how those variant drivers can match the
> devices they intend to support seems very inline with this series.  If
> igd stays as part of vfio-pci-core then I think we're drawing a
> parallel to z-pci support, where a significant part of that support is
> a set of extra data structures exposed through capabilities to support
> userspace use of the device.  Therefore extra regions or data
> structures through capabilities, where we're not changing device
> access, except as required for the platform (not the device) seem to be
> things that fit within the core, right?  Thanks,
> 
> Alex

As we are only talking about extra data governed by a capability, I
don't really see a problem with keeping it in the vfio core.

For those devices that need more specialized treatment, maybe we need
some kind of priority-based matching? I.e., if we match a device with
drivers, start with the one with highest priority (the specialized
one), and have the generic driver at the lowest priority. A
higher-priority driver added later on should not affect already bound
devices (and would need manual intervention again).

[I think this has come up in other places in the past as well, but I
don't have any concrete pointers handy.]
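
Purely as an illustration of that priority idea (nothing like this
exists in the driver core today; all names below are made up):

struct variant_driver {
        const char *name;
        int priority;                   /* catch-all vfio-pci would be 0 */
        bool (*match)(struct pci_dev *pdev);
};

static struct variant_driver *pick_variant(struct variant_driver **drvs,
                                           int count, struct pci_dev *pdev)
{
        struct variant_driver *best = NULL;
        int i;

        for (i = 0; i < count; i++) {
                if (!drvs[i]->match(pdev))
                        continue;
                if (!best || drvs[i]->priority > best->priority)
                        best = drvs[i];         /* most specialized wins */
        }
        return best;
}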


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-03-29 23:10                                 ` Alex Williamson
  2021-04-01 13:04                                   ` Cornelia Huck
@ 2021-04-01 13:12                                   ` Jason Gunthorpe
  2021-04-01 21:49                                     ` Alex Williamson
  1 sibling, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2021-04-01 13:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Mon, Mar 29, 2021 at 05:10:53PM -0600, Alex Williamson wrote:
> On Tue, 23 Mar 2021 16:32:13 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Mar 22, 2021 at 10:40:16AM -0600, Alex Williamson wrote:
> > 
> > > Of course if you start looking at features like migration support,
> > > that's more than likely not simply an additional region with optional
> > > information, it would need to interact with the actual state of the
> > > device.  For those, I would very much support use of a specific
> > > id_table.  That's not these.  
> > 
> > What I don't understand is why do we need two different ways of
> > inserting vendor code?
> 
> Because a PCI id table only identifies the device, these drivers are
> looking for a device in the context of firmware dependencies.

The firmware dependencies only exist for a defined list of ID's, so I
don't entirely agree with this statement. I agree with below though,
so lets leave it be.

> > I understood he meant that NVIDIA GPUs *without* NVLINK can exist, but
> > the ID table we have here is supposed to be the NVLINK compatible
> > ID's.
> 
> Those IDs are just for the SXM2 variants of the device that can
> exist on a variety of platforms, only one of which includes the
> firmware tables to activate the vfio support.

AFAIK, SXM2 is a special physical form factor that has the nvlink
physical connection - it is only for this specific generation of Power
servers, which can accept the specific nvlink those cards have.

> I think you're looking for a significant inflection in vendor's stated
> support for vfio use cases, beyond the "best-effort, give it a try",
> that we currently have.

I see, so they don't want to. Let's leave it, then.

Though if Xe breaks everything they need to add/maintain a proper ID
table, not more hackery.

> > And again, I feel this is all a big tangent, especially now that HCH
> > wants to delete the nvlink stuff, we should just leave igd alone.
> 
> Determining which things stay in vfio-pci-core and which things are
> split to variant drivers and how those variant drivers can match the
> > devices they intend to support seems very much in line with this series.  

IMHO, the main litmus test for core is whether variant drivers will need
it or not.

No variant driver should be stacked on an igd device, or if one someday
is, it should implement the special igd hackery internally (and have a
proper ID table). So when we split it up, igd goes into vfio_pci.ko as
some special behavior that vfio_pci.ko's universal driver provides for IGD.

Every variant driver will still need the zdev data to be exposed to
userspace, and every PCI device on s390 has that extra information. So
zdev goes to vfio_pci_core.ko.

Future things going into vfio_pci.ko need a really good reason why
they can't be variant drivers instead.

Jason
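
To make the split concrete: a variant driver along the lines discussed
here would carry its own PCI ID table and still probe the platform
firmware, bowing out with -ENODEV when the firmware pieces are absent
so the device can be bound to plain vfio-pci instead. A rough sketch
only - the driver name, device ID and device-tree property below are
illustrative, not code from this series:

#include <linux/module.h>
#include <linux/of.h>
#include <linux/pci.h>

/* Example ID only: one of the SXM2 V100 variants mentioned above. */
static const struct pci_device_id hypothetical_nvlink_ids[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x1db1) },
	{ 0, }
};

static int hypothetical_nvlink_probe(struct pci_dev *pdev,
				     const struct pci_device_id *id)
{
	/*
	 * The ID table only says "this could be one of ours"; the
	 * firmware check decides whether the platform actually wired
	 * up the link.  The property name is made up for the sketch.
	 */
	if (!of_find_property(pdev->dev.of_node, "ibm,nvlink", NULL))
		return -ENODEV;	/* fall back to plain vfio-pci */

	/* ... register with vfio_pci_core and add the extra regions ... */
	return 0;
}

static struct pci_driver hypothetical_nvlink_driver = {
	.name		= "hypothetical-nvlink-vfio-pci",
	.id_table	= hypothetical_nvlink_ids,
	.probe		= hypothetical_nvlink_probe,
};
module_pci_driver(hypothetical_nvlink_driver);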

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  2021-04-01 13:12                                   ` Jason Gunthorpe
@ 2021-04-01 21:49                                     ` Alex Williamson
  0 siblings, 0 replies; 53+ messages in thread
From: Alex Williamson @ 2021-04-01 21:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Max Gurtovoy, Alexey Kardashevskiy, cohuck,
	kvm, linux-kernel, liranl, oren, tzahio, leonro, yarong, aviadye,
	shahafs, artemp, kwankhede, ACurrid, cjia, yishaih, mjrosato

On Thu, 1 Apr 2021 10:12:27 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Mar 29, 2021 at 05:10:53PM -0600, Alex Williamson wrote:
> > On Tue, 23 Mar 2021 16:32:13 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Mar 22, 2021 at 10:40:16AM -0600, Alex Williamson wrote:
> > >   
> > > > Of course if you start looking at features like migration support,
> > > > that's more than likely not simply an additional region with optional
> > > > information, it would need to interact with the actual state of the
> > > > device.  For those, I would very much support use of a specific
> > > > id_table.  That's not these.    
> > > 
> > > What I don't understand is why do we need two different ways of
> > > inserting vendor code?  
> > 
> > Because a PCI id table only identifies the device, these drivers are
> > looking for a device in the context of firmware dependencies.  
> 
> The firmware dependencies only exist for a defined list of IDs, so I
> don't entirely agree with this statement. I agree with the below,
> though, so let's leave it be.
> 
> > > I understood he meant that NVIDIA GPUs *without* NVLINK can exist, but
> > > the ID table we have here is supposed to be the NVLINK-compatible
> > > IDs.
> > 
> > Those IDs are just for the SXM2 variants of the device that can
> > exist on a variety of platforms, only one of which includes the
> > firmware tables to activate the vfio support.  
> 
> AFAIK, SXM2 is a special physical form factor that has the nvlink
> physical connection - it is only for this specific generation of Power
> servers, which can accept the specific nvlink those cards have.

SXM2 is not unique to Power; there are various x86 systems that support
the interface, everything from NVIDIA's own line of DGX systems to
various vendor systems, all the way to VARs like Super Micro and
Gigabyte.

> > I think you're looking for a significant inflection in vendor's stated
> > support for vfio use cases, beyond the "best-effort, give it a try",
> > that we currently have.  
> 
> I see, so they don't want to. Let's leave it, then.
> 
> Though if Xe breaks everything they need to add/maintain a proper ID
> table, not more hackery.

e4eccb853664 ("vfio/pci: Bypass IGD init in case of -ENODEV") is
supposed to enable Xe, where the IGD code is expected to return -ENODEV
and we go on with the base vfio-pci support.
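
For reference, the pattern that commit introduces looks roughly like
this inside vfio_pci_enable() (paraphrased sketch, not the verbatim
diff):

	if (vfio_pci_is_vga(pdev) &&
	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
		ret = vfio_pci_igd_init(vdev);
		/* -ENODEV: no IGD-specific support for this device,
		 * keep going as plain vfio-pci (the Xe case).  Any
		 * other error is still fatal.
		 */
		if (ret && ret != -ENODEV) {
			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
			goto disable_exit;
		}
	}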
 
> > > And again, I feel this is all a big tangent, especially now that
> > > HCH wants to delete the nvlink stuff, we should just leave igd
> > > alone.  
> > 
> > Determining which things stay in vfio-pci-core and which things are
> > split to variant drivers and how those variant drivers can match the
> > devices they intend to support seems very much in line with this series.
> >   
> 
> IMHO, the main litmus test for core is whether variant drivers will need
> it or not.
> 
> No variant driver should be stacked on an igd device, or if one someday
> is, it should implement the special igd hackery internally (and have a
> proper ID table). So when we split it up, igd goes into vfio_pci.ko as
> some special behavior that vfio_pci.ko's universal driver provides for IGD.
> 
> Every variant driver will still need the zdev data to be exposed to
> userspace, and every PCI device on s390 has that extra information. So
> zdev goes to vfio_pci_core.ko.
> 
> Future things going into vfio_pci.ko need a really good reason why
> they can't be variant drivers instead.

That sounds fair.  Thanks,

Alex
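
As a footnote on the "extra data governed by a capability" point above:
that kind of per-device information rides on the standard VFIO
capability chain, e.g. hung off VFIO_DEVICE_GET_INFO. A sketch with a
made-up payload - the real s390 zPCI capability layouts differ:

#include <linux/vfio.h>		/* struct vfio_info_cap_header */

/* Hypothetical device-info capability; only the header is the real
 * uapi structure, the payload fields are invented for illustration.
 */
struct hypothetical_zdev_cap {
	struct vfio_info_cap_header header;	/* id, version, next */
	__u32 fid;	/* e.g. a function identifier */
	__u32 gid;	/* e.g. a function group id */
};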


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2021-04-01 21:49 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
2021-03-09  8:33 ` [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c Max Gurtovoy
2021-03-09  8:33 ` [PATCH 2/9] vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h Max Gurtovoy
2021-03-09  8:33 ` [PATCH 3/9] vfio-pci: rename vfio_pci_device to vfio_pci_core_device Max Gurtovoy
2021-03-09  8:33 ` [PATCH 4/9] vfio-pci: introduce vfio_pci_core subsystem driver Max Gurtovoy
2021-03-09  8:33 ` [PATCH 5/9] vfio/pci: introduce vfio_pci_device structure Max Gurtovoy
2021-03-09  8:33 ` [PATCH 6/9] vfio-pci-core: export vfio_pci_register_dev_region function Max Gurtovoy
2021-03-09  8:33 ` [PATCH 7/9] vfio/pci_core: split nvlink2 to nvlink2gpu and npu2 Max Gurtovoy
2021-03-10  8:08   ` Christoph Hellwig
2021-03-09  8:33 ` [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers Max Gurtovoy
2021-03-10  6:39   ` Alexey Kardashevskiy
2021-03-10 12:57     ` Max Gurtovoy
2021-03-10 13:02       ` Jason Gunthorpe
2021-03-10 14:24         ` Alexey Kardashevskiy
2021-03-10 19:40           ` Jason Gunthorpe
2021-03-11  1:20             ` Alexey Kardashevskiy
2021-03-11  1:34               ` Jason Gunthorpe
2021-03-11  1:42                 ` Alexey Kardashevskiy
2021-03-11  2:00                   ` Jason Gunthorpe
2021-03-11  7:54                     ` Alexey Kardashevskiy
2021-03-11  9:44                       ` Max Gurtovoy
2021-03-11 16:51                         ` Jason Gunthorpe
2021-03-11 17:01                       ` Jason Gunthorpe
2021-03-10 14:19       ` Alexey Kardashevskiy
2021-03-11  1:10         ` Max Gurtovoy
2021-03-19 15:23       ` Alex Williamson
2021-03-19 16:17         ` Jason Gunthorpe
2021-03-19 16:20           ` Christoph Hellwig
2021-03-19 16:28             ` Jason Gunthorpe
2021-03-19 16:34               ` Christoph Hellwig
2021-03-19 17:36                 ` Alex Williamson
2021-03-19 20:07                   ` Jason Gunthorpe
2021-03-19 21:08                     ` Alex Williamson
2021-03-19 22:59                       ` Jason Gunthorpe
2021-03-20  4:40                         ` Alex Williamson
2021-03-21 12:58                           ` Jason Gunthorpe
2021-03-22 16:40                             ` Alex Williamson
2021-03-23 19:32                               ` Jason Gunthorpe
2021-03-24  2:39                                 ` Alexey Kardashevskiy
2021-03-29 23:10                                 ` Alex Williamson
2021-04-01 13:04                                   ` Cornelia Huck
2021-04-01 13:12                                   ` Jason Gunthorpe
2021-04-01 21:49                                     ` Alex Williamson
2021-03-22 15:11                     ` Christoph Hellwig
2021-03-22 16:44                       ` Jason Gunthorpe
2021-03-23 13:17                         ` Christoph Hellwig
2021-03-23 13:42                           ` Jason Gunthorpe
2021-03-09  8:33 ` [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver Max Gurtovoy
2021-03-10  8:15   ` Christoph Hellwig
2021-03-10 12:31     ` Jason Gunthorpe
2021-03-11 11:37       ` Christoph Hellwig
2021-03-11 12:09         ` Max Gurtovoy
2021-03-11 15:43         ` Jason Gunthorpe
