[PATCH mlx5-next v1 0/5] Dynamically assign MSI-X vectors count

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH mlx5-next v1 0/5] Dynamically assign MSI-X vectors count
@ 2021-01-10 15:07 Leon Romanovsky
  2021-01-10 15:07 ` [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
                   ` (4 more replies)
  0 siblings, 5 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-10 15:07 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson

From: Leon Romanovsky <leonro@nvidia.com>

Changelog
v1:
 * Improved wording and commit messages of first PCI patch
 * Added extra PCI patch to provide total number of MSI-X vectors
 * Prohibited read of vf_msix_vec sysfs file if driver doesn't support write
 * Removed extra function definition in pci.h
v0: https://lore.kernel.org/linux-pci/20210103082440.34994-1-leon@kernel.org

--------------------------------------------------------------------
Hi,

The number of MSI-X vectors is PCI property visible through lspci, that
field is read-only and configured by the device.

The static assignment of an amount of MSI-X vectors doesn't allow utilize
the newly created VF because it is not known to the device the future load
and configuration where that VF will be used.

The VFs are created on the hypervisor and forwarded to the VMs that have
different properties (for example number of CPUs).

To overcome the inefficiency in the spread of such MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors, before VF is initialized and bounded to the driver.

Before this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

Configuration script:
1. Start fresh
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
modprobe -q -r mlx5_ib mlx5_core
2. Ensure that driver doesn't run and it is safe to change MSI-X
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_drivers_autoprobe
3. Load driver for the PF
modprobe mlx5_core
4. Configure one of the VFs with new number
echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
echo 21 > /sys/bus/pci/devices/0000\:08\:00.2/vf_msix_vec

After this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=21 Masked-

Thanks

Leon Romanovsky (5):
  PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  net/mlx5: Add dynamic MSI-X capabilities bits
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Allow to the users to configure number of MSI-X vectors

 Documentation/ABI/testing/sysfs-bus-pci       | 34 +++++++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  5 +
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  6 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 62 +++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 52 ++++++++++-
 drivers/pci/iov.c                             | 93 +++++++++++++++++++
 drivers/pci/msi.c                             | 29 ++++++
 drivers/pci/pci-sysfs.c                       |  1 +
 drivers/pci/pci.h                             |  5 +
 include/linux/mlx5/mlx5_ifc.h                 | 11 ++-
 include/linux/pci.h                           | 10 +-
 11 files changed, 304 insertions(+), 4 deletions(-)

--
2.29.2

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-10 15:07 [PATCH mlx5-next v1 0/5] Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-01-10 15:07 ` Leon Romanovsky
  2021-01-11 19:30   ` Alexander Duyck
  2021-01-13 17:50   ` Alex Williamson
  2021-01-10 15:07 ` [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors Leon Romanovsky
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-10 15:07 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson

From: Leon Romanovsky <leonro@nvidia.com>

Extend PCI sysfs interface with a new callback that allows configure
the number of MSI-X vectors for specific SR-IO VF. This is needed
to optimize the performance of newly bound devices by allocating
the number of vectors based on the administrator knowledge of targeted VM.

This function is applicable for SR-IOV VF because such devices allocate
their MSI-X table before they will run on the VMs and HW can't guess the
right number of vectors, so the HW allocates them statically and equally.

The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
for the VFs and it is writable as long as a driver is not bounded to the VF.

The values accepted are:
 * > 0 - this will be number reported by the VF's MSI-X capability
 * < 0 - not valid
 * = 0 - will reset to the device default value

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
 drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
 drivers/pci/msi.c                       | 29 ++++++++++++
 drivers/pci/pci-sysfs.c                 |  1 +
 drivers/pci/pci.h                       |  2 +
 include/linux/pci.h                     |  8 +++-
 6 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 25c9c39770c6..05e26e5da54e 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -375,3 +375,23 @@ Description:
 		The value comes from the PCI kernel device state and can be one
 		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
 		The file is read only.
+
+What:		/sys/bus/pci/devices/.../vf_msix_vec
+Date:		December 2020
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with the SR-IOV VFs.
+		It allows configuration of the number of MSI-X vectors for
+		the VF. This is needed to optimize performance of newly bound
+		devices by allocating the number of vectors based on the
+		administrator knowledge of targeted VM.
+
+		The values accepted are:
+		 * > 0 - this will be number reported by the VF's MSI-X
+			 capability
+		 * < 0 - not valid
+		 * = 0 - will reset to the device default value
+
+		The file is writable if the PF is bound to a driver that
+		supports the ->sriov_set_msix_vec_count() callback and there
+		is no driver bound to the VF.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4afd4ee4f7f0..42c0df4158d1 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * vf_id) & 0xff;
 }
+EXPORT_SYMBOL(pci_iov_virtfn_devfn);

 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
@@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
 	.is_visible = sriov_attrs_are_visible,
 };

+#ifdef CONFIG_PCI_MSI
+static ssize_t vf_msix_vec_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int numb = pci_msix_vec_count(pdev);
+	struct pci_dev *pfdev;
+
+	if (numb < 0)
+		return numb;
+
+	pfdev = pci_physfn(pdev);
+	if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
+		return -EOPNOTSUPP;
+
+	return sprintf(buf, "%d\n", numb);
+}
+
+static ssize_t vf_msix_vec_store(struct device *dev,
+				 struct device_attribute *attr, const char *buf,
+				 size_t count)
+{
+	struct pci_dev *vf_dev = to_pci_dev(dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	ret = pci_set_msix_vec_count(vf_dev, val);
+	if (ret)
+		return ret;
+
+	return count;
+}
+static DEVICE_ATTR_RW(vf_msix_vec);
+#endif
+
+static struct attribute *sriov_vf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_vf_msix_vec.attr,
+#endif
+	NULL,
+};
+
+static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+
+	if (dev_is_pf(dev))
+		return 0;
+
+	return a->mode;
+}
+
+const struct attribute_group sriov_vf_dev_attr_group = {
+	.attrs = sriov_vf_dev_attrs,
+	.is_visible = sriov_vf_attrs_are_visible,
+};
+
 int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
 {
 	return 0;
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 3162f88fe940..20705ca94666 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -991,6 +991,35 @@ int pci_msix_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msix_vec_count);

+/**
+ * pci_set_msix_vec_count - change the reported number of MSI-X vectors
+ * This function is applicable for SR-IOV VF because such devices allocate
+ * their MSI-X table before they will run on the VMs and HW can't guess the
+ * right number of vectors, so the HW allocates them statically and equally.
+ * @dev: VF device that is going to be changed
+ * @numb: amount of MSI-X vectors
+ **/
+int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
+{
+	struct pci_dev *pdev = pci_physfn(dev);
+
+	if (!dev->msix_cap || !pdev->msix_cap)
+		return -EINVAL;
+
+	if (dev->driver || !pdev->driver ||
+	    !pdev->driver->sriov_set_msix_vec_count)
+		return -EOPNOTSUPP;
+
+	if (numb < 0)
+		/*
+		 * We don't support negative numbers for now,
+		 * but maybe in the future it will make sense.
+		 */
+		return -EINVAL;
+
+	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
+}
+
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
 			     int nvec, struct irq_affinity *affd, int flags)
 {
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index fb072f4b3176..0af2222643c2 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
 	&pci_dev_hp_attr_group,
 #ifdef CONFIG_PCI_IOV
 	&sriov_dev_attr_group,
+	&sriov_vf_dev_attr_group,
 #endif
 	&pci_bridge_attr_group,
 	&pcie_dev_attr_group,
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5c59365092fa..1fd273077637 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;

 #ifdef CONFIG_PCI_MSI
 void pci_no_msi(void);
+int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
 #else
 static inline void pci_no_msi(void) { }
 #endif
@@ -502,6 +503,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
 void pci_restore_iov_state(struct pci_dev *dev);
 int pci_iov_bus_range(struct pci_bus *bus);
 extern const struct attribute_group sriov_dev_attr_group;
+extern const struct attribute_group sriov_vf_dev_attr_group;
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b32126d26997..a17cfc28eb66 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -856,6 +856,8 @@ struct module;
  *		e.g. drivers/net/e100.c.
  * @sriov_configure: Optional driver callback to allow configuration of
  *		number of VFs to enable via sysfs "sriov_numvfs" file.
+ * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
+ *              exposed by the sysfs "vf_msix_vec" entry.
  * @err_handler: See Documentation/PCI/pci-error-recovery.rst
  * @groups:	Sysfs attribute groups.
  * @driver:	Driver model structure.
@@ -871,6 +873,7 @@ struct pci_driver {
 	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
 	void (*shutdown)(struct pci_dev *dev);
 	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
+	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
 	const struct pci_error_handlers *err_handler;
 	const struct attribute_group **groups;
 	struct device_driver	driver;
@@ -2057,7 +2060,6 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);

 #ifdef CONFIG_PCI_IOV
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
-int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);

 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
@@ -2402,6 +2404,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
 void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
 #endif

+#ifdef CONFIG_PCI_IOV
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
+#endif
+
 /* Provide the legacy pci_dma_* API */
 #include <linux/pci-dma-compat.h>

--
2.29.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-10 15:07 [PATCH mlx5-next v1 0/5] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-01-10 15:07 ` [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
@ 2021-01-10 15:07 ` Leon Romanovsky
  2021-01-11 19:30   ` Alexander Duyck
  2021-01-18 17:03   ` Greg KH
  2021-01-10 15:07 ` [PATCH mlx5-next v1 3/5] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-10 15:07 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson

From: Leon Romanovsky <leonro@nvidia.com>

Some SR-IOV capable devices provide an ability to configure specific
number of MSI-X vectors on their VF prior driver is probed on that VF.

In order to make management easy, provide new read-only sysfs file that
returns a total number of possible to configure MSI-X vectors.

cat /sys/bus/pci/devices/.../sriov_vf_total_msix
  = 0 - feature is not supported
  > 0 - total number of MSI-X vectors to consume by the VFs

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
 drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
 drivers/pci/pci.h                       |  3 +++
 include/linux/pci.h                     |  2 ++
 4 files changed, 50 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 05e26e5da54e..64e9b700acc9 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -395,3 +395,17 @@ Description:
 		The file is writable if the PF is bound to a driver that
 		supports the ->sriov_set_msix_vec_count() callback and there
 		is no driver bound to the VF.
+
+What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
+Date:		January 2021
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with the SR-IOV PFs.
+		It returns a total number of possible to configure MSI-X
+		vectors on the enabled VFs.
+
+		The values returned are:
+		 * > 0 - this will be total number possible to consume by VFs,
+		 * = 0 - feature is not supported
+
+		If no SR-IOV VFs are enabled, this value will return 0.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 42c0df4158d1..0a6ddf3230fd 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
 	return count;
 }

+static ssize_t sriov_vf_total_msix_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
+}
+
 static DEVICE_ATTR_RO(sriov_totalvfs);
 static DEVICE_ATTR_RW(sriov_numvfs);
 static DEVICE_ATTR_RO(sriov_offset);
 static DEVICE_ATTR_RO(sriov_stride);
 static DEVICE_ATTR_RO(sriov_vf_device);
 static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
+static DEVICE_ATTR_RO(sriov_vf_total_msix);

 static struct attribute *sriov_dev_attrs[] = {
 	&dev_attr_sriov_totalvfs.attr,
@@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
 	&dev_attr_sriov_stride.attr,
 	&dev_attr_sriov_vf_device.attr,
 	&dev_attr_sriov_drivers_autoprobe.attr,
+	&dev_attr_sriov_vf_total_msix.attr,
 	NULL,
 };

@@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
 		sysfs_remove_link(&dev->dev.kobj, "dep_link");

 	iov->num_VFs = 0;
+	iov->vf_total_msix = 0;
 	pci_iov_set_numvfs(dev, 0);
 }

@@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);

+/**
+ * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
+ * @dev: the PCI PF device
+ * @numb: the total number of MSI-X vector to consume by the VFs
+ *
+ * Sets the number of MSI-X vectors that is possible to consume by the VFs.
+ * This interface is complimentary part of the pci_set_msix_vec_count()
+ * that will be used to configure the required number on the VF.
+ */
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
+{
+	if (!dev->is_physfn || !dev->driver ||
+	    !dev->driver->sriov_set_msix_vec_count)
+		return;
+
+	dev->sriov->vf_total_msix = numb;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
+
 /**
  * pci_sriov_configure_simple - helper to configure SR-IOV
  * @dev: the PCI device
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1fd273077637..0fbe291eb0f2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -327,6 +327,9 @@ struct pci_sriov {
 	u16		subsystem_device; /* VF subsystem device */
 	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
+	int		vf_total_msix;  /* Total number of MSI-X vectors the VFs
+					 * can consume
+					 */
 };

 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index a17cfc28eb66..fd9ff1f42a09 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2074,6 +2074,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
 int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
 resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb);

 /* Arch may override these (weak) */
 int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
@@ -2114,6 +2115,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
 { return 0; }
 static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
+static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb) {}
 #endif

 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
--
2.29.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH mlx5-next v1 3/5] net/mlx5: Add dynamic MSI-X capabilities bits
  2021-01-10 15:07 [PATCH mlx5-next v1 0/5] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-01-10 15:07 ` [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
  2021-01-10 15:07 ` [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors Leon Romanovsky
@ 2021-01-10 15:07 ` Leon Romanovsky
  2021-01-10 15:07 ` [PATCH mlx5-next v1 4/5] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-01-10 15:07 ` [PATCH mlx5-next v1 5/5] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky
  4 siblings, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-10 15:07 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson

From: Leon Romanovsky <leonro@nvidia.com>

These new fields declare the number of MSI-X vectors that is
possible to allocate on the VF through PF configuration.

Value must be in range defined by min_dynamic_vf_msix_table_size
and max_dynamic_vf_msix_table_size.

The driver should continue to query its MSI-X table through PCI
configuration header.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 8c5d5fe58051..7ac614bc592a 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1656,7 +1656,16 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8	   reserved_at_6e0[0x10];
 	u8	   sf_base_id[0x10];

-	u8	   reserved_at_700[0x80];
+	u8	   reserved_at_700[0x8];
+	u8	   num_total_dynamic_vf_msix[0x18];
+	u8	   reserved_at_720[0x14];
+	u8	   dynamic_msix_table_size[0xc];
+	u8	   reserved_at_740[0xc];
+	u8	   min_dynamic_vf_msix_table_size[0x4];
+	u8	   reserved_at_750[0x4];
+	u8	   max_dynamic_vf_msix_table_size[0xc];
+
+	u8	   reserved_at_760[0x20];
 	u8	   vhca_tunnel_commands[0x40];
 	u8	   reserved_at_7c0[0x40];
 };
--
2.29.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH mlx5-next v1 4/5] net/mlx5: Dynamically assign MSI-X vectors count
  2021-01-10 15:07 [PATCH mlx5-next v1 0/5] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (2 preceding siblings ...)
  2021-01-10 15:07 ` [PATCH mlx5-next v1 3/5] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
@ 2021-01-10 15:07 ` Leon Romanovsky
  2021-01-10 15:07 ` [PATCH mlx5-next v1 5/5] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky
  4 siblings, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-10 15:07 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson

From: Leon Romanovsky <leonro@nvidia.com>

The number of MSI-X vectors is PCI property visible through lspci, that
field is read-only and configured by the device. The static assignment
of an amount of MSI-X vectors doesn't allow utilize the newly created
VF because it is not known to the device the future load and configuration
where that VF will be used.

To overcome the inefficiency in the spread of such MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors.

Such change immediately increases the amount of MSI-X vectors for the
system with @ VFs from 12 vectors per-VF, to be 32 vectors per-VF.

Before this patch:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

After this patch:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=32 Masked-

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    |  4 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  5 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 62 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 14 ++++-
 4 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index c08315b51fd3..8269cfbfc69d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -567,6 +567,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);

+	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
+		MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix,
+			 MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix));
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 0a0302ce7144..5babb4434a87 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -172,6 +172,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
 int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
+
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
+			    int msix_vec_count);
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
+
 struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 6fd974920394..135078e8dd55 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -55,6 +55,68 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev *dev, int vecidx)
 	return &irq_table->irq[vecidx];
 }

+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
+{
+	int num_vf_msix, min_msix, max_msix;
+
+	num_vf_msix = MLX5_CAP_GEN(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	return max(min(num_vf_msix / num_vfs, max_msix), min_msix);
+}
+
+/**
+ * mlx5_set_msix_vec_count() - Set dynamically allocated MSI-X to the VF
+ * @dev - PF to work on
+ * @function_id - internal PCI VF function id
+ * @msix_vec_count - Number of MSI-X to set
+ **/
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
+			    int msix_vec_count)
+{
+	int sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
+	int num_vf_msix, min_msix, max_msix;
+	void *hca_cap, *cap;
+	int ret;
+
+	num_vf_msix = MLX5_CAP_GEN(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	if (!MLX5_CAP_GEN(dev, vport_group_manager) || !mlx5_core_is_pf(dev))
+		return -EOPNOTSUPP;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	if (msix_vec_count < min_msix)
+		return -EINVAL;
+
+	if (msix_vec_count > max_msix)
+		return -EOVERFLOW;
+
+	hca_cap = kzalloc(sz, GFP_KERNEL);
+	if (!hca_cap)
+		return -ENOMEM;
+
+	cap = MLX5_ADDR_OF(set_hca_cap_in, hca_cap, capability);
+	MLX5_SET(cmd_hca_cap, cap, dynamic_msix_table_size, msix_vec_count);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, opcode, MLX5_CMD_OP_SET_HCA_CAP);
+	MLX5_SET(set_hca_cap_in, hca_cap, other_function, 1);
+	MLX5_SET(set_hca_cap_in, hca_cap, function_id, function_id);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1);
+	ret = mlx5_cmd_exec_in(dev, set_hca_cap, hca_cap);
+	kfree(hca_cap);
+	return ret;
+}
+
 int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 3094d20297a9..c59efb1e7a26 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -71,8 +71,7 @@ static int sriov_restore_guids(struct mlx5_core_dev *dev, int vf)
 static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 {
 	struct mlx5_core_sriov *sriov = &dev->priv.sriov;
-	int err;
-	int vf;
+	int err, vf, num_msix_count;

 	if (!MLX5_ESWITCH_MANAGER(dev))
 		goto enable_vfs_hca;
@@ -85,12 +84,23 @@ static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 	}

 enable_vfs_hca:
+	num_msix_count = mlx5_get_default_msix_vec_count(dev, num_vfs);
 	for (vf = 0; vf < num_vfs; vf++) {
 		err = mlx5_core_enable_hca(dev, vf + 1);
 		if (err) {
 			mlx5_core_warn(dev, "failed to enable VF %d (%d)\n", vf, err);
 			continue;
 		}
+
+		err = mlx5_set_msix_vec_count(dev, vf + 1, num_msix_count);
+		if (err) {
+			mlx5_core_warn(
+				dev,
+				"failed to set MSI-X vector counts VF %d, err %d\n",
+				vf, err);
+			continue;
+		}
+
 		sriov->vfs_ctx[vf].enabled = 1;
 		if (MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_IB) {
 			err = sriov_restore_guids(dev, vf);
--
2.29.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH mlx5-next v1 5/5] net/mlx5: Allow to the users to configure number of MSI-X vectors
  2021-01-10 15:07 [PATCH mlx5-next v1 0/5] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (3 preceding siblings ...)
  2021-01-10 15:07 ` [PATCH mlx5-next v1 4/5] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-01-10 15:07 ` Leon Romanovsky
  4 siblings, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-10 15:07 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson

From: Leon Romanovsky <leonro@nvidia.com>

Implement ability to configure MSI-X for the SR-IOV VFs.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    |  1 +
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  1 +
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  4 +-
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 38 +++++++++++++++++++
 4 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8269cfbfc69d..334b3b5077c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1647,6 +1647,7 @@ static struct pci_driver mlx5_core_driver = {
 	.shutdown	= shutdown,
 	.err_handler	= &mlx5_err_handler,
 	.sriov_configure   = mlx5_core_sriov_configure,
+	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
 };

 static void mlx5_core_verify_params(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 5babb4434a87..8a2523d2d43a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -138,6 +138,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev);
 int mlx5_sriov_attach(struct mlx5_core_dev *dev);
 void mlx5_sriov_detach(struct mlx5_core_dev *dev);
 int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs);
+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count);
 int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_create_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 135078e8dd55..65a761346385 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -59,7 +59,7 @@ int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
 {
 	int num_vf_msix, min_msix, max_msix;

-	num_vf_msix = MLX5_CAP_GEN(dev, num_total_dynamic_vf_msix);
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
 	if (!num_vf_msix)
 		return 0;

@@ -83,7 +83,7 @@ int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
 	void *hca_cap, *cap;
 	int ret;

-	num_vf_msix = MLX5_CAP_GEN(dev, num_total_dynamic_vf_msix);
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
 	if (!num_vf_msix)
 		return 0;

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index c59efb1e7a26..88b271ca34c2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -145,6 +145,7 @@ mlx5_device_disable_sriov(struct mlx5_core_dev *dev, int num_vfs, bool clear_vf)
 static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs)
 {
 	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
+	int num_vf_msix;
 	int err;

 	err = mlx5_device_enable_sriov(dev, num_vfs);
@@ -153,6 +154,8 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs)
 		return err;
 	}

+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	pci_sriov_set_vf_total_msix(pdev, num_vf_msix);
 	err = pci_enable_sriov(pdev, num_vfs);
 	if (err) {
 		mlx5_core_warn(dev, "pci_enable_sriov failed : %d\n", err);
@@ -188,6 +191,41 @@ int mlx5_core_sriov_configure(struct pci_dev *pdev, int num_vfs)
 	return err ? err : num_vfs;
 }

+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
+{
+	struct pci_dev *pf = pci_physfn(vf);
+	struct mlx5_core_sriov *sriov;
+	struct mlx5_core_dev *dev;
+	int num_vf_msix, id;
+
+	dev = pci_get_drvdata(pf);
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return -EOPNOTSUPP;
+
+	if (!msix_vec_count)
+		msix_vec_count =
+			mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf));
+
+	sriov = &dev->priv.sriov;
+
+	/* Reversed translation of PCI VF function number to the internal
+	 * function_id, which exists in the name of virtfn symlink.
+	 */
+	for (id = 0; id < pci_num_vf(pf); id++) {
+		if (!sriov->vfs_ctx[id].enabled)
+			continue;
+
+		if (vf->devfn == pci_iov_virtfn_devfn(pf, id))
+			break;
+	}
+
+	if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled)
+		return -EINVAL;
+
+	return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count);
+}
+
 int mlx5_sriov_attach(struct mlx5_core_dev *dev)
 {
 	if (!mlx5_core_is_pf(dev) || !pci_num_vf(dev->pdev))
--
2.29.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-10 15:07 ` [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
@ 2021-01-11 19:30   ` Alexander Duyck
  2021-01-12  3:25     ` Don Dutile
  2021-01-12  6:39     ` Leon Romanovsky
  2021-01-13 17:50   ` Alex Williamson
  1 sibling, 2 replies; 49+ messages in thread
From: Alexander Duyck @ 2021-01-11 19:30 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson

On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Extend PCI sysfs interface with a new callback that allows configure
> the number of MSI-X vectors for specific SR-IO VF. This is needed
> to optimize the performance of newly bound devices by allocating
> the number of vectors based on the administrator knowledge of targeted VM.
>
> This function is applicable for SR-IOV VF because such devices allocate
> their MSI-X table before they will run on the VMs and HW can't guess the
> right number of vectors, so the HW allocates them statically and equally.
>
> The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> for the VFs and it is writable as long as a driver is not bounded to the VF.
>
> The values accepted are:
>  * > 0 - this will be number reported by the VF's MSI-X capability
>  * < 0 - not valid
>  * = 0 - will reset to the device default value
>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
>  drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
>  drivers/pci/msi.c                       | 29 ++++++++++++
>  drivers/pci/pci-sysfs.c                 |  1 +
>  drivers/pci/pci.h                       |  2 +
>  include/linux/pci.h                     |  8 +++-
>  6 files changed, 121 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 25c9c39770c6..05e26e5da54e 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -375,3 +375,23 @@ Description:
>                 The value comes from the PCI kernel device state and can be one
>                 of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>                 The file is read only.
> +
> +What:          /sys/bus/pci/devices/.../vf_msix_vec

So the name for this doesn't seem to match existing SR-IOV naming.  It
seems like this should probably be something like sriov_vf_msix_count
in order to be closer to the actual naming of what is being dealt
with.

> +Date:          December 2020
> +Contact:       Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +               This file is associated with the SR-IOV VFs.
> +               It allows configuration of the number of MSI-X vectors for
> +               the VF. This is needed to optimize performance of newly bound
> +               devices by allocating the number of vectors based on the
> +               administrator knowledge of targeted VM.
> +
> +               The values accepted are:
> +                * > 0 - this will be number reported by the VF's MSI-X
> +                        capability
> +                * < 0 - not valid
> +                * = 0 - will reset to the device default value
> +
> +               The file is writable if the PF is bound to a driver that
> +               supports the ->sriov_set_msix_vec_count() callback and there
> +               is no driver bound to the VF.
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4afd4ee4f7f0..42c0df4158d1 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>         return (dev->devfn + dev->sriov->offset +
>                 dev->sriov->stride * vf_id) & 0xff;
>  }
> +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
>
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
>         .is_visible = sriov_attrs_are_visible,
>  };
>
> +#ifdef CONFIG_PCI_MSI
> +static ssize_t vf_msix_vec_show(struct device *dev,
> +                               struct device_attribute *attr, char *buf)
> +{
> +       struct pci_dev *pdev = to_pci_dev(dev);
> +       int numb = pci_msix_vec_count(pdev);
> +       struct pci_dev *pfdev;
> +
> +       if (numb < 0)
> +               return numb;
> +
> +       pfdev = pci_physfn(pdev);
> +       if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
> +               return -EOPNOTSUPP;
> +

This doesn't make sense to me. You are getting the vector count for
the PCI device and reporting that. Are you expecting to call this on
the PF or the VFs? It seems like this should be a PF attribute and not
be called on the individual VFs.

If you are calling this on the VFs then it doesn't really make any
sense anyway since the VF is not a "VF PCI dev representor" and
shouldn't be treated as such. In my opinion if we are going to be
doing per-port resource limiting that is something that might make
more sense as a part of the devlink configuration for the VF since the
actual change won't be visible to an assigned device.

> +       return sprintf(buf, "%d\n", numb);
> +}
> +
> +static ssize_t vf_msix_vec_store(struct device *dev,
> +                                struct device_attribute *attr, const char *buf,
> +                                size_t count)
> +{
> +       struct pci_dev *vf_dev = to_pci_dev(dev);
> +       int val, ret;
> +
> +       ret = kstrtoint(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +
> +       ret = pci_set_msix_vec_count(vf_dev, val);
> +       if (ret)
> +               return ret;
> +
> +       return count;
> +}
> +static DEVICE_ATTR_RW(vf_msix_vec);
> +#endif
> +
> +static struct attribute *sriov_vf_dev_attrs[] = {
> +#ifdef CONFIG_PCI_MSI
> +       &dev_attr_vf_msix_vec.attr,
> +#endif
> +       NULL,
> +};
> +
> +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> +                                         struct attribute *a, int n)
> +{
> +       struct device *dev = kobj_to_dev(kobj);
> +
> +       if (dev_is_pf(dev))
> +               return 0;
> +
> +       return a->mode;
> +}
> +
> +const struct attribute_group sriov_vf_dev_attr_group = {
> +       .attrs = sriov_vf_dev_attrs,
> +       .is_visible = sriov_vf_attrs_are_visible,
> +};
> +
>  int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>  {
>         return 0;
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 3162f88fe940..20705ca94666 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -991,6 +991,35 @@ int pci_msix_vec_count(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_msix_vec_count);
>
> +/**
> + * pci_set_msix_vec_count - change the reported number of MSI-X vectors
> + * This function is applicable for SR-IOV VF because such devices allocate
> + * their MSI-X table before they will run on the VMs and HW can't guess the
> + * right number of vectors, so the HW allocates them statically and equally.
> + * @dev: VF device that is going to be changed
> + * @numb: amount of MSI-X vectors
> + **/
> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> +{
> +       struct pci_dev *pdev = pci_physfn(dev);
> +
> +       if (!dev->msix_cap || !pdev->msix_cap)
> +               return -EINVAL;
> +
> +       if (dev->driver || !pdev->driver ||
> +           !pdev->driver->sriov_set_msix_vec_count)
> +               return -EOPNOTSUPP;
> +
> +       if (numb < 0)
> +               /*
> +                * We don't support negative numbers for now,
> +                * but maybe in the future it will make sense.
> +                */
> +               return -EINVAL;
> +
> +       return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> +}
> +

If you are going to have a set operation for this it would make sense
to have a get operation. Your show operation seems unbalanced since
you are expecting to call it on the VF directly which just seems
wrong.

>  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>                              int nvec, struct irq_affinity *affd, int flags)
>  {
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index fb072f4b3176..0af2222643c2 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
>         &pci_dev_hp_attr_group,
>  #ifdef CONFIG_PCI_IOV
>         &sriov_dev_attr_group,
> +       &sriov_vf_dev_attr_group,
>  #endif
>         &pci_bridge_attr_group,
>         &pcie_dev_attr_group,
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 5c59365092fa..1fd273077637 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
>
>  #ifdef CONFIG_PCI_MSI
>  void pci_no_msi(void);
> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
>  #else
>  static inline void pci_no_msi(void) { }
>  #endif
> @@ -502,6 +503,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
>  void pci_restore_iov_state(struct pci_dev *dev);
>  int pci_iov_bus_range(struct pci_bus *bus);
>  extern const struct attribute_group sriov_dev_attr_group;
> +extern const struct attribute_group sriov_vf_dev_attr_group;
>  #else
>  static inline int pci_iov_init(struct pci_dev *dev)
>  {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index b32126d26997..a17cfc28eb66 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -856,6 +856,8 @@ struct module;
>   *             e.g. drivers/net/e100.c.
>   * @sriov_configure: Optional driver callback to allow configuration of
>   *             number of VFs to enable via sysfs "sriov_numvfs" file.
> + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> + *              exposed by the sysfs "vf_msix_vec" entry.

Hopefully it is doing more than just changing the displayed sysfs
value. What is the effect of changing that value on the actual system
state? I'm assuming this is some limit that is enforced by the PF or
the device firmware?


>   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
>   * @groups:    Sysfs attribute groups.
>   * @driver:    Driver model structure.
> @@ -871,6 +873,7 @@ struct pci_driver {
>         int  (*resume)(struct pci_dev *dev);    /* Device woken up */
>         void (*shutdown)(struct pci_dev *dev);
>         int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> +       int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>         const struct pci_error_handlers *err_handler;
>         const struct attribute_group **groups;
>         struct device_driver    driver;
> @@ -2057,7 +2060,6 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>
>  #ifdef CONFIG_PCI_IOV
>  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
> -int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
>
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
> @@ -2402,6 +2404,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
>  void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
>  #endif
>
> +#ifdef CONFIG_PCI_IOV
> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
> +#endif
> +
>  /* Provide the legacy pci_dma_* API */
>  #include <linux/pci-dma-compat.h>
>
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-10 15:07 ` [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors Leon Romanovsky
@ 2021-01-11 19:30   ` Alexander Duyck
  2021-01-12  6:56     ` Leon Romanovsky
  2021-01-18 17:03   ` Greg KH
  1 sibling, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-11 19:30 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson

On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Some SR-IOV capable devices provide an ability to configure specific
> number of MSI-X vectors on their VF prior driver is probed on that VF.
>
> In order to make management easy, provide new read-only sysfs file that
> returns a total number of possible to configure MSI-X vectors.
>
> cat /sys/bus/pci/devices/.../sriov_vf_total_msix
>   = 0 - feature is not supported
>   > 0 - total number of MSI-X vectors to consume by the VFs
>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
>  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
>  drivers/pci/pci.h                       |  3 +++
>  include/linux/pci.h                     |  2 ++
>  4 files changed, 50 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 05e26e5da54e..64e9b700acc9 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -395,3 +395,17 @@ Description:
>                 The file is writable if the PF is bound to a driver that
>                 supports the ->sriov_set_msix_vec_count() callback and there
>                 is no driver bound to the VF.
> +
> +What:          /sys/bus/pci/devices/.../sriov_vf_total_msix

In this case I would drop the "vf" and just go with sriov_total_msix
since now you are referring to a global value instead of a per VF
value.

> +Date:          January 2021
> +Contact:       Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +               This file is associated with the SR-IOV PFs.
> +               It returns a total number of possible to configure MSI-X
> +               vectors on the enabled VFs.
> +
> +               The values returned are:
> +                * > 0 - this will be total number possible to consume by VFs,
> +                * = 0 - feature is not supported
> +
> +               If no SR-IOV VFs are enabled, this value will return 0.
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 42c0df4158d1..0a6ddf3230fd 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
>         return count;
>  }
>
> +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> +                                       struct device_attribute *attr,
> +                                       char *buf)
> +{
> +       struct pci_dev *pdev = to_pci_dev(dev);
> +
> +       return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> +}
> +

You display it as a signed value, but unsigned values are not
supported, correct?

>  static DEVICE_ATTR_RO(sriov_totalvfs);
>  static DEVICE_ATTR_RW(sriov_numvfs);
>  static DEVICE_ATTR_RO(sriov_offset);
>  static DEVICE_ATTR_RO(sriov_stride);
>  static DEVICE_ATTR_RO(sriov_vf_device);
>  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> +static DEVICE_ATTR_RO(sriov_vf_total_msix);
>
>  static struct attribute *sriov_dev_attrs[] = {
>         &dev_attr_sriov_totalvfs.attr,
> @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
>         &dev_attr_sriov_stride.attr,
>         &dev_attr_sriov_vf_device.attr,
>         &dev_attr_sriov_drivers_autoprobe.attr,
> +       &dev_attr_sriov_vf_total_msix.attr,
>         NULL,
>  };
>
> @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
>                 sysfs_remove_link(&dev->dev.kobj, "dep_link");
>
>         iov->num_VFs = 0;
> +       iov->vf_total_msix = 0;
>         pci_iov_set_numvfs(dev, 0);
>  }
>
> @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
>
> +/**
> + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> + * @dev: the PCI PF device
> + * @numb: the total number of MSI-X vector to consume by the VFs
> + *
> + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> + * This interface is complimentary part of the pci_set_msix_vec_count()
> + * that will be used to configure the required number on the VF.
> + */
> +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
> +{
> +       if (!dev->is_physfn || !dev->driver ||
> +           !dev->driver->sriov_set_msix_vec_count)
> +               return;
> +
> +       dev->sriov->vf_total_msix = numb;
> +}
> +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> +

This seems broken. What validation is being done on the numb value?
You pass it as int, and your documentation all refers to tests for >=
0, but isn't a signed input a possibility as well? Also "numb" doesn't
make for a good abbreviation as it is already a word of its own. It
might make more sense to use count or something like that rather than
trying to abbreviate number.


>  /**
>   * pci_sriov_configure_simple - helper to configure SR-IOV
>   * @dev: the PCI device
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 1fd273077637..0fbe291eb0f2 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -327,6 +327,9 @@ struct pci_sriov {
>         u16             subsystem_device; /* VF subsystem device */
>         resource_size_t barsz[PCI_SRIOV_NUM_BARS];      /* VF BAR size */
>         bool            drivers_autoprobe; /* Auto probing of VFs by driver */
> +       int             vf_total_msix;  /* Total number of MSI-X vectors the VFs
> +                                        * can consume
> +                                        */
>  };
>
>  /**
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index a17cfc28eb66..fd9ff1f42a09 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2074,6 +2074,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
>  int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
>  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb);
>
>  /* Arch may override these (weak) */
>  int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
> @@ -2114,6 +2115,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
>  { return 0; }
>  static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
> +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb) {}
>  #endif
>
>  #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-11 19:30   ` Alexander Duyck
@ 2021-01-12  3:25     ` Don Dutile
  2021-01-12  6:15       ` Leon Romanovsky
  2021-01-12  6:39     ` Leon Romanovsky
  1 sibling, 1 reply; 49+ messages in thread
From: Don Dutile @ 2021-01-12  3:25 UTC (permalink / raw)
  To: Alexander Duyck, Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Alex Williamson

On 1/11/21 2:30 PM, Alexander Duyck wrote:
> On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
>> From: Leon Romanovsky <leonro@nvidia.com>
>>
>> Extend PCI sysfs interface with a new callback that allows configure
>> the number of MSI-X vectors for specific SR-IO VF. This is needed
>> to optimize the performance of newly bound devices by allocating
>> the number of vectors based on the administrator knowledge of targeted VM.
>>
>> This function is applicable for SR-IOV VF because such devices allocate
>> their MSI-X table before they will run on the VMs and HW can't guess the
>> right number of vectors, so the HW allocates them statically and equally.
>>
>> The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
>> for the VFs and it is writable as long as a driver is not bounded to the VF.
>>
>> The values accepted are:
>>   * > 0 - this will be number reported by the VF's MSI-X capability
>>   * < 0 - not valid
>>   * = 0 - will reset to the device default value
>>
>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>> ---
>>   Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
>>   drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
>>   drivers/pci/msi.c                       | 29 ++++++++++++
>>   drivers/pci/pci-sysfs.c                 |  1 +
>>   drivers/pci/pci.h                       |  2 +
>>   include/linux/pci.h                     |  8 +++-
>>   6 files changed, 121 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
>> index 25c9c39770c6..05e26e5da54e 100644
>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>> @@ -375,3 +375,23 @@ Description:
>>                  The value comes from the PCI kernel device state and can be one
>>                  of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>>                  The file is read only.
>> +
>> +What:          /sys/bus/pci/devices/.../vf_msix_vec
> So the name for this doesn't seem to match existing SR-IOV naming.  It
> seems like this should probably be something like sriov_vf_msix_count
> in order to be closer to the actual naming of what is being dealt
> with.
>
>> +Date:          December 2020
>> +Contact:       Leon Romanovsky <leonro@nvidia.com>
>> +Description:
>> +               This file is associated with the SR-IOV VFs.
>> +               It allows configuration of the number of MSI-X vectors for
>> +               the VF. This is needed to optimize performance of newly bound
>> +               devices by allocating the number of vectors based on the
>> +               administrator knowledge of targeted VM.
>> +
>> +               The values accepted are:
>> +                * > 0 - this will be number reported by the VF's MSI-X
>> +                        capability
>> +                * < 0 - not valid
>> +                * = 0 - will reset to the device default value
>> +
>> +               The file is writable if the PF is bound to a driver that
>> +               supports the ->sriov_set_msix_vec_count() callback and there
>> +               is no driver bound to the VF.
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index 4afd4ee4f7f0..42c0df4158d1 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>>          return (dev->devfn + dev->sriov->offset +
>>                  dev->sriov->stride * vf_id) & 0xff;
>>   }
>> +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
>>
>>   /*
>>    * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
>> @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
>>          .is_visible = sriov_attrs_are_visible,
>>   };
>>
>> +#ifdef CONFIG_PCI_MSI
>> +static ssize_t vf_msix_vec_show(struct device *dev,
>> +                               struct device_attribute *attr, char *buf)
>> +{
>> +       struct pci_dev *pdev = to_pci_dev(dev);
>> +       int numb = pci_msix_vec_count(pdev);
>> +       struct pci_dev *pfdev;
>> +
>> +       if (numb < 0)
>> +               return numb;
>> +
>> +       pfdev = pci_physfn(pdev);
>> +       if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
>> +               return -EOPNOTSUPP;
>> +
> This doesn't make sense to me. You are getting the vector count for
> the PCI device and reporting that. Are you expecting to call this on
> the PF or the VFs? It seems like this should be a PF attribute and not
> be called on the individual VFs.
>
> If you are calling this on the VFs then it doesn't really make any
> sense anyway since the VF is not a "VF PCI dev representor" and
> shouldn't be treated as such. In my opinion if we are going to be
> doing per-port resource limiting that is something that might make
> more sense as a part of the devlink configuration for the VF since the
> actual change won't be visible to an assigned device.
if the op were just limited to nic ports, devlink may be used; but I believe Leon is trying to handle it from an sriov/vf perspective for other non-nic devices as well,
e.g., ib ports, nvme vf's (which don't have a port concept at all).

>> +       return sprintf(buf, "%d\n", numb);
>> +}
>> +
>> +static ssize_t vf_msix_vec_store(struct device *dev,
>> +                                struct device_attribute *attr, const char *buf,
>> +                                size_t count)
>> +{
>> +       struct pci_dev *vf_dev = to_pci_dev(dev);
>> +       int val, ret;
>> +
>> +       ret = kstrtoint(buf, 0, &val);
>> +       if (ret)
>> +               return ret;
>> +
>> +       ret = pci_set_msix_vec_count(vf_dev, val);
>> +       if (ret)
>> +               return ret;
>> +
>> +       return count;
>> +}
>> +static DEVICE_ATTR_RW(vf_msix_vec);
>> +#endif
>> +
>> +static struct attribute *sriov_vf_dev_attrs[] = {
>> +#ifdef CONFIG_PCI_MSI
>> +       &dev_attr_vf_msix_vec.attr,
>> +#endif
>> +       NULL,
>> +};
>> +
>> +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
>> +                                         struct attribute *a, int n)
>> +{
>> +       struct device *dev = kobj_to_dev(kobj);
>> +
>> +       if (dev_is_pf(dev))
>> +               return 0;
>> +
>> +       return a->mode;
>> +}
>> +
>> +const struct attribute_group sriov_vf_dev_attr_group = {
>> +       .attrs = sriov_vf_dev_attrs,
>> +       .is_visible = sriov_vf_attrs_are_visible,
>> +};
>> +
>>   int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>>   {
>>          return 0;
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index 3162f88fe940..20705ca94666 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -991,6 +991,35 @@ int pci_msix_vec_count(struct pci_dev *dev)
>>   }
>>   EXPORT_SYMBOL(pci_msix_vec_count);
>>
>> +/**
>> + * pci_set_msix_vec_count - change the reported number of MSI-X vectors
>> + * This function is applicable for SR-IOV VF because such devices allocate
>> + * their MSI-X table before they will run on the VMs and HW can't guess the
>> + * right number of vectors, so the HW allocates them statically and equally.
>> + * @dev: VF device that is going to be changed
>> + * @numb: amount of MSI-X vectors
>> + **/
>> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
>> +{
>> +       struct pci_dev *pdev = pci_physfn(dev);
>> +
>> +       if (!dev->msix_cap || !pdev->msix_cap)
>> +               return -EINVAL;
>> +
>> +       if (dev->driver || !pdev->driver ||
>> +           !pdev->driver->sriov_set_msix_vec_count)
>> +               return -EOPNOTSUPP;
>> +
>> +       if (numb < 0)
>> +               /*
>> +                * We don't support negative numbers for now,
>> +                * but maybe in the future it will make sense.
>> +                */
>> +               return -EINVAL;
>> +
>> +       return pdev->driver->sriov_set_msix_vec_count(dev, numb);
>> +}
>> +
> If you are going to have a set operation for this it would make sense
> to have a get operation. Your show operation seems unbalanced since
> you are expecting to call it on the VF directly which just seems
> wrong.
>
>>   static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>>                               int nvec, struct irq_affinity *affd, int flags)
>>   {
>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> index fb072f4b3176..0af2222643c2 100644
>> --- a/drivers/pci/pci-sysfs.c
>> +++ b/drivers/pci/pci-sysfs.c
>> @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
>>          &pci_dev_hp_attr_group,
>>   #ifdef CONFIG_PCI_IOV
>>          &sriov_dev_attr_group,
>> +       &sriov_vf_dev_attr_group,
>>   #endif
>>          &pci_bridge_attr_group,
>>          &pcie_dev_attr_group,
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 5c59365092fa..1fd273077637 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
>>
>>   #ifdef CONFIG_PCI_MSI
>>   void pci_no_msi(void);
>> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
>>   #else
>>   static inline void pci_no_msi(void) { }
>>   #endif
>> @@ -502,6 +503,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
>>   void pci_restore_iov_state(struct pci_dev *dev);
>>   int pci_iov_bus_range(struct pci_bus *bus);
>>   extern const struct attribute_group sriov_dev_attr_group;
>> +extern const struct attribute_group sriov_vf_dev_attr_group;
>>   #else
>>   static inline int pci_iov_init(struct pci_dev *dev)
>>   {
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index b32126d26997..a17cfc28eb66 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -856,6 +856,8 @@ struct module;
>>    *             e.g. drivers/net/e100.c.
>>    * @sriov_configure: Optional driver callback to allow configuration of
>>    *             number of VFs to enable via sysfs "sriov_numvfs" file.
>> + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
>> + *              exposed by the sysfs "vf_msix_vec" entry.
> Hopefully it is doing more than just changing the displayed sysfs
> value. What is the effect of changing that value on the actual system
> state? I'm assuming this is some limit that is enforced by the PF or
> the device firmware?
>
>
>>    * @err_handler: See Documentation/PCI/pci-error-recovery.rst
>>    * @groups:    Sysfs attribute groups.
>>    * @driver:    Driver model structure.
>> @@ -871,6 +873,7 @@ struct pci_driver {
>>          int  (*resume)(struct pci_dev *dev);    /* Device woken up */
>>          void (*shutdown)(struct pci_dev *dev);
>>          int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
>> +       int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>>          const struct pci_error_handlers *err_handler;
>>          const struct attribute_group **groups;
>>          struct device_driver    driver;
>> @@ -2057,7 +2060,6 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>>
>>   #ifdef CONFIG_PCI_IOV
>>   int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>> -int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
>>
>>   int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>>   void pci_disable_sriov(struct pci_dev *dev);
>> @@ -2402,6 +2404,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
>>   void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
>>   #endif
>>
>> +#ifdef CONFIG_PCI_IOV
>> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
>> +#endif
>> +
>>   /* Provide the legacy pci_dma_* API */
>>   #include <linux/pci-dma-compat.h>
>>
>> --
>> 2.29.2
>>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-12  3:25     ` Don Dutile
@ 2021-01-12  6:15       ` Leon Romanovsky
  0 siblings, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-12  6:15 UTC (permalink / raw)
  To: Don Dutile, Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Alex Williamson

On Mon, Jan 11, 2021 at 10:25:42PM -0500, Don Dutile wrote:
> On 1/11/21 2:30 PM, Alexander Duyck wrote:
> > On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > Extend PCI sysfs interface with a new callback that allows configure
> > > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > > to optimize the performance of newly bound devices by allocating
> > > the number of vectors based on the administrator knowledge of targeted VM.
> > >
> > > This function is applicable for SR-IOV VF because such devices allocate
> > > their MSI-X table before they will run on the VMs and HW can't guess the
> > > right number of vectors, so the HW allocates them statically and equally.
> > >
> > > The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> > > for the VFs and it is writable as long as a driver is not bounded to the VF.
> > >
> > > The values accepted are:
> > >   * > 0 - this will be number reported by the VF's MSI-X capability
> > >   * < 0 - not valid
> > >   * = 0 - will reset to the device default value
> > >
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >   Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
> > >   drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
> > >   drivers/pci/msi.c                       | 29 ++++++++++++
> > >   drivers/pci/pci-sysfs.c                 |  1 +
> > >   drivers/pci/pci.h                       |  2 +
> > >   include/linux/pci.h                     |  8 +++-
> > >   6 files changed, 121 insertions(+), 1 deletion(-)

<...>

> > > +
> > This doesn't make sense to me. You are getting the vector count for
> > the PCI device and reporting that. Are you expecting to call this on
> > the PF or the VFs? It seems like this should be a PF attribute and not
> > be called on the individual VFs.
> >
> > If you are calling this on the VFs then it doesn't really make any
> > sense anyway since the VF is not a "VF PCI dev representor" and
> > shouldn't be treated as such. In my opinion if we are going to be
> > doing per-port resource limiting that is something that might make
> > more sense as a part of the devlink configuration for the VF since the
> > actual change won't be visible to an assigned device.
> if the op were just limited to nic ports, devlink may be used; but I believe Leon is trying to handle it from an sriov/vf perspective for other non-nic devices as well,
> e.g., ib ports, nvme vf's (which don't have a port concept at all).

Right, the SR-IOV VFs are common entities outside of nic/devlink world.
In addition to the netdev world, SR-IOV is used for crypto, storage, FPGA
and IB devices.

From what I see, There are three possible ways to configure MSI-X vector count:
1. PCI device is on the same CPU - regular server/desktop as we know it.
2. PCI device is on remote CPU - SmartNIC use case, the CPUs are
connected through eswitch model.
3. Some direct interface - DEVX for the mlx5_ib.

This implementation handles item #1 for all devices without any exception.
From my point of view, the majority of interested users of this feature
will use this option.

The second item should be solved differently (with devlink) when configuring
eswitch port, but it is orthogonal to the item #1 and I will do it after.

The third item is not managed by the kernel, so not relevant for our discussion.

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-11 19:30   ` Alexander Duyck
  2021-01-12  3:25     ` Don Dutile
@ 2021-01-12  6:39     ` Leon Romanovsky
  2021-01-12 21:59       ` Alexander Duyck
  1 sibling, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-12  6:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Mon, Jan 11, 2021 at 11:30:33AM -0800, Alexander Duyck wrote:
> On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Extend PCI sysfs interface with a new callback that allows configure
> > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > to optimize the performance of newly bound devices by allocating
> > the number of vectors based on the administrator knowledge of targeted VM.
> >
> > This function is applicable for SR-IOV VF because such devices allocate
> > their MSI-X table before they will run on the VMs and HW can't guess the
> > right number of vectors, so the HW allocates them statically and equally.
> >
> > The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> > for the VFs and it is writable as long as a driver is not bounded to the VF.
> >
> > The values accepted are:
> >  * > 0 - this will be number reported by the VF's MSI-X capability
> >  * < 0 - not valid
> >  * = 0 - will reset to the device default value
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
> >  drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
> >  drivers/pci/msi.c                       | 29 ++++++++++++
> >  drivers/pci/pci-sysfs.c                 |  1 +
> >  drivers/pci/pci.h                       |  2 +
> >  include/linux/pci.h                     |  8 +++-
> >  6 files changed, 121 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > index 25c9c39770c6..05e26e5da54e 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > @@ -375,3 +375,23 @@ Description:
> >                 The value comes from the PCI kernel device state and can be one
> >                 of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> >                 The file is read only.
> > +
> > +What:          /sys/bus/pci/devices/.../vf_msix_vec
>
> So the name for this doesn't seem to match existing SR-IOV naming.  It
> seems like this should probably be something like sriov_vf_msix_count
> in order to be closer to the actual naming of what is being dealt
> with.

I'm open for suggestions. I didn't use sriov_vf_msix_count because it
seems too long for me.

>
> > +Date:          December 2020
> > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > +Description:
> > +               This file is associated with the SR-IOV VFs.
> > +               It allows configuration of the number of MSI-X vectors for
> > +               the VF. This is needed to optimize performance of newly bound
> > +               devices by allocating the number of vectors based on the
> > +               administrator knowledge of targeted VM.
> > +
> > +               The values accepted are:
> > +                * > 0 - this will be number reported by the VF's MSI-X
> > +                        capability
> > +                * < 0 - not valid
> > +                * = 0 - will reset to the device default value
> > +
> > +               The file is writable if the PF is bound to a driver that
> > +               supports the ->sriov_set_msix_vec_count() callback and there
> > +               is no driver bound to the VF.
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 4afd4ee4f7f0..42c0df4158d1 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> >         return (dev->devfn + dev->sriov->offset +
> >                 dev->sriov->stride * vf_id) & 0xff;
> >  }
> > +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
> >
> >  /*
> >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> > @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
> >         .is_visible = sriov_attrs_are_visible,
> >  };
> >
> > +#ifdef CONFIG_PCI_MSI
> > +static ssize_t vf_msix_vec_show(struct device *dev,
> > +                               struct device_attribute *attr, char *buf)
> > +{
> > +       struct pci_dev *pdev = to_pci_dev(dev);
> > +       int numb = pci_msix_vec_count(pdev);
> > +       struct pci_dev *pfdev;
> > +
> > +       if (numb < 0)
> > +               return numb;
> > +
> > +       pfdev = pci_physfn(pdev);
> > +       if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
> > +               return -EOPNOTSUPP;
> > +
>
> This doesn't make sense to me. You are getting the vector count for
> the PCI device and reporting that. Are you expecting to call this on
> the PF or the VFs? It seems like this should be a PF attribute and not
> be called on the individual VFs.

We had this discussion over v0 variant of this series.
https://lore.kernel.org/linux-pci/20210108072525.GB31158@unreal/

This is per-VF property, but this VF is not bounded to the driver so you
need some way to convey new number to the HW, so it will update PCI value.

You must change/update this field after VF is created, because all SR-IOV VFs
are created at the same time. The operator (administrator/orchestration
software/e.t.c) will know the right amount of MSI-X vectors right before
he will bind this VF to requested VM.

It means that extending PF sysfs to get both VF index and count will
look very unfriendly for the users.

The PF here is an anchor to the relevant driver.

>
> If you are calling this on the VFs then it doesn't really make any
> sense anyway since the VF is not a "VF PCI dev representor" and
> shouldn't be treated as such. In my opinion if we are going to be
> doing per-port resource limiting that is something that might make
> more sense as a part of the devlink configuration for the VF since the
> actual change won't be visible to an assigned device.

https://lore.kernel.org/linux-pci/20210112061535.GB4678@unreal/

>
> > +       return sprintf(buf, "%d\n", numb);
> > +}
> > +
> > +static ssize_t vf_msix_vec_store(struct device *dev,
> > +                                struct device_attribute *attr, const char *buf,
> > +                                size_t count)
> > +{
> > +       struct pci_dev *vf_dev = to_pci_dev(dev);
> > +       int val, ret;
> > +
> > +       ret = kstrtoint(buf, 0, &val);
> > +       if (ret)
> > +               return ret;
> > +
> > +       ret = pci_set_msix_vec_count(vf_dev, val);
> > +       if (ret)
> > +               return ret;
> > +
> > +       return count;
> > +}
> > +static DEVICE_ATTR_RW(vf_msix_vec);
> > +#endif
> > +
> > +static struct attribute *sriov_vf_dev_attrs[] = {
> > +#ifdef CONFIG_PCI_MSI
> > +       &dev_attr_vf_msix_vec.attr,
> > +#endif
> > +       NULL,
> > +};
> > +
> > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> > +                                         struct attribute *a, int n)
> > +{
> > +       struct device *dev = kobj_to_dev(kobj);
> > +
> > +       if (dev_is_pf(dev))
> > +               return 0;
> > +
> > +       return a->mode;
> > +}
> > +
> > +const struct attribute_group sriov_vf_dev_attr_group = {
> > +       .attrs = sriov_vf_dev_attrs,
> > +       .is_visible = sriov_vf_attrs_are_visible,
> > +};
> > +
> >  int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
> >  {
> >         return 0;
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 3162f88fe940..20705ca94666 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -991,6 +991,35 @@ int pci_msix_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msix_vec_count);
> >
> > +/**
> > + * pci_set_msix_vec_count - change the reported number of MSI-X vectors
> > + * This function is applicable for SR-IOV VF because such devices allocate
> > + * their MSI-X table before they will run on the VMs and HW can't guess the
> > + * right number of vectors, so the HW allocates them statically and equally.
> > + * @dev: VF device that is going to be changed
> > + * @numb: amount of MSI-X vectors
> > + **/
> > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > +{
> > +       struct pci_dev *pdev = pci_physfn(dev);
> > +
> > +       if (!dev->msix_cap || !pdev->msix_cap)
> > +               return -EINVAL;
> > +
> > +       if (dev->driver || !pdev->driver ||
> > +           !pdev->driver->sriov_set_msix_vec_count)
> > +               return -EOPNOTSUPP;
> > +
> > +       if (numb < 0)
> > +               /*
> > +                * We don't support negative numbers for now,
> > +                * but maybe in the future it will make sense.
> > +                */
> > +               return -EINVAL;
> > +
> > +       return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> > +}
> > +
>
> If you are going to have a set operation for this it would make sense
> to have a get operation. Your show operation seems unbalanced since
> you are expecting to call it on the VF directly which just seems
> wrong.

There is already get operator - pci_msix_vec_count().
The same as above, PF is an anchor for the driver. VF doesn't have
driver yet and we can't write directly to this PCI field - it is read-only.

>
> >  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> >                              int nvec, struct irq_affinity *affd, int flags)
> >  {
> > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > index fb072f4b3176..0af2222643c2 100644
> > --- a/drivers/pci/pci-sysfs.c
> > +++ b/drivers/pci/pci-sysfs.c
> > @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
> >         &pci_dev_hp_attr_group,
> >  #ifdef CONFIG_PCI_IOV
> >         &sriov_dev_attr_group,
> > +       &sriov_vf_dev_attr_group,
> >  #endif
> >         &pci_bridge_attr_group,
> >         &pcie_dev_attr_group,
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 5c59365092fa..1fd273077637 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
> >
> >  #ifdef CONFIG_PCI_MSI
> >  void pci_no_msi(void);
> > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
> >  #else
> >  static inline void pci_no_msi(void) { }
> >  #endif
> > @@ -502,6 +503,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
> >  void pci_restore_iov_state(struct pci_dev *dev);
> >  int pci_iov_bus_range(struct pci_bus *bus);
> >  extern const struct attribute_group sriov_dev_attr_group;
> > +extern const struct attribute_group sriov_vf_dev_attr_group;
> >  #else
> >  static inline int pci_iov_init(struct pci_dev *dev)
> >  {
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index b32126d26997..a17cfc28eb66 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -856,6 +856,8 @@ struct module;
> >   *             e.g. drivers/net/e100.c.
> >   * @sriov_configure: Optional driver callback to allow configuration of
> >   *             number of VFs to enable via sysfs "sriov_numvfs" file.
> > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > + *              exposed by the sysfs "vf_msix_vec" entry.
>
> Hopefully it is doing more than just changing the displayed sysfs
> value. What is the effect of changing that value on the actual system
> state? I'm assuming this is some limit that is enforced by the PF or
> the device firmware?

The whole purpose of this interface it to change read-only field and to
prevent any possible complications, this field is possible to change
only for unprobed device.

The limit checks are done in FW as it is the entity that manages SR-IOV PCI
config space. This is true for mlx5, other drivers can do their checks in the SW.

>
>
> >   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
> >   * @groups:    Sysfs attribute groups.
> >   * @driver:    Driver model structure.
> > @@ -871,6 +873,7 @@ struct pci_driver {
> >         int  (*resume)(struct pci_dev *dev);    /* Device woken up */
> >         void (*shutdown)(struct pci_dev *dev);
> >         int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> > +       int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
> >         const struct pci_error_handlers *err_handler;
> >         const struct attribute_group **groups;
> >         struct device_driver    driver;
> > @@ -2057,7 +2060,6 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
> >
> >  #ifdef CONFIG_PCI_IOV
> >  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
> > -int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> >
> >  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> >  void pci_disable_sriov(struct pci_dev *dev);
> > @@ -2402,6 +2404,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
> >  void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
> >  #endif
> >
> > +#ifdef CONFIG_PCI_IOV
> > +int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
> > +#endif
> > +
> >  /* Provide the legacy pci_dma_* API */
> >  #include <linux/pci-dma-compat.h>
> >
> > --
> > 2.29.2
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-11 19:30   ` Alexander Duyck
@ 2021-01-12  6:56     ` Leon Romanovsky
  2021-01-12 21:34       ` Alexander Duyck
  0 siblings, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-12  6:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
> On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Some SR-IOV capable devices provide an ability to configure specific
> > number of MSI-X vectors on their VF prior driver is probed on that VF.
> >
> > In order to make management easy, provide new read-only sysfs file that
> > returns a total number of possible to configure MSI-X vectors.
> >
> > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> >   = 0 - feature is not supported
> >   > 0 - total number of MSI-X vectors to consume by the VFs
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
> >  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
> >  drivers/pci/pci.h                       |  3 +++
> >  include/linux/pci.h                     |  2 ++
> >  4 files changed, 50 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > index 05e26e5da54e..64e9b700acc9 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > @@ -395,3 +395,17 @@ Description:
> >                 The file is writable if the PF is bound to a driver that
> >                 supports the ->sriov_set_msix_vec_count() callback and there
> >                 is no driver bound to the VF.
> > +
> > +What:          /sys/bus/pci/devices/.../sriov_vf_total_msix
>
> In this case I would drop the "vf" and just go with sriov_total_msix
> since now you are referring to a global value instead of a per VF
> value.

This field indicates the amount of MSI-X available for VFs, it doesn't
include PFs. The missing "_vf_" will mislead users who will believe that
it is all MSI-X vectors available for this device. They will need to take
into consideration amount of PF MSI-X in order to calculate the VF distribution.

So I would leave "_vf_" here.

>
> > +Date:          January 2021
> > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > +Description:
> > +               This file is associated with the SR-IOV PFs.
> > +               It returns a total number of possible to configure MSI-X
> > +               vectors on the enabled VFs.
> > +
> > +               The values returned are:
> > +                * > 0 - this will be total number possible to consume by VFs,
> > +                * = 0 - feature is not supported
> > +
> > +               If no SR-IOV VFs are enabled, this value will return 0.
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 42c0df4158d1..0a6ddf3230fd 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
> >         return count;
> >  }
> >
> > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > +                                       struct device_attribute *attr,
> > +                                       char *buf)
> > +{
> > +       struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +       return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> > +}
> > +
>
> You display it as a signed value, but unsigned values are not
> supported, correct?

Right, I made it similar to the vf_msix_set. I can change.

>
> >  static DEVICE_ATTR_RO(sriov_totalvfs);
> >  static DEVICE_ATTR_RW(sriov_numvfs);
> >  static DEVICE_ATTR_RO(sriov_offset);
> >  static DEVICE_ATTR_RO(sriov_stride);
> >  static DEVICE_ATTR_RO(sriov_vf_device);
> >  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> >
> >  static struct attribute *sriov_dev_attrs[] = {
> >         &dev_attr_sriov_totalvfs.attr,
> > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
> >         &dev_attr_sriov_stride.attr,
> >         &dev_attr_sriov_vf_device.attr,
> >         &dev_attr_sriov_drivers_autoprobe.attr,
> > +       &dev_attr_sriov_vf_total_msix.attr,
> >         NULL,
> >  };
> >
> > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
> >                 sysfs_remove_link(&dev->dev.kobj, "dep_link");
> >
> >         iov->num_VFs = 0;
> > +       iov->vf_total_msix = 0;
> >         pci_iov_set_numvfs(dev, 0);
> >  }
> >
> > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> >
> > +/**
> > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > + * @dev: the PCI PF device
> > + * @numb: the total number of MSI-X vector to consume by the VFs
> > + *
> > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > + * This interface is complimentary part of the pci_set_msix_vec_count()
> > + * that will be used to configure the required number on the VF.
> > + */
> > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
> > +{
> > +       if (!dev->is_physfn || !dev->driver ||
> > +           !dev->driver->sriov_set_msix_vec_count)
> > +               return;
> > +
> > +       dev->sriov->vf_total_msix = numb;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > +
>
> This seems broken. What validation is being done on the numb value?
> You pass it as int, and your documentation all refers to tests for >=
> 0, but isn't a signed input a possibility as well? Also "numb" doesn't
> make for a good abbreviation as it is already a word of its own. It
> might make more sense to use count or something like that rather than
> trying to abbreviate number.

"Broken" is a nice word to describe misunderstanding.

The vf_total_msix is not set by the users and used solely by the drivers
to advertise their capability. This field is needed to give a way to
calculate how much MSI-X VFs can get. The driver code is part of the
kernel and like any other kernel code, it is trusted.

I'm checking < 0 in another _set_ routine to make sure that we will be
able to extend this sysfs entry if at some point of time negative vector
count will make sense.

"Count" instead of "numb" is fine by me.

>
>
> >  /**
> >   * pci_sriov_configure_simple - helper to configure SR-IOV
> >   * @dev: the PCI device
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 1fd273077637..0fbe291eb0f2 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -327,6 +327,9 @@ struct pci_sriov {
> >         u16             subsystem_device; /* VF subsystem device */
> >         resource_size_t barsz[PCI_SRIOV_NUM_BARS];      /* VF BAR size */
> >         bool            drivers_autoprobe; /* Auto probing of VFs by driver */
> > +       int             vf_total_msix;  /* Total number of MSI-X vectors the VFs
> > +                                        * can consume
> > +                                        */
> >  };
> >
> >  /**
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index a17cfc28eb66..fd9ff1f42a09 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -2074,6 +2074,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
> >  int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
> >  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
> >  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb);
> >
> >  /* Arch may override these (weak) */
> >  int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
> > @@ -2114,6 +2115,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
> >  static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
> >  { return 0; }
> >  static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
> > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb) {}
> >  #endif
> >
> >  #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
> > --
> > 2.29.2
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-12  6:56     ` Leon Romanovsky
@ 2021-01-12 21:34       ` Alexander Duyck
  2021-01-13  6:19         ` Leon Romanovsky
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-12 21:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
> > On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > Some SR-IOV capable devices provide an ability to configure specific
> > > number of MSI-X vectors on their VF prior driver is probed on that VF.
> > >
> > > In order to make management easy, provide new read-only sysfs file that
> > > returns a total number of possible to configure MSI-X vectors.
> > >
> > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > >   = 0 - feature is not supported
> > >   > 0 - total number of MSI-X vectors to consume by the VFs
> > >
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
> > >  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
> > >  drivers/pci/pci.h                       |  3 +++
> > >  include/linux/pci.h                     |  2 ++
> > >  4 files changed, 50 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > index 05e26e5da54e..64e9b700acc9 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > @@ -395,3 +395,17 @@ Description:
> > >                 The file is writable if the PF is bound to a driver that
> > >                 supports the ->sriov_set_msix_vec_count() callback and there
> > >                 is no driver bound to the VF.
> > > +
> > > +What:          /sys/bus/pci/devices/.../sriov_vf_total_msix
> >
> > In this case I would drop the "vf" and just go with sriov_total_msix
> > since now you are referring to a global value instead of a per VF
> > value.
>
> This field indicates the amount of MSI-X available for VFs, it doesn't
> include PFs. The missing "_vf_" will mislead users who will believe that
> it is all MSI-X vectors available for this device. They will need to take
> into consideration amount of PF MSI-X in order to calculate the VF distribution.
>
> So I would leave "_vf_" here.

The problem is you aren't indicating how many are available for an
individual VF though, you are indicating how many are available for
use by SR-IOV to give to the VFs. The fact that you are dealing with a
pool makes things confusing in my opinion. For example sriov_vf_device
describes the device ID that will be given to each VF.

> >
> > > +Date:          January 2021
> > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > +Description:
> > > +               This file is associated with the SR-IOV PFs.
> > > +               It returns a total number of possible to configure MSI-X
> > > +               vectors on the enabled VFs.
> > > +
> > > +               The values returned are:
> > > +                * > 0 - this will be total number possible to consume by VFs,
> > > +                * = 0 - feature is not supported
> > > +
> > > +               If no SR-IOV VFs are enabled, this value will return 0.
> > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > index 42c0df4158d1..0a6ddf3230fd 100644
> > > --- a/drivers/pci/iov.c
> > > +++ b/drivers/pci/iov.c
> > > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
> > >         return count;
> > >  }
> > >
> > > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > > +                                       struct device_attribute *attr,
> > > +                                       char *buf)
> > > +{
> > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > +
> > > +       return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> > > +}
> > > +
> >
> > You display it as a signed value, but unsigned values are not
> > supported, correct?
>
> Right, I made it similar to the vf_msix_set. I can change.
>
> >
> > >  static DEVICE_ATTR_RO(sriov_totalvfs);
> > >  static DEVICE_ATTR_RW(sriov_numvfs);
> > >  static DEVICE_ATTR_RO(sriov_offset);
> > >  static DEVICE_ATTR_RO(sriov_stride);
> > >  static DEVICE_ATTR_RO(sriov_vf_device);
> > >  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> > > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > >
> > >  static struct attribute *sriov_dev_attrs[] = {
> > >         &dev_attr_sriov_totalvfs.attr,
> > > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
> > >         &dev_attr_sriov_stride.attr,
> > >         &dev_attr_sriov_vf_device.attr,
> > >         &dev_attr_sriov_drivers_autoprobe.attr,
> > > +       &dev_attr_sriov_vf_total_msix.attr,
> > >         NULL,
> > >  };
> > >
> > > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
> > >                 sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > >
> > >         iov->num_VFs = 0;
> > > +       iov->vf_total_msix = 0;
> > >         pci_iov_set_numvfs(dev, 0);
> > >  }
> > >
> > > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> > >  }
> > >  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> > >
> > > +/**
> > > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > > + * @dev: the PCI PF device
> > > + * @numb: the total number of MSI-X vector to consume by the VFs
> > > + *
> > > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > > + * This interface is complimentary part of the pci_set_msix_vec_count()
> > > + * that will be used to configure the required number on the VF.
> > > + */
> > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
> > > +{
> > > +       if (!dev->is_physfn || !dev->driver ||
> > > +           !dev->driver->sriov_set_msix_vec_count)
> > > +               return;
> > > +
> > > +       dev->sriov->vf_total_msix = numb;
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > > +
> >
> > This seems broken. What validation is being done on the numb value?
> > You pass it as int, and your documentation all refers to tests for >=
> > 0, but isn't a signed input a possibility as well? Also "numb" doesn't
> > make for a good abbreviation as it is already a word of its own. It
> > might make more sense to use count or something like that rather than
> > trying to abbreviate number.
>
> "Broken" is a nice word to describe misunderstanding.

Would you prefer "lacking input validation".

I see all this code in there checking for is_physfn and driver and
sriov_set_msix_vec_count before allowing the setting of vf_total_msix.
It just seems like a lot of validation is taking place on the wrong
things if you are just going to be setting a value reporting the total
number of MSI-X vectors in use for SR-IOV.

In addition this value seems like a custom purpose being pushed into
the PCIe code since there isn't anything that defaults the value. It
seems like at a minimum there should be something that programs a
default value for both of these new fields that are being added so
that you pull the maximum number of VFs when SR-IOV is enabled, the
maximum number of MSI-X vectors from a single VF, and then the default
value for this should be the multiple of the two which can then be
overridden later.

> The vf_total_msix is not set by the users and used solely by the drivers
> to advertise their capability. This field is needed to give a way to
> calculate how much MSI-X VFs can get. The driver code is part of the
> kernel and like any other kernel code, it is trusted.
>
> I'm checking < 0 in another _set_ routine to make sure that we will be
> able to extend this sysfs entry if at some point of time negative vector
> count will make sense.

I would rather have a strict interface that doesn't allow for
unintended flexibility. Out-of-tree drivers tend to exploit that kind
of stuff and it is problematic when it can occur.

> "Count" instead of "numb" is fine by me.
> >
> >
> > >  /**
> > >   * pci_sriov_configure_simple - helper to configure SR-IOV
> > >   * @dev: the PCI device
> > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > > index 1fd273077637..0fbe291eb0f2 100644
> > > --- a/drivers/pci/pci.h
> > > +++ b/drivers/pci/pci.h
> > > @@ -327,6 +327,9 @@ struct pci_sriov {
> > >         u16             subsystem_device; /* VF subsystem device */
> > >         resource_size_t barsz[PCI_SRIOV_NUM_BARS];      /* VF BAR size */
> > >         bool            drivers_autoprobe; /* Auto probing of VFs by driver */
> > > +       int             vf_total_msix;  /* Total number of MSI-X vectors the VFs
> > > +                                        * can consume
> > > +                                        */
> > >  };
> > >
> > >  /**
> > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > index a17cfc28eb66..fd9ff1f42a09 100644
> > > --- a/include/linux/pci.h
> > > +++ b/include/linux/pci.h
> > > @@ -2074,6 +2074,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
> > >  int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
> > >  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
> > >  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb);
> > >
> > >  /* Arch may override these (weak) */
> > >  int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
> > > @@ -2114,6 +2115,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
> > >  static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
> > >  { return 0; }
> > >  static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
> > > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb) {}
> > >  #endif
> > >
> > >  #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
> > > --
> > > 2.29.2
> > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-12  6:39     ` Leon Romanovsky
@ 2021-01-12 21:59       ` Alexander Duyck
  2021-01-13  6:09         ` Leon Romanovsky
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-12 21:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Mon, Jan 11, 2021 at 10:39 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Jan 11, 2021 at 11:30:33AM -0800, Alexander Duyck wrote:
> > On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > Extend PCI sysfs interface with a new callback that allows configure
> > > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > > to optimize the performance of newly bound devices by allocating
> > > the number of vectors based on the administrator knowledge of targeted VM.
> > >
> > > This function is applicable for SR-IOV VF because such devices allocate
> > > their MSI-X table before they will run on the VMs and HW can't guess the
> > > right number of vectors, so the HW allocates them statically and equally.
> > >
> > > The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> > > for the VFs and it is writable as long as a driver is not bounded to the VF.
> > >
> > > The values accepted are:
> > >  * > 0 - this will be number reported by the VF's MSI-X capability
> > >  * < 0 - not valid
> > >  * = 0 - will reset to the device default value
> > >
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
> > >  drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
> > >  drivers/pci/msi.c                       | 29 ++++++++++++
> > >  drivers/pci/pci-sysfs.c                 |  1 +
> > >  drivers/pci/pci.h                       |  2 +
> > >  include/linux/pci.h                     |  8 +++-
> > >  6 files changed, 121 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > index 25c9c39770c6..05e26e5da54e 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > @@ -375,3 +375,23 @@ Description:
> > >                 The value comes from the PCI kernel device state and can be one
> > >                 of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> > >                 The file is read only.
> > > +
> > > +What:          /sys/bus/pci/devices/.../vf_msix_vec
> >
> > So the name for this doesn't seem to match existing SR-IOV naming.  It
> > seems like this should probably be something like sriov_vf_msix_count
> > in order to be closer to the actual naming of what is being dealt
> > with.
>
> I'm open for suggestions. I didn't use sriov_vf_msix_count because it
> seems too long for me.
>
> >
> > > +Date:          December 2020
> > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > +Description:
> > > +               This file is associated with the SR-IOV VFs.
> > > +               It allows configuration of the number of MSI-X vectors for
> > > +               the VF. This is needed to optimize performance of newly bound
> > > +               devices by allocating the number of vectors based on the
> > > +               administrator knowledge of targeted VM.
> > > +
> > > +               The values accepted are:
> > > +                * > 0 - this will be number reported by the VF's MSI-X
> > > +                        capability
> > > +                * < 0 - not valid
> > > +                * = 0 - will reset to the device default value
> > > +
> > > +               The file is writable if the PF is bound to a driver that
> > > +               supports the ->sriov_set_msix_vec_count() callback and there
> > > +               is no driver bound to the VF.
> > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > index 4afd4ee4f7f0..42c0df4158d1 100644
> > > --- a/drivers/pci/iov.c
> > > +++ b/drivers/pci/iov.c
> > > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> > >         return (dev->devfn + dev->sriov->offset +
> > >                 dev->sriov->stride * vf_id) & 0xff;
> > >  }
> > > +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
> > >
> > >  /*
> > >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> > > @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
> > >         .is_visible = sriov_attrs_are_visible,
> > >  };
> > >
> > > +#ifdef CONFIG_PCI_MSI
> > > +static ssize_t vf_msix_vec_show(struct device *dev,
> > > +                               struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > +       int numb = pci_msix_vec_count(pdev);
> > > +       struct pci_dev *pfdev;
> > > +
> > > +       if (numb < 0)
> > > +               return numb;
> > > +
> > > +       pfdev = pci_physfn(pdev);
> > > +       if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
> > > +               return -EOPNOTSUPP;
> > > +
> >
> > This doesn't make sense to me. You are getting the vector count for
> > the PCI device and reporting that. Are you expecting to call this on
> > the PF or the VFs? It seems like this should be a PF attribute and not
> > be called on the individual VFs.
>
> We had this discussion over v0 variant of this series.
> https://lore.kernel.org/linux-pci/20210108072525.GB31158@unreal/
>
> This is per-VF property, but this VF is not bounded to the driver so you
> need some way to convey new number to the HW, so it will update PCI value.
>
> You must change/update this field after VF is created, because all SR-IOV VFs
> are created at the same time. The operator (administrator/orchestration
> software/e.t.c) will know the right amount of MSI-X vectors right before
> he will bind this VF to requested VM.
>
> It means that extending PF sysfs to get both VF index and count will
> look very unfriendly for the users.
>
> The PF here is an anchor to the relevant driver.

Yes, but the problem is you are attempting to do it after a driver may
have already bound itself to the VF device. This setup works for the
direct assigned VF case, however if the VF drivers are running on the
host then this gets ugly as the driver may already be up and running.

Also I am not a big fan of the VF groping around looking for a PF
interface as it means the interface will likely be exposed in the
guest as well, but it just won't work.

> >
> > If you are calling this on the VFs then it doesn't really make any
> > sense anyway since the VF is not a "VF PCI dev representor" and
> > shouldn't be treated as such. In my opinion if we are going to be
> > doing per-port resource limiting that is something that might make
> > more sense as a part of the devlink configuration for the VF since the
> > actual change won't be visible to an assigned device.
>
> https://lore.kernel.org/linux-pci/20210112061535.GB4678@unreal/

So the question I would have is if we are spawning the VFs and
expecting them to have different configs or the same configuration?
I'm assuming in your case you are looking for a different
configuration per port. Do I have that correct?

Where this gets ugly is that SR-IOV assumes a certain uniformity per
VF so doing a per-VF custom limitation gets ugly pretty quick. I
wonder if it would make more sense if we are going this route to just
define a device-tree like schema that could be fed in to enable VFs
instead of just using echo X > sriov_numvfs and then trying to fix
things afterwards. Then you could define this and other features that
I am sure you would need in the future via json-schema like is done in
device-tree and write it once enabling the set of VFs that you need.

Given that SR-IOV isn't meant to be tweaked dynamically it seems like
that would be a much better fit for most users as then you can just
modify the schema and reload it which would probably be less effort
than having to manually redo all the commands needed to setup the VFs
using this approach if you are having to manually update each VF.

> >
> > > +       return sprintf(buf, "%d\n", numb);
> > > +}
> > > +
> > > +static ssize_t vf_msix_vec_store(struct device *dev,
> > > +                                struct device_attribute *attr, const char *buf,
> > > +                                size_t count)
> > > +{
> > > +       struct pci_dev *vf_dev = to_pci_dev(dev);
> > > +       int val, ret;
> > > +
> > > +       ret = kstrtoint(buf, 0, &val);
> > > +       if (ret)
> > > +               return ret;
> > > +
> > > +       ret = pci_set_msix_vec_count(vf_dev, val);
> > > +       if (ret)
> > > +               return ret;
> > > +
> > > +       return count;
> > > +}
> > > +static DEVICE_ATTR_RW(vf_msix_vec);
> > > +#endif
> > > +
> > > +static struct attribute *sriov_vf_dev_attrs[] = {
> > > +#ifdef CONFIG_PCI_MSI
> > > +       &dev_attr_vf_msix_vec.attr,
> > > +#endif
> > > +       NULL,
> > > +};
> > > +
> > > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> > > +                                         struct attribute *a, int n)
> > > +{
> > > +       struct device *dev = kobj_to_dev(kobj);
> > > +
> > > +       if (dev_is_pf(dev))
> > > +               return 0;
> > > +
> > > +       return a->mode;
> > > +}
> > > +
> > > +const struct attribute_group sriov_vf_dev_attr_group = {
> > > +       .attrs = sriov_vf_dev_attrs,
> > > +       .is_visible = sriov_vf_attrs_are_visible,
> > > +};
> > > +
> > >  int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
> > >  {
> > >         return 0;
> > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > > index 3162f88fe940..20705ca94666 100644
> > > --- a/drivers/pci/msi.c
> > > +++ b/drivers/pci/msi.c
> > > @@ -991,6 +991,35 @@ int pci_msix_vec_count(struct pci_dev *dev)
> > >  }
> > >  EXPORT_SYMBOL(pci_msix_vec_count);
> > >
> > > +/**
> > > + * pci_set_msix_vec_count - change the reported number of MSI-X vectors
> > > + * This function is applicable for SR-IOV VF because such devices allocate
> > > + * their MSI-X table before they will run on the VMs and HW can't guess the
> > > + * right number of vectors, so the HW allocates them statically and equally.
> > > + * @dev: VF device that is going to be changed
> > > + * @numb: amount of MSI-X vectors
> > > + **/
> > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > > +{
> > > +       struct pci_dev *pdev = pci_physfn(dev);
> > > +
> > > +       if (!dev->msix_cap || !pdev->msix_cap)
> > > +               return -EINVAL;
> > > +
> > > +       if (dev->driver || !pdev->driver ||
> > > +           !pdev->driver->sriov_set_msix_vec_count)
> > > +               return -EOPNOTSUPP;
> > > +
> > > +       if (numb < 0)
> > > +               /*
> > > +                * We don't support negative numbers for now,
> > > +                * but maybe in the future it will make sense.
> > > +                */
> > > +               return -EINVAL;
> > > +
> > > +       return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> > > +}
> > > +
> >
> > If you are going to have a set operation for this it would make sense
> > to have a get operation. Your show operation seems unbalanced since
> > you are expecting to call it on the VF directly which just seems
> > wrong.
>
> There is already get operator - pci_msix_vec_count().
> The same as above, PF is an anchor for the driver. VF doesn't have
> driver yet and we can't write directly to this PCI field - it is read-only.

That returns the maximum. I would want to know what the value is that
you wrote here.

Also being able to change this once the device already exists is kind
of an ugly setup as I would imagine there will be cases where the
device has already partitioned out the vectors that it wanted.

> >
> > >  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> > >                              int nvec, struct irq_affinity *affd, int flags)
> > >  {
> > > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > > index fb072f4b3176..0af2222643c2 100644
> > > --- a/drivers/pci/pci-sysfs.c
> > > +++ b/drivers/pci/pci-sysfs.c
> > > @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
> > >         &pci_dev_hp_attr_group,
> > >  #ifdef CONFIG_PCI_IOV
> > >         &sriov_dev_attr_group,
> > > +       &sriov_vf_dev_attr_group,
> > >  #endif
> > >         &pci_bridge_attr_group,
> > >         &pcie_dev_attr_group,
> > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > > index 5c59365092fa..1fd273077637 100644
> > > --- a/drivers/pci/pci.h
> > > +++ b/drivers/pci/pci.h
> > > @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
> > >
> > >  #ifdef CONFIG_PCI_MSI
> > >  void pci_no_msi(void);
> > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
> > >  #else
> > >  static inline void pci_no_msi(void) { }
> > >  #endif
> > > @@ -502,6 +503,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
> > >  void pci_restore_iov_state(struct pci_dev *dev);
> > >  int pci_iov_bus_range(struct pci_bus *bus);
> > >  extern const struct attribute_group sriov_dev_attr_group;
> > > +extern const struct attribute_group sriov_vf_dev_attr_group;
> > >  #else
> > >  static inline int pci_iov_init(struct pci_dev *dev)
> > >  {
> > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > index b32126d26997..a17cfc28eb66 100644
> > > --- a/include/linux/pci.h
> > > +++ b/include/linux/pci.h
> > > @@ -856,6 +856,8 @@ struct module;
> > >   *             e.g. drivers/net/e100.c.
> > >   * @sriov_configure: Optional driver callback to allow configuration of
> > >   *             number of VFs to enable via sysfs "sriov_numvfs" file.
> > > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > > + *              exposed by the sysfs "vf_msix_vec" entry.
> >
> > Hopefully it is doing more than just changing the displayed sysfs
> > value. What is the effect of changing that value on the actual system
> > state? I'm assuming this is some limit that is enforced by the PF or
> > the device firmware?
>
> The whole purpose of this interface it to change read-only field and to
> prevent any possible complications, this field is possible to change
> only for unprobed device.

Calling this out somewhere and documenting that might be useful.
Otherwise it is very easy to overlook the absence of a "!" when
reviewing a set of patches such as these.

> The limit checks are done in FW as it is the entity that manages SR-IOV PCI
> config space. This is true for mlx5, other drivers can do their checks in the SW.

I get that. The problem is this is pushing configuration into firmware
that may not be visible to the kernel  or software. Thus why I was
asking how you would display the value once it is written.

> >
> >
> > >   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
> > >   * @groups:    Sysfs attribute groups.
> > >   * @driver:    Driver model structure.
> > > @@ -871,6 +873,7 @@ struct pci_driver {
> > >         int  (*resume)(struct pci_dev *dev);    /* Device woken up */
> > >         void (*shutdown)(struct pci_dev *dev);
> > >         int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> > > +       int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
> > >         const struct pci_error_handlers *err_handler;
> > >         const struct attribute_group **groups;
> > >         struct device_driver    driver;
> > > @@ -2057,7 +2060,6 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
> > >
> > >  #ifdef CONFIG_PCI_IOV
> > >  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
> > > -int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> > >
> > >  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> > >  void pci_disable_sriov(struct pci_dev *dev);
> > > @@ -2402,6 +2404,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
> > >  void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
> > >  #endif
> > >
> > > +#ifdef CONFIG_PCI_IOV
> > > +int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
> > > +#endif
> > > +
> > >  /* Provide the legacy pci_dma_* API */
> > >  #include <linux/pci-dma-compat.h>
> > >
> > > --
> > > 2.29.2
> > >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-12 21:59       ` Alexander Duyck
@ 2021-01-13  6:09         ` Leon Romanovsky
  2021-01-13 20:00           ` Alexander Duyck
  0 siblings, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-13  6:09 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Tue, Jan 12, 2021 at 01:59:51PM -0800, Alexander Duyck wrote:
> On Mon, Jan 11, 2021 at 10:39 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Mon, Jan 11, 2021 at 11:30:33AM -0800, Alexander Duyck wrote:
> > > On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > Extend PCI sysfs interface with a new callback that allows configure
> > > > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > > > to optimize the performance of newly bound devices by allocating
> > > > the number of vectors based on the administrator knowledge of targeted VM.
> > > >
> > > > This function is applicable for SR-IOV VF because such devices allocate
> > > > their MSI-X table before they will run on the VMs and HW can't guess the
> > > > right number of vectors, so the HW allocates them statically and equally.
> > > >
> > > > The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> > > > for the VFs and it is writable as long as a driver is not bounded to the VF.
> > > >
> > > > The values accepted are:
> > > >  * > 0 - this will be number reported by the VF's MSI-X capability
> > > >  * < 0 - not valid
> > > >  * = 0 - will reset to the device default value
> > > >
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
> > > >  drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
> > > >  drivers/pci/msi.c                       | 29 ++++++++++++
> > > >  drivers/pci/pci-sysfs.c                 |  1 +
> > > >  drivers/pci/pci.h                       |  2 +
> > > >  include/linux/pci.h                     |  8 +++-
> > > >  6 files changed, 121 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > index 25c9c39770c6..05e26e5da54e 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > @@ -375,3 +375,23 @@ Description:
> > > >                 The value comes from the PCI kernel device state and can be one
> > > >                 of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> > > >                 The file is read only.
> > > > +
> > > > +What:          /sys/bus/pci/devices/.../vf_msix_vec
> > >
> > > So the name for this doesn't seem to match existing SR-IOV naming.  It
> > > seems like this should probably be something like sriov_vf_msix_count
> > > in order to be closer to the actual naming of what is being dealt
> > > with.
> >
> > I'm open for suggestions. I didn't use sriov_vf_msix_count because it
> > seems too long for me.
> >
> > >
> > > > +Date:          December 2020
> > > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > > +Description:
> > > > +               This file is associated with the SR-IOV VFs.
> > > > +               It allows configuration of the number of MSI-X vectors for
> > > > +               the VF. This is needed to optimize performance of newly bound
> > > > +               devices by allocating the number of vectors based on the
> > > > +               administrator knowledge of targeted VM.
> > > > +
> > > > +               The values accepted are:
> > > > +                * > 0 - this will be number reported by the VF's MSI-X
> > > > +                        capability
> > > > +                * < 0 - not valid
> > > > +                * = 0 - will reset to the device default value
> > > > +
> > > > +               The file is writable if the PF is bound to a driver that
> > > > +               supports the ->sriov_set_msix_vec_count() callback and there
> > > > +               is no driver bound to the VF.
> > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > index 4afd4ee4f7f0..42c0df4158d1 100644
> > > > --- a/drivers/pci/iov.c
> > > > +++ b/drivers/pci/iov.c
> > > > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> > > >         return (dev->devfn + dev->sriov->offset +
> > > >                 dev->sriov->stride * vf_id) & 0xff;
> > > >  }
> > > > +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
> > > >
> > > >  /*
> > > >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> > > > @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
> > > >         .is_visible = sriov_attrs_are_visible,
> > > >  };
> > > >
> > > > +#ifdef CONFIG_PCI_MSI
> > > > +static ssize_t vf_msix_vec_show(struct device *dev,
> > > > +                               struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > > +       int numb = pci_msix_vec_count(pdev);
> > > > +       struct pci_dev *pfdev;
> > > > +
> > > > +       if (numb < 0)
> > > > +               return numb;
> > > > +
> > > > +       pfdev = pci_physfn(pdev);
> > > > +       if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
> > > > +               return -EOPNOTSUPP;
> > > > +
> > >
> > > This doesn't make sense to me. You are getting the vector count for
> > > the PCI device and reporting that. Are you expecting to call this on
> > > the PF or the VFs? It seems like this should be a PF attribute and not
> > > be called on the individual VFs.
> >
> > We had this discussion over v0 variant of this series.
> > https://lore.kernel.org/linux-pci/20210108072525.GB31158@unreal/
> >
> > This is per-VF property, but this VF is not bounded to the driver so you
> > need some way to convey new number to the HW, so it will update PCI value.
> >
> > You must change/update this field after VF is created, because all SR-IOV VFs
> > are created at the same time. The operator (administrator/orchestration
> > software/e.t.c) will know the right amount of MSI-X vectors right before
> > he will bind this VF to requested VM.
> >
> > It means that extending PF sysfs to get both VF index and count will
> > look very unfriendly for the users.
> >
> > The PF here is an anchor to the relevant driver.
>
> Yes, but the problem is you are attempting to do it after a driver may
> have already bound itself to the VF device. This setup works for the
> direct assigned VF case, however if the VF drivers are running on the
> host then this gets ugly as the driver may already be up and running.

Please take a look on the pci_set_msix_vec_count() implementation, it
checks that VF is not probed yet. I outlined this requirement almost
in all my responses and commit messages.

So no, it is not possible to set MSI-X vectors to already bound device.

>
> Also I am not a big fan of the VF groping around looking for a PF
> interface as it means the interface will likely be exposed in the
> guest as well, but it just won't work.

If you are referring to VF exposed to the VM, so in this case VF must be
bound too vfio driver, or any other driver, and won't allow MSI-X change.
If you are referring to PF exposed to the VM, it is very unlikely scenario
in real world and reserved for braves among us. Even in this case, the
set MSI-X won't work, because PF will be connected to the hypervisor driver
that doesn't support set_msix.

So both cases are handled.

>
> > >
> > > If you are calling this on the VFs then it doesn't really make any
> > > sense anyway since the VF is not a "VF PCI dev representor" and
> > > shouldn't be treated as such. In my opinion if we are going to be
> > > doing per-port resource limiting that is something that might make
> > > more sense as a part of the devlink configuration for the VF since the
> > > actual change won't be visible to an assigned device.
> >
> > https://lore.kernel.org/linux-pci/20210112061535.GB4678@unreal/
>
> So the question I would have is if we are spawning the VFs and
> expecting them to have different configs or the same configuration?

By default, they have same configuration.

> I'm assuming in your case you are looking for a different
> configuration per port. Do I have that correct?

No, per-VF as represents one device in the PCI world. For example, mlx5
can have more than one physical port.

>
> Where this gets ugly is that SR-IOV assumes a certain uniformity per
> VF so doing a per-VF custom limitation gets ugly pretty quick.

I don't find any support for this "uniformity" claim in the PCI spec.


> I wonder if it would make more sense if we are going this route to just
> define a device-tree like schema that could be fed in to enable VFs
> instead of just using echo X > sriov_numvfs and then trying to fix
> things afterwards. Then you could define this and other features that
> I am sure you would need in the future via json-schema like is done in
> device-tree and write it once enabling the set of VFs that you need.

Sorry, but this is overkill, it won't give us much and it doesn't fit
the VF usage model at all.

Right now, all heavy users of SR-IOV are creating many VFs up to the maximum.
They do it with autoprobe disabled, because it is too time consuming to wait
till all VFs probe themselves and unbind them later.

After that, they wait for incoming request to provision VM on VF, they set MAC
address, change MSI-X according to VM properties and bind that VF to new VM.

So MSI-X change is done after VFs were created.

>
> Given that SR-IOV isn't meant to be tweaked dynamically it seems like
> that would be a much better fit for most users as then you can just
> modify the schema and reload it which would probably be less effort
> than having to manually redo all the commands needed to setup the VFs
> using this approach if you are having to manually update each VF.

Quite opposite it true. First users use orchestration software to manage it.
Second, you will need to take care of already bound device.

>
> > >
> > > > +       return sprintf(buf, "%d\n", numb);
> > > > +}
> > > > +
> > > > +static ssize_t vf_msix_vec_store(struct device *dev,
> > > > +                                struct device_attribute *attr, const char *buf,
> > > > +                                size_t count)
> > > > +{
> > > > +       struct pci_dev *vf_dev = to_pci_dev(dev);
> > > > +       int val, ret;
> > > > +
> > > > +       ret = kstrtoint(buf, 0, &val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +
> > > > +       ret = pci_set_msix_vec_count(vf_dev, val);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +
> > > > +       return count;
> > > > +}
> > > > +static DEVICE_ATTR_RW(vf_msix_vec);
> > > > +#endif
> > > > +
> > > > +static struct attribute *sriov_vf_dev_attrs[] = {
> > > > +#ifdef CONFIG_PCI_MSI
> > > > +       &dev_attr_vf_msix_vec.attr,
> > > > +#endif
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> > > > +                                         struct attribute *a, int n)
> > > > +{
> > > > +       struct device *dev = kobj_to_dev(kobj);
> > > > +
> > > > +       if (dev_is_pf(dev))
> > > > +               return 0;
> > > > +
> > > > +       return a->mode;
> > > > +}
> > > > +
> > > > +const struct attribute_group sriov_vf_dev_attr_group = {
> > > > +       .attrs = sriov_vf_dev_attrs,
> > > > +       .is_visible = sriov_vf_attrs_are_visible,
> > > > +};
> > > > +
> > > >  int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
> > > >  {
> > > >         return 0;
> > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > > > index 3162f88fe940..20705ca94666 100644
> > > > --- a/drivers/pci/msi.c
> > > > +++ b/drivers/pci/msi.c
> > > > @@ -991,6 +991,35 @@ int pci_msix_vec_count(struct pci_dev *dev)
> > > >  }
> > > >  EXPORT_SYMBOL(pci_msix_vec_count);
> > > >
> > > > +/**
> > > > + * pci_set_msix_vec_count - change the reported number of MSI-X vectors
> > > > + * This function is applicable for SR-IOV VF because such devices allocate
> > > > + * their MSI-X table before they will run on the VMs and HW can't guess the
> > > > + * right number of vectors, so the HW allocates them statically and equally.
> > > > + * @dev: VF device that is going to be changed
> > > > + * @numb: amount of MSI-X vectors
> > > > + **/
> > > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > > > +{
> > > > +       struct pci_dev *pdev = pci_physfn(dev);
> > > > +
> > > > +       if (!dev->msix_cap || !pdev->msix_cap)
> > > > +               return -EINVAL;
> > > > +
> > > > +       if (dev->driver || !pdev->driver ||
> > > > +           !pdev->driver->sriov_set_msix_vec_count)
> > > > +               return -EOPNOTSUPP;
> > > > +
> > > > +       if (numb < 0)
> > > > +               /*
> > > > +                * We don't support negative numbers for now,
> > > > +                * but maybe in the future it will make sense.
> > > > +                */
> > > > +               return -EINVAL;
> > > > +
> > > > +       return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> > > > +}
> > > > +
> > >
> > > If you are going to have a set operation for this it would make sense
> > > to have a get operation. Your show operation seems unbalanced since
> > > you are expecting to call it on the VF directly which just seems
> > > wrong.
> >
> > There is already get operator - pci_msix_vec_count().
> > The same as above, PF is an anchor for the driver. VF doesn't have
> > driver yet and we can't write directly to this PCI field - it is read-only.
>
> That returns the maximum. I would want to know what the value is that
> you wrote here.

It returns the value from the device and not maximum.

>
> Also being able to change this once the device already exists is kind
> of an ugly setup as I would imagine there will be cases where the
> device has already partitioned out the vectors that it wanted.

It is not possible, because all VFs starts with some value. Normal FW
won't allow you to decrease vector count below certain limit.

So VFs are always operable.

>
> > >
> > > >  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> > > >                              int nvec, struct irq_affinity *affd, int flags)
> > > >  {
> > > > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > > > index fb072f4b3176..0af2222643c2 100644
> > > > --- a/drivers/pci/pci-sysfs.c
> > > > +++ b/drivers/pci/pci-sysfs.c
> > > > @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
> > > >         &pci_dev_hp_attr_group,
> > > >  #ifdef CONFIG_PCI_IOV
> > > >         &sriov_dev_attr_group,
> > > > +       &sriov_vf_dev_attr_group,
> > > >  #endif
> > > >         &pci_bridge_attr_group,
> > > >         &pcie_dev_attr_group,
> > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > > > index 5c59365092fa..1fd273077637 100644
> > > > --- a/drivers/pci/pci.h
> > > > +++ b/drivers/pci/pci.h
> > > > @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
> > > >
> > > >  #ifdef CONFIG_PCI_MSI
> > > >  void pci_no_msi(void);
> > > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
> > > >  #else
> > > >  static inline void pci_no_msi(void) { }
> > > >  #endif
> > > > @@ -502,6 +503,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
> > > >  void pci_restore_iov_state(struct pci_dev *dev);
> > > >  int pci_iov_bus_range(struct pci_bus *bus);
> > > >  extern const struct attribute_group sriov_dev_attr_group;
> > > > +extern const struct attribute_group sriov_vf_dev_attr_group;
> > > >  #else
> > > >  static inline int pci_iov_init(struct pci_dev *dev)
> > > >  {
> > > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > > index b32126d26997..a17cfc28eb66 100644
> > > > --- a/include/linux/pci.h
> > > > +++ b/include/linux/pci.h
> > > > @@ -856,6 +856,8 @@ struct module;
> > > >   *             e.g. drivers/net/e100.c.
> > > >   * @sriov_configure: Optional driver callback to allow configuration of
> > > >   *             number of VFs to enable via sysfs "sriov_numvfs" file.
> > > > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > > > + *              exposed by the sysfs "vf_msix_vec" entry.
> > >
> > > Hopefully it is doing more than just changing the displayed sysfs
> > > value. What is the effect of changing that value on the actual system
> > > state? I'm assuming this is some limit that is enforced by the PF or
> > > the device firmware?
> >
> > The whole purpose of this interface it to change read-only field and to
> > prevent any possible complications, this field is possible to change
> > only for unprobed device.
>
> Calling this out somewhere and documenting that might be useful.
> Otherwise it is very easy to overlook the absence of a "!" when
> reviewing a set of patches such as these.

It is documented in the Documentation file, but I will add extra comment.

>
> > The limit checks are done in FW as it is the entity that manages SR-IOV PCI
> > config space. This is true for mlx5, other drivers can do their checks in the SW.
>
> I get that. The problem is this is pushing configuration into firmware
> that may not be visible to the kernel  or software. Thus why I was
> asking how you would display the value once it is written.

pci_msix_vec_count() and "lspci -vv ... | grep MSI-X" should give new value.

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-12 21:34       ` Alexander Duyck
@ 2021-01-13  6:19         ` Leon Romanovsky
  2021-01-13 22:44           ` Alexander Duyck
  0 siblings, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-13  6:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Tue, Jan 12, 2021 at 01:34:50PM -0800, Alexander Duyck wrote:
> On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
> > > On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > Some SR-IOV capable devices provide an ability to configure specific
> > > > number of MSI-X vectors on their VF prior driver is probed on that VF.
> > > >
> > > > In order to make management easy, provide new read-only sysfs file that
> > > > returns a total number of possible to configure MSI-X vectors.
> > > >
> > > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > >   = 0 - feature is not supported
> > > >   > 0 - total number of MSI-X vectors to consume by the VFs
> > > >
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
> > > >  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
> > > >  drivers/pci/pci.h                       |  3 +++
> > > >  include/linux/pci.h                     |  2 ++
> > > >  4 files changed, 50 insertions(+)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > index 05e26e5da54e..64e9b700acc9 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > @@ -395,3 +395,17 @@ Description:
> > > >                 The file is writable if the PF is bound to a driver that
> > > >                 supports the ->sriov_set_msix_vec_count() callback and there
> > > >                 is no driver bound to the VF.
> > > > +
> > > > +What:          /sys/bus/pci/devices/.../sriov_vf_total_msix
> > >
> > > In this case I would drop the "vf" and just go with sriov_total_msix
> > > since now you are referring to a global value instead of a per VF
> > > value.
> >
> > This field indicates the amount of MSI-X available for VFs, it doesn't
> > include PFs. The missing "_vf_" will mislead users who will believe that
> > it is all MSI-X vectors available for this device. They will need to take
> > into consideration amount of PF MSI-X in order to calculate the VF distribution.
> >
> > So I would leave "_vf_" here.
>
> The problem is you aren't indicating how many are available for an
> individual VF though, you are indicating how many are available for
> use by SR-IOV to give to the VFs. The fact that you are dealing with a
> pool makes things confusing in my opinion. For example sriov_vf_device
> describes the device ID that will be given to each VF.

sriov_vf_device is different and is implemented accordingly to the PCI
spec, 9.3.3.11 VF Device ID (Offset 1Ah)
"This field contains the Device ID that should be presented for every VF
to the SI."

It is one ID for all VFs.

>
> > >
> > > > +Date:          January 2021
> > > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > > +Description:
> > > > +               This file is associated with the SR-IOV PFs.
> > > > +               It returns a total number of possible to configure MSI-X
> > > > +               vectors on the enabled VFs.
> > > > +
> > > > +               The values returned are:
> > > > +                * > 0 - this will be total number possible to consume by VFs,
> > > > +                * = 0 - feature is not supported
> > > > +
> > > > +               If no SR-IOV VFs are enabled, this value will return 0.
> > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > index 42c0df4158d1..0a6ddf3230fd 100644
> > > > --- a/drivers/pci/iov.c
> > > > +++ b/drivers/pci/iov.c
> > > > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
> > > >         return count;
> > > >  }
> > > >
> > > > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > > > +                                       struct device_attribute *attr,
> > > > +                                       char *buf)
> > > > +{
> > > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > > +
> > > > +       return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> > > > +}
> > > > +
> > >
> > > You display it as a signed value, but unsigned values are not
> > > supported, correct?
> >
> > Right, I made it similar to the vf_msix_set. I can change.
> >
> > >
> > > >  static DEVICE_ATTR_RO(sriov_totalvfs);
> > > >  static DEVICE_ATTR_RW(sriov_numvfs);
> > > >  static DEVICE_ATTR_RO(sriov_offset);
> > > >  static DEVICE_ATTR_RO(sriov_stride);
> > > >  static DEVICE_ATTR_RO(sriov_vf_device);
> > > >  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> > > > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > > >
> > > >  static struct attribute *sriov_dev_attrs[] = {
> > > >         &dev_attr_sriov_totalvfs.attr,
> > > > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
> > > >         &dev_attr_sriov_stride.attr,
> > > >         &dev_attr_sriov_vf_device.attr,
> > > >         &dev_attr_sriov_drivers_autoprobe.attr,
> > > > +       &dev_attr_sriov_vf_total_msix.attr,
> > > >         NULL,
> > > >  };
> > > >
> > > > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
> > > >                 sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > >
> > > >         iov->num_VFs = 0;
> > > > +       iov->vf_total_msix = 0;
> > > >         pci_iov_set_numvfs(dev, 0);
> > > >  }
> > > >
> > > > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> > > >
> > > > +/**
> > > > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > > > + * @dev: the PCI PF device
> > > > + * @numb: the total number of MSI-X vector to consume by the VFs
> > > > + *
> > > > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > > > + * This interface is complimentary part of the pci_set_msix_vec_count()
> > > > + * that will be used to configure the required number on the VF.
> > > > + */
> > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
> > > > +{
> > > > +       if (!dev->is_physfn || !dev->driver ||
> > > > +           !dev->driver->sriov_set_msix_vec_count)
> > > > +               return;
> > > > +
> > > > +       dev->sriov->vf_total_msix = numb;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > > > +
> > >
> > > This seems broken. What validation is being done on the numb value?
> > > You pass it as int, and your documentation all refers to tests for >=
> > > 0, but isn't a signed input a possibility as well? Also "numb" doesn't
> > > make for a good abbreviation as it is already a word of its own. It
> > > might make more sense to use count or something like that rather than
> > > trying to abbreviate number.
> >
> > "Broken" is a nice word to describe misunderstanding.
>
> Would you prefer "lacking input validation".
>
> I see all this code in there checking for is_physfn and driver and
> sriov_set_msix_vec_count before allowing the setting of vf_total_msix.
> It just seems like a lot of validation is taking place on the wrong
> things if you are just going to be setting a value reporting the total
> number of MSI-X vectors in use for SR-IOV.

All those checks are in place to ensure that we are not overwriting the
default value, which is 0.

>
> In addition this value seems like a custom purpose being pushed into
> the PCIe code since there isn't anything that defaults the value. It
> seems like at a minimum there should be something that programs a
> default value for both of these new fields that are being added so
> that you pull the maximum number of VFs when SR-IOV is enabled, the
> maximum number of MSI-X vectors from a single VF, and then the default
> value for this should be the multiple of the two which can then be
> overridden later.

The default is 0, because most SR-IOV doesn't have proper support of
setting VF MSI-X.

Regarding the calculation, it is not correct for the mlx5. We have large
pool of MSI-X vectors, but setting small number of them. This allows us
to increase that number on specific VF without need to decrease on
others "to free" the vectors.

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-10 15:07 ` [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
  2021-01-11 19:30   ` Alexander Duyck
@ 2021-01-13 17:50   ` Alex Williamson
  2021-01-13 19:49     ` Leon Romanovsky
  1 sibling, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2021-01-13 17:50 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile

On Sun, 10 Jan 2021 17:07:23 +0200
Leon Romanovsky <leon@kernel.org> wrote:

> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Extend PCI sysfs interface with a new callback that allows configure
> the number of MSI-X vectors for specific SR-IO VF. This is needed
> to optimize the performance of newly bound devices by allocating
> the number of vectors based on the administrator knowledge of targeted VM.
> 
> This function is applicable for SR-IOV VF because such devices allocate
> their MSI-X table before they will run on the VMs and HW can't guess the
> right number of vectors, so the HW allocates them statically and equally.
> 
> The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> for the VFs and it is writable as long as a driver is not bounded to the VF.
> 
> The values accepted are:
>  * > 0 - this will be number reported by the VF's MSI-X capability
>  * < 0 - not valid
>  * = 0 - will reset to the device default value
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
>  drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
>  drivers/pci/msi.c                       | 29 ++++++++++++
>  drivers/pci/pci-sysfs.c                 |  1 +
>  drivers/pci/pci.h                       |  2 +
>  include/linux/pci.h                     |  8 +++-
>  6 files changed, 121 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 25c9c39770c6..05e26e5da54e 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -375,3 +375,23 @@ Description:
>  		The value comes from the PCI kernel device state and can be one
>  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>  		The file is read only.
> +
> +What:		/sys/bus/pci/devices/.../vf_msix_vec
> +Date:		December 2020
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with the SR-IOV VFs.
> +		It allows configuration of the number of MSI-X vectors for
> +		the VF. This is needed to optimize performance of newly bound
> +		devices by allocating the number of vectors based on the
> +		administrator knowledge of targeted VM.
> +
> +		The values accepted are:
> +		 * > 0 - this will be number reported by the VF's MSI-X
> +			 capability
> +		 * < 0 - not valid
> +		 * = 0 - will reset to the device default value
> +
> +		The file is writable if the PF is bound to a driver that
> +		supports the ->sriov_set_msix_vec_count() callback and there
> +		is no driver bound to the VF.
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4afd4ee4f7f0..42c0df4158d1 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>  	return (dev->devfn + dev->sriov->offset +
>  		dev->sriov->stride * vf_id) & 0xff;
>  }
> +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
> 
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
>  	.is_visible = sriov_attrs_are_visible,
>  };
> 
> +#ifdef CONFIG_PCI_MSI
> +static ssize_t vf_msix_vec_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	int numb = pci_msix_vec_count(pdev);
> +	struct pci_dev *pfdev;
> +
> +	if (numb < 0)
> +		return numb;
> +
> +	pfdev = pci_physfn(pdev);
> +	if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
> +		return -EOPNOTSUPP;
> +
> +	return sprintf(buf, "%d\n", numb);
> +}
> +
> +static ssize_t vf_msix_vec_store(struct device *dev,
> +				 struct device_attribute *attr, const char *buf,
> +				 size_t count)
> +{
> +	struct pci_dev *vf_dev = to_pci_dev(dev);
> +	int val, ret;
> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	ret = pci_set_msix_vec_count(vf_dev, val);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +static DEVICE_ATTR_RW(vf_msix_vec);
> +#endif
> +
> +static struct attribute *sriov_vf_dev_attrs[] = {
> +#ifdef CONFIG_PCI_MSI
> +	&dev_attr_vf_msix_vec.attr,
> +#endif
> +	NULL,
> +};
> +
> +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> +					  struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +
> +	if (dev_is_pf(dev))
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +const struct attribute_group sriov_vf_dev_attr_group = {
> +	.attrs = sriov_vf_dev_attrs,
> +	.is_visible = sriov_vf_attrs_are_visible,
> +};
> +
>  int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>  {
>  	return 0;
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 3162f88fe940..20705ca94666 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -991,6 +991,35 @@ int pci_msix_vec_count(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_msix_vec_count);
> 
> +/**
> + * pci_set_msix_vec_count - change the reported number of MSI-X vectors
> + * This function is applicable for SR-IOV VF because such devices allocate
> + * their MSI-X table before they will run on the VMs and HW can't guess the
> + * right number of vectors, so the HW allocates them statically and equally.
> + * @dev: VF device that is going to be changed
> + * @numb: amount of MSI-X vectors
> + **/
> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> +{
> +	struct pci_dev *pdev = pci_physfn(dev);
> +
> +	if (!dev->msix_cap || !pdev->msix_cap)
> +		return -EINVAL;
> +
> +	if (dev->driver || !pdev->driver ||
> +	    !pdev->driver->sriov_set_msix_vec_count)
> +		return -EOPNOTSUPP;


This seems racy, don't we need to hold device_lock on both the VF and
PF to avoid driver {un}binding races?  Does that happen implicitly
somewhere?  Thanks,

Alex

> +
> +	if (numb < 0)
> +		/*
> +		 * We don't support negative numbers for now,
> +		 * but maybe in the future it will make sense.
> +		 */
> +		return -EINVAL;
> +
> +	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> +}
> +
>  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>  			     int nvec, struct irq_affinity *affd, int flags)
>  {
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index fb072f4b3176..0af2222643c2 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
>  	&pci_dev_hp_attr_group,
>  #ifdef CONFIG_PCI_IOV
>  	&sriov_dev_attr_group,
> +	&sriov_vf_dev_attr_group,
>  #endif
>  	&pci_bridge_attr_group,
>  	&pcie_dev_attr_group,
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 5c59365092fa..1fd273077637 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
> 
>  #ifdef CONFIG_PCI_MSI
>  void pci_no_msi(void);
> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
>  #else
>  static inline void pci_no_msi(void) { }
>  #endif
> @@ -502,6 +503,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
>  void pci_restore_iov_state(struct pci_dev *dev);
>  int pci_iov_bus_range(struct pci_bus *bus);
>  extern const struct attribute_group sriov_dev_attr_group;
> +extern const struct attribute_group sriov_vf_dev_attr_group;
>  #else
>  static inline int pci_iov_init(struct pci_dev *dev)
>  {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index b32126d26997..a17cfc28eb66 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -856,6 +856,8 @@ struct module;
>   *		e.g. drivers/net/e100.c.
>   * @sriov_configure: Optional driver callback to allow configuration of
>   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> + *              exposed by the sysfs "vf_msix_vec" entry.
>   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
>   * @groups:	Sysfs attribute groups.
>   * @driver:	Driver model structure.
> @@ -871,6 +873,7 @@ struct pci_driver {
>  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>  	void (*shutdown)(struct pci_dev *dev);
>  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>  	const struct pci_error_handlers *err_handler;
>  	const struct attribute_group **groups;
>  	struct device_driver	driver;
> @@ -2057,7 +2060,6 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
> 
>  #ifdef CONFIG_PCI_IOV
>  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
> -int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> 
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
> @@ -2402,6 +2404,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
>  void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
>  #endif
> 
> +#ifdef CONFIG_PCI_IOV
> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
> +#endif
> +
>  /* Provide the legacy pci_dma_* API */
>  #include <linux/pci-dma-compat.h>
> 
> --
> 2.29.2
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-13 17:50   ` Alex Williamson
@ 2021-01-13 19:49     ` Leon Romanovsky
  0 siblings, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-13 19:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, netdev, Don Dutile

On Wed, Jan 13, 2021 at 10:50:52AM -0700, Alex Williamson wrote:
> On Sun, 10 Jan 2021 17:07:23 +0200
> Leon Romanovsky <leon@kernel.org> wrote:
>
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Extend PCI sysfs interface with a new callback that allows configure
> > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > to optimize the performance of newly bound devices by allocating
> > the number of vectors based on the administrator knowledge of targeted VM.
> >
> > This function is applicable for SR-IOV VF because such devices allocate
> > their MSI-X table before they will run on the VMs and HW can't guess the
> > right number of vectors, so the HW allocates them statically and equally.
> >
> > The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> > for the VFs and it is writable as long as a driver is not bounded to the VF.
> >
> > The values accepted are:
> >  * > 0 - this will be number reported by the VF's MSI-X capability
> >  * < 0 - not valid
> >  * = 0 - will reset to the device default value
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
> >  drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
> >  drivers/pci/msi.c                       | 29 ++++++++++++
> >  drivers/pci/pci-sysfs.c                 |  1 +
> >  drivers/pci/pci.h                       |  2 +
> >  include/linux/pci.h                     |  8 +++-
> >  6 files changed, 121 insertions(+), 1 deletion(-)

<...>

> > +/**
> > + * pci_set_msix_vec_count - change the reported number of MSI-X vectors
> > + * This function is applicable for SR-IOV VF because such devices allocate
> > + * their MSI-X table before they will run on the VMs and HW can't guess the
> > + * right number of vectors, so the HW allocates them statically and equally.
> > + * @dev: VF device that is going to be changed
> > + * @numb: amount of MSI-X vectors
> > + **/
> > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > +{
> > +	struct pci_dev *pdev = pci_physfn(dev);
> > +
> > +	if (!dev->msix_cap || !pdev->msix_cap)
> > +		return -EINVAL;
> > +
> > +	if (dev->driver || !pdev->driver ||
> > +	    !pdev->driver->sriov_set_msix_vec_count)
> > +		return -EOPNOTSUPP;
>
>
> This seems racy, don't we need to hold device_lock on both the VF and
> PF to avoid driver {un}binding races?  Does that happen implicitly
> somewhere?  Thanks,

Yes, you are right absolutely, pdev and dev are not protected here.

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-13  6:09         ` Leon Romanovsky
@ 2021-01-13 20:00           ` Alexander Duyck
  2021-01-14  7:16             ` Leon Romanovsky
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-13 20:00 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Tue, Jan 12, 2021 at 10:09 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Jan 12, 2021 at 01:59:51PM -0800, Alexander Duyck wrote:
> > On Mon, Jan 11, 2021 at 10:39 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Mon, Jan 11, 2021 at 11:30:33AM -0800, Alexander Duyck wrote:
> > > > On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > Extend PCI sysfs interface with a new callback that allows configure
> > > > > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > > > > to optimize the performance of newly bound devices by allocating
> > > > > the number of vectors based on the administrator knowledge of targeted VM.
> > > > >
> > > > > This function is applicable for SR-IOV VF because such devices allocate
> > > > > their MSI-X table before they will run on the VMs and HW can't guess the
> > > > > right number of vectors, so the HW allocates them statically and equally.
> > > > >
> > > > > The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> > > > > for the VFs and it is writable as long as a driver is not bounded to the VF.
> > > > >
> > > > > The values accepted are:
> > > > >  * > 0 - this will be number reported by the VF's MSI-X capability
> > > > >  * < 0 - not valid
> > > > >  * = 0 - will reset to the device default value
> > > > >
> > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > ---
> > > > >  Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
> > > > >  drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
> > > > >  drivers/pci/msi.c                       | 29 ++++++++++++
> > > > >  drivers/pci/pci-sysfs.c                 |  1 +
> > > > >  drivers/pci/pci.h                       |  2 +
> > > > >  include/linux/pci.h                     |  8 +++-
> > > > >  6 files changed, 121 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > index 25c9c39770c6..05e26e5da54e 100644
> > > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > @@ -375,3 +375,23 @@ Description:
> > > > >                 The value comes from the PCI kernel device state and can be one
> > > > >                 of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> > > > >                 The file is read only.
> > > > > +
> > > > > +What:          /sys/bus/pci/devices/.../vf_msix_vec
> > > >
> > > > So the name for this doesn't seem to match existing SR-IOV naming.  It
> > > > seems like this should probably be something like sriov_vf_msix_count
> > > > in order to be closer to the actual naming of what is being dealt
> > > > with.
> > >
> > > I'm open for suggestions. I didn't use sriov_vf_msix_count because it
> > > seems too long for me.
> > >
> > > >
> > > > > +Date:          December 2020
> > > > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > > > +Description:
> > > > > +               This file is associated with the SR-IOV VFs.
> > > > > +               It allows configuration of the number of MSI-X vectors for
> > > > > +               the VF. This is needed to optimize performance of newly bound
> > > > > +               devices by allocating the number of vectors based on the
> > > > > +               administrator knowledge of targeted VM.
> > > > > +
> > > > > +               The values accepted are:
> > > > > +                * > 0 - this will be number reported by the VF's MSI-X
> > > > > +                        capability
> > > > > +                * < 0 - not valid
> > > > > +                * = 0 - will reset to the device default value
> > > > > +
> > > > > +               The file is writable if the PF is bound to a driver that
> > > > > +               supports the ->sriov_set_msix_vec_count() callback and there
> > > > > +               is no driver bound to the VF.
> > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > > index 4afd4ee4f7f0..42c0df4158d1 100644
> > > > > --- a/drivers/pci/iov.c
> > > > > +++ b/drivers/pci/iov.c
> > > > > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> > > > >         return (dev->devfn + dev->sriov->offset +
> > > > >                 dev->sriov->stride * vf_id) & 0xff;
> > > > >  }
> > > > > +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
> > > > >
> > > > >  /*
> > > > >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> > > > > @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
> > > > >         .is_visible = sriov_attrs_are_visible,
> > > > >  };
> > > > >
> > > > > +#ifdef CONFIG_PCI_MSI
> > > > > +static ssize_t vf_msix_vec_show(struct device *dev,
> > > > > +                               struct device_attribute *attr, char *buf)
> > > > > +{
> > > > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > > > +       int numb = pci_msix_vec_count(pdev);
> > > > > +       struct pci_dev *pfdev;
> > > > > +
> > > > > +       if (numb < 0)
> > > > > +               return numb;
> > > > > +
> > > > > +       pfdev = pci_physfn(pdev);
> > > > > +       if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
> > > > > +               return -EOPNOTSUPP;
> > > > > +
> > > >
> > > > This doesn't make sense to me. You are getting the vector count for
> > > > the PCI device and reporting that. Are you expecting to call this on
> > > > the PF or the VFs? It seems like this should be a PF attribute and not
> > > > be called on the individual VFs.
> > >
> > > We had this discussion over v0 variant of this series.
> > > https://lore.kernel.org/linux-pci/20210108072525.GB31158@unreal/
> > >
> > > This is per-VF property, but this VF is not bounded to the driver so you
> > > need some way to convey new number to the HW, so it will update PCI value.
> > >
> > > You must change/update this field after VF is created, because all SR-IOV VFs
> > > are created at the same time. The operator (administrator/orchestration
> > > software/e.t.c) will know the right amount of MSI-X vectors right before
> > > he will bind this VF to requested VM.
> > >
> > > It means that extending PF sysfs to get both VF index and count will
> > > look very unfriendly for the users.
> > >
> > > The PF here is an anchor to the relevant driver.
> >
> > Yes, but the problem is you are attempting to do it after a driver may
> > have already bound itself to the VF device. This setup works for the
> > direct assigned VF case, however if the VF drivers are running on the
> > host then this gets ugly as the driver may already be up and running.
>
> Please take a look on the pci_set_msix_vec_count() implementation, it
> checks that VF is not probed yet. I outlined this requirement almost
> in all my responses and commit messages.
>
> So no, it is not possible to set MSI-X vectors to already bound device.

Unless you are holding the device lock you cannot guarantee that as
Alex Williamson pointed out you can end up racing with the driver
probe/remove.

Secondly the fact that the driver might be probed before you even get
to make your call will cause this to return EOPNOTSUPP which doesn't
exactly make sense since it is supported, you just cannot do it since
the device is busy.

> >
> > Also I am not a big fan of the VF groping around looking for a PF
> > interface as it means the interface will likely be exposed in the
> > guest as well, but it just won't work.
>
> If you are referring to VF exposed to the VM, so in this case VF must be
> bound too vfio driver, or any other driver, and won't allow MSI-X change.
> If you are referring to PF exposed to the VM, it is very unlikely scenario
> in real world and reserved for braves among us. Even in this case, the
> set MSI-X won't work, because PF will be connected to the hypervisor driver
> that doesn't support set_msix.
>
> So both cases are handled.

I get that they are handled. However I am not a huge fan of the sysfs
attributes for one device being dependent on another device. When you
have to start searching for another device it just makes things messy.

> >
> > > >
> > > > If you are calling this on the VFs then it doesn't really make any
> > > > sense anyway since the VF is not a "VF PCI dev representor" and
> > > > shouldn't be treated as such. In my opinion if we are going to be
> > > > doing per-port resource limiting that is something that might make
> > > > more sense as a part of the devlink configuration for the VF since the
> > > > actual change won't be visible to an assigned device.
> > >
> > > https://lore.kernel.org/linux-pci/20210112061535.GB4678@unreal/
> >
> > So the question I would have is if we are spawning the VFs and
> > expecting them to have different configs or the same configuration?
>
> By default, they have same configuration.
>
> > I'm assuming in your case you are looking for a different
> > configuration per port. Do I have that correct?
>
> No, per-VF as represents one device in the PCI world. For example, mlx5
> can have more than one physical port.

Sorry, I meant per virtual function, not per port.

> >
> > Where this gets ugly is that SR-IOV assumes a certain uniformity per
> > VF so doing a per-VF custom limitation gets ugly pretty quick.
>
> I don't find any support for this "uniformity" claim in the PCI spec.

I am referring to the PCI configuration space. Each VF ends up with
some fixed amount of MMIO resources per function. So typically when
you spawn VFs we had things setup so that all you do is say how many
you want.

> > I wonder if it would make more sense if we are going this route to just
> > define a device-tree like schema that could be fed in to enable VFs
> > instead of just using echo X > sriov_numvfs and then trying to fix
> > things afterwards. Then you could define this and other features that
> > I am sure you would need in the future via json-schema like is done in
> > device-tree and write it once enabling the set of VFs that you need.
>
> Sorry, but this is overkill, it won't give us much and it doesn't fit
> the VF usage model at all.
>
> Right now, all heavy users of SR-IOV are creating many VFs up to the maximum.
> They do it with autoprobe disabled, because it is too time consuming to wait
> till all VFs probe themselves and unbind them later.
>
> After that, they wait for incoming request to provision VM on VF, they set MAC
> address, change MSI-X according to VM properties and bind that VF to new VM.
>
> So MSI-X change is done after VFs were created.

So if I understand correctly based on your comments below you are
dynamically changing the VF's MSI-X configuration space then?

> >
> > Given that SR-IOV isn't meant to be tweaked dynamically it seems like
> > that would be a much better fit for most users as then you can just
> > modify the schema and reload it which would probably be less effort
> > than having to manually redo all the commands needed to setup the VFs
> > using this approach if you are having to manually update each VF.
>
> Quite opposite it true. First users use orchestration software to manage it.
> Second, you will need to take care of already bound device.

This is the part that concerns me. It seems like this is all adding a
bunch of orchestration need to all of this. Basically the device
config can shift out from under us on the host if we aren't paying
attention.

> >
> > > >
> > > > > +       return sprintf(buf, "%d\n", numb);
> > > > > +}
> > > > > +
> > > > > +static ssize_t vf_msix_vec_store(struct device *dev,
> > > > > +                                struct device_attribute *attr, const char *buf,
> > > > > +                                size_t count)
> > > > > +{
> > > > > +       struct pci_dev *vf_dev = to_pci_dev(dev);
> > > > > +       int val, ret;
> > > > > +
> > > > > +       ret = kstrtoint(buf, 0, &val);
> > > > > +       if (ret)
> > > > > +               return ret;
> > > > > +
> > > > > +       ret = pci_set_msix_vec_count(vf_dev, val);
> > > > > +       if (ret)
> > > > > +               return ret;
> > > > > +
> > > > > +       return count;
> > > > > +}
> > > > > +static DEVICE_ATTR_RW(vf_msix_vec);
> > > > > +#endif
> > > > > +
> > > > > +static struct attribute *sriov_vf_dev_attrs[] = {
> > > > > +#ifdef CONFIG_PCI_MSI
> > > > > +       &dev_attr_vf_msix_vec.attr,
> > > > > +#endif
> > > > > +       NULL,
> > > > > +};
> > > > > +
> > > > > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> > > > > +                                         struct attribute *a, int n)
> > > > > +{
> > > > > +       struct device *dev = kobj_to_dev(kobj);
> > > > > +
> > > > > +       if (dev_is_pf(dev))
> > > > > +               return 0;
> > > > > +
> > > > > +       return a->mode;
> > > > > +}
> > > > > +
> > > > > +const struct attribute_group sriov_vf_dev_attr_group = {
> > > > > +       .attrs = sriov_vf_dev_attrs,
> > > > > +       .is_visible = sriov_vf_attrs_are_visible,
> > > > > +};
> > > > > +
> > > > >  int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
> > > > >  {
> > > > >         return 0;
> > > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > > > > index 3162f88fe940..20705ca94666 100644
> > > > > --- a/drivers/pci/msi.c
> > > > > +++ b/drivers/pci/msi.c
> > > > > @@ -991,6 +991,35 @@ int pci_msix_vec_count(struct pci_dev *dev)
> > > > >  }
> > > > >  EXPORT_SYMBOL(pci_msix_vec_count);
> > > > >
> > > > > +/**
> > > > > + * pci_set_msix_vec_count - change the reported number of MSI-X vectors
> > > > > + * This function is applicable for SR-IOV VF because such devices allocate
> > > > > + * their MSI-X table before they will run on the VMs and HW can't guess the
> > > > > + * right number of vectors, so the HW allocates them statically and equally.
> > > > > + * @dev: VF device that is going to be changed
> > > > > + * @numb: amount of MSI-X vectors
> > > > > + **/
> > > > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > > > > +{
> > > > > +       struct pci_dev *pdev = pci_physfn(dev);
> > > > > +
> > > > > +       if (!dev->msix_cap || !pdev->msix_cap)
> > > > > +               return -EINVAL;
> > > > > +
> > > > > +       if (dev->driver || !pdev->driver ||
> > > > > +           !pdev->driver->sriov_set_msix_vec_count)
> > > > > +               return -EOPNOTSUPP;
> > > > > +
> > > > > +       if (numb < 0)
> > > > > +               /*
> > > > > +                * We don't support negative numbers for now,
> > > > > +                * but maybe in the future it will make sense.
> > > > > +                */
> > > > > +               return -EINVAL;
> > > > > +
> > > > > +       return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> > > > > +}
> > > > > +
> > > >
> > > > If you are going to have a set operation for this it would make sense
> > > > to have a get operation. Your show operation seems unbalanced since
> > > > you are expecting to call it on the VF directly which just seems
> > > > wrong.
> > >
> > > There is already get operator - pci_msix_vec_count().
> > > The same as above, PF is an anchor for the driver. VF doesn't have
> > > driver yet and we can't write directly to this PCI field - it is read-only.
> >
> > That returns the maximum. I would want to know what the value is that
> > you wrote here.
>
> It returns the value from the device and not maximum.

Okay, that was the part I hadn't caught onto. That you were using this
to make read-only configuration space writable.

> >
> > Also being able to change this once the device already exists is kind
> > of an ugly setup as I would imagine there will be cases where the
> > device has already partitioned out the vectors that it wanted.
>
> It is not possible, because all VFs starts with some value. Normal FW
> won't allow you to decrease vector count below certain limit.
>
> So VFs are always operable.

I get that. My concern is more the fact that you are modifying read
only bits within the configuration space that is visible to the host.
I initially thought you were doing the dynamic resizing behind the
scenes by just not enabling some of the MSI-X vectors in the table.
However, as I understand it now you are resizing the MSI-X table
itself which doesn't seem correct to me. The read-only portions of the
configuration space shouldn't be changed, at least within the host. I
just see the changing of fields that are expected to be static to be
problematic.

If all of this is just to tweak the MSI-X table size in the guest
/userspace wouldn't it just make more sense to add the ability for
vfio to limit this directly and perhaps intercept the reads to the
MSI-X control register? Then you could have vfio also take care of the
coordination of things with the driver which would make much more
sense to me as then you don't have to worry about a device on the host
changing size unexpectedly.

Anyway that is just my $.02.

- Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-13  6:19         ` Leon Romanovsky
@ 2021-01-13 22:44           ` Alexander Duyck
  2021-01-14  6:50             ` Leon Romanovsky
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-13 22:44 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Tue, Jan 12, 2021 at 10:19 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Jan 12, 2021 at 01:34:50PM -0800, Alexander Duyck wrote:
> > On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
> > > > On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > Some SR-IOV capable devices provide an ability to configure specific
> > > > > number of MSI-X vectors on their VF prior driver is probed on that VF.
> > > > >
> > > > > In order to make management easy, provide new read-only sysfs file that
> > > > > returns a total number of possible to configure MSI-X vectors.
> > > > >
> > > > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > >   = 0 - feature is not supported
> > > > >   > 0 - total number of MSI-X vectors to consume by the VFs
> > > > >
> > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > ---
> > > > >  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
> > > > >  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
> > > > >  drivers/pci/pci.h                       |  3 +++
> > > > >  include/linux/pci.h                     |  2 ++
> > > > >  4 files changed, 50 insertions(+)
> > > > >
> > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > index 05e26e5da54e..64e9b700acc9 100644
> > > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > @@ -395,3 +395,17 @@ Description:
> > > > >                 The file is writable if the PF is bound to a driver that
> > > > >                 supports the ->sriov_set_msix_vec_count() callback and there
> > > > >                 is no driver bound to the VF.
> > > > > +
> > > > > +What:          /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > >
> > > > In this case I would drop the "vf" and just go with sriov_total_msix
> > > > since now you are referring to a global value instead of a per VF
> > > > value.
> > >
> > > This field indicates the amount of MSI-X available for VFs, it doesn't
> > > include PFs. The missing "_vf_" will mislead users who will believe that
> > > it is all MSI-X vectors available for this device. They will need to take
> > > into consideration amount of PF MSI-X in order to calculate the VF distribution.
> > >
> > > So I would leave "_vf_" here.
> >
> > The problem is you aren't indicating how many are available for an
> > individual VF though, you are indicating how many are available for
> > use by SR-IOV to give to the VFs. The fact that you are dealing with a
> > pool makes things confusing in my opinion. For example sriov_vf_device
> > describes the device ID that will be given to each VF.
>
> sriov_vf_device is different and is implemented accordingly to the PCI
> spec, 9.3.3.11 VF Device ID (Offset 1Ah)
> "This field contains the Device ID that should be presented for every VF
> to the SI."
>
> It is one ID for all VFs.

Yes, but that is what I am getting at. It is also what the device
configuration will be for one VF. So when I read sriov_vf_total_msix
it reads as the total for a single VF, not all of the the VFs. That is
why I think dropping the "vf_" part of the name would make sense, as
what you are describing is the total number of MSI-X vectors for use
by SR-IOV VFs.

> >
> > > >
> > > > > +Date:          January 2021
> > > > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > > > +Description:
> > > > > +               This file is associated with the SR-IOV PFs.
> > > > > +               It returns a total number of possible to configure MSI-X
> > > > > +               vectors on the enabled VFs.
> > > > > +
> > > > > +               The values returned are:
> > > > > +                * > 0 - this will be total number possible to consume by VFs,
> > > > > +                * = 0 - feature is not supported
> > > > > +
> > > > > +               If no SR-IOV VFs are enabled, this value will return 0.
> > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > > index 42c0df4158d1..0a6ddf3230fd 100644
> > > > > --- a/drivers/pci/iov.c
> > > > > +++ b/drivers/pci/iov.c
> > > > > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
> > > > >         return count;
> > > > >  }
> > > > >
> > > > > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > > > > +                                       struct device_attribute *attr,
> > > > > +                                       char *buf)
> > > > > +{
> > > > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > > > +
> > > > > +       return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> > > > > +}
> > > > > +
> > > >
> > > > You display it as a signed value, but unsigned values are not
> > > > supported, correct?
> > >
> > > Right, I made it similar to the vf_msix_set. I can change.
> > >
> > > >
> > > > >  static DEVICE_ATTR_RO(sriov_totalvfs);
> > > > >  static DEVICE_ATTR_RW(sriov_numvfs);
> > > > >  static DEVICE_ATTR_RO(sriov_offset);
> > > > >  static DEVICE_ATTR_RO(sriov_stride);
> > > > >  static DEVICE_ATTR_RO(sriov_vf_device);
> > > > >  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> > > > > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > > > >
> > > > >  static struct attribute *sriov_dev_attrs[] = {
> > > > >         &dev_attr_sriov_totalvfs.attr,
> > > > > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
> > > > >         &dev_attr_sriov_stride.attr,
> > > > >         &dev_attr_sriov_vf_device.attr,
> > > > >         &dev_attr_sriov_drivers_autoprobe.attr,
> > > > > +       &dev_attr_sriov_vf_total_msix.attr,
> > > > >         NULL,
> > > > >  };
> > > > >
> > > > > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
> > > > >                 sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > > >
> > > > >         iov->num_VFs = 0;
> > > > > +       iov->vf_total_msix = 0;
> > > > >         pci_iov_set_numvfs(dev, 0);
> > > > >  }
> > > > >
> > > > > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> > > > >
> > > > > +/**
> > > > > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > > > > + * @dev: the PCI PF device
> > > > > + * @numb: the total number of MSI-X vector to consume by the VFs
> > > > > + *
> > > > > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > > > > + * This interface is complimentary part of the pci_set_msix_vec_count()
> > > > > + * that will be used to configure the required number on the VF.
> > > > > + */
> > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
> > > > > +{
> > > > > +       if (!dev->is_physfn || !dev->driver ||
> > > > > +           !dev->driver->sriov_set_msix_vec_count)
> > > > > +               return;
> > > > > +
> > > > > +       dev->sriov->vf_total_msix = numb;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > > > > +
> > > >
> > > > This seems broken. What validation is being done on the numb value?
> > > > You pass it as int, and your documentation all refers to tests for >=
> > > > 0, but isn't a signed input a possibility as well? Also "numb" doesn't
> > > > make for a good abbreviation as it is already a word of its own. It
> > > > might make more sense to use count or something like that rather than
> > > > trying to abbreviate number.
> > >
> > > "Broken" is a nice word to describe misunderstanding.
> >
> > Would you prefer "lacking input validation".
> >
> > I see all this code in there checking for is_physfn and driver and
> > sriov_set_msix_vec_count before allowing the setting of vf_total_msix.
> > It just seems like a lot of validation is taking place on the wrong
> > things if you are just going to be setting a value reporting the total
> > number of MSI-X vectors in use for SR-IOV.
>
> All those checks are in place to ensure that we are not overwriting the
> default value, which is 0.

Okay, so what you really have is surplus interrupts that you are
wanting to give out to VF devices. So when we indicate 0 here as the
default it really means we have no additional interrupts to give out.
Am I understanding that correctly?

The problem is this is very vendor specific so I am doing my best to
understand it as it is different then the other NICs I have worked
with.

So this value is the size of the total pool of interrupt vectors you
have to split up between the functions, or just the spare ones you
could add to individual VFs? Since you say "total" I am assuming it is
the total pool which means that in order to figure out how many are
available to be reserved we would have to run through all the VFs and
figure out what has already been assigned, correct? If so it wouldn't
hurt to also think about having a free and in-use count somewhere as
well.

> >
> > In addition this value seems like a custom purpose being pushed into
> > the PCIe code since there isn't anything that defaults the value. It
> > seems like at a minimum there should be something that programs a
> > default value for both of these new fields that are being added so
> > that you pull the maximum number of VFs when SR-IOV is enabled, the
> > maximum number of MSI-X vectors from a single VF, and then the default
> > value for this should be the multiple of the two which can then be
> > overridden later.
>
> The default is 0, because most SR-IOV doesn't have proper support of
> setting VF MSI-X.

It wasn't designed to work this way. That is why it doesn't really work.

> Regarding the calculation, it is not correct for the mlx5. We have large
> pool of MSI-X vectors, but setting small number of them. This allows us
> to increase that number on specific VF without need to decrease on
> others "to free" the vectors.

I think I am finally starting to grok what is going on here, but I
really don't like the approach.

Is there any reason why you couldn't have configured your VF to
support whatever the maximum number of MSI-X vectors you wanted to use
was, and then just internally masked off or disabled the ones that you
couldn't allocate to the VF and communicate that to the VF via some
sort of firmware message so it wouldn't use them? If I am not mistaken
that is the approach that has been taken in the past for at least this
portion of things in the Intel drivers.

Then the matter is how to configure it. I'm not a big fan of adding
sysfs to the VF to manage resources that are meant to be controlled by
the PF. Especially when you are having to add sysfs to the PF as well
which creates an asymmetric setup where you are having to read the PF
to find out what resources you can move to the VF. I wonder if
something like the Devlink Resource interface wouldn't make more sense
in this case. Then you would just manage "vf-interrupts" or something
like that with the resource split between each of the VFs with each VF
uniquely identified as a separate sub resource.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-13 22:44           ` Alexander Duyck
@ 2021-01-14  6:50             ` Leon Romanovsky
  2021-01-14 16:40               ` Alexander Duyck
  0 siblings, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-14  6:50 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Wed, Jan 13, 2021 at 02:44:45PM -0800, Alexander Duyck wrote:
> On Tue, Jan 12, 2021 at 10:19 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, Jan 12, 2021 at 01:34:50PM -0800, Alexander Duyck wrote:
> > > On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
> > > > > On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > >
> > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > >
> > > > > > Some SR-IOV capable devices provide an ability to configure specific
> > > > > > number of MSI-X vectors on their VF prior driver is probed on that VF.
> > > > > >
> > > > > > In order to make management easy, provide new read-only sysfs file that
> > > > > > returns a total number of possible to configure MSI-X vectors.
> > > > > >
> > > > > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > > >   = 0 - feature is not supported
> > > > > >   > 0 - total number of MSI-X vectors to consume by the VFs
> > > > > >
> > > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > > ---
> > > > > >  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
> > > > > >  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
> > > > > >  drivers/pci/pci.h                       |  3 +++
> > > > > >  include/linux/pci.h                     |  2 ++
> > > > > >  4 files changed, 50 insertions(+)
> > > > > >
> > > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > index 05e26e5da54e..64e9b700acc9 100644
> > > > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > @@ -395,3 +395,17 @@ Description:
> > > > > >                 The file is writable if the PF is bound to a driver that
> > > > > >                 supports the ->sriov_set_msix_vec_count() callback and there
> > > > > >                 is no driver bound to the VF.
> > > > > > +
> > > > > > +What:          /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > >
> > > > > In this case I would drop the "vf" and just go with sriov_total_msix
> > > > > since now you are referring to a global value instead of a per VF
> > > > > value.
> > > >
> > > > This field indicates the amount of MSI-X available for VFs, it doesn't
> > > > include PFs. The missing "_vf_" will mislead users who will believe that
> > > > it is all MSI-X vectors available for this device. They will need to take
> > > > into consideration amount of PF MSI-X in order to calculate the VF distribution.
> > > >
> > > > So I would leave "_vf_" here.
> > >
> > > The problem is you aren't indicating how many are available for an
> > > individual VF though, you are indicating how many are available for
> > > use by SR-IOV to give to the VFs. The fact that you are dealing with a
> > > pool makes things confusing in my opinion. For example sriov_vf_device
> > > describes the device ID that will be given to each VF.
> >
> > sriov_vf_device is different and is implemented accordingly to the PCI
> > spec, 9.3.3.11 VF Device ID (Offset 1Ah)
> > "This field contains the Device ID that should be presented for every VF
> > to the SI."
> >
> > It is one ID for all VFs.
>
> Yes, but that is what I am getting at. It is also what the device
> configuration will be for one VF. So when I read sriov_vf_total_msix
> it reads as the total for a single VF, not all of the the VFs. That is
> why I think dropping the "vf_" part of the name would make sense, as
> what you are describing is the total number of MSI-X vectors for use
> by SR-IOV VFs.

I can change to anything as long as new name will give clear indication
that this total number is for VFs and doesn't include SR-IOV PF MSI-X.

>
> > >
> > > > >
> > > > > > +Date:          January 2021
> > > > > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > > > > +Description:
> > > > > > +               This file is associated with the SR-IOV PFs.
> > > > > > +               It returns a total number of possible to configure MSI-X
> > > > > > +               vectors on the enabled VFs.
> > > > > > +
> > > > > > +               The values returned are:
> > > > > > +                * > 0 - this will be total number possible to consume by VFs,
> > > > > > +                * = 0 - feature is not supported
> > > > > > +
> > > > > > +               If no SR-IOV VFs are enabled, this value will return 0.
> > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > > > index 42c0df4158d1..0a6ddf3230fd 100644
> > > > > > --- a/drivers/pci/iov.c
> > > > > > +++ b/drivers/pci/iov.c
> > > > > > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
> > > > > >         return count;
> > > > > >  }
> > > > > >
> > > > > > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > > > > > +                                       struct device_attribute *attr,
> > > > > > +                                       char *buf)
> > > > > > +{
> > > > > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > > > > +
> > > > > > +       return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> > > > > > +}
> > > > > > +
> > > > >
> > > > > You display it as a signed value, but unsigned values are not
> > > > > supported, correct?
> > > >
> > > > Right, I made it similar to the vf_msix_set. I can change.
> > > >
> > > > >
> > > > > >  static DEVICE_ATTR_RO(sriov_totalvfs);
> > > > > >  static DEVICE_ATTR_RW(sriov_numvfs);
> > > > > >  static DEVICE_ATTR_RO(sriov_offset);
> > > > > >  static DEVICE_ATTR_RO(sriov_stride);
> > > > > >  static DEVICE_ATTR_RO(sriov_vf_device);
> > > > > >  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> > > > > > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > > > > >
> > > > > >  static struct attribute *sriov_dev_attrs[] = {
> > > > > >         &dev_attr_sriov_totalvfs.attr,
> > > > > > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
> > > > > >         &dev_attr_sriov_stride.attr,
> > > > > >         &dev_attr_sriov_vf_device.attr,
> > > > > >         &dev_attr_sriov_drivers_autoprobe.attr,
> > > > > > +       &dev_attr_sriov_vf_total_msix.attr,
> > > > > >         NULL,
> > > > > >  };
> > > > > >
> > > > > > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
> > > > > >                 sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > > > >
> > > > > >         iov->num_VFs = 0;
> > > > > > +       iov->vf_total_msix = 0;
> > > > > >         pci_iov_set_numvfs(dev, 0);
> > > > > >  }
> > > > > >
> > > > > > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> > > > > >
> > > > > > +/**
> > > > > > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > > > > > + * @dev: the PCI PF device
> > > > > > + * @numb: the total number of MSI-X vector to consume by the VFs
> > > > > > + *
> > > > > > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > > > > > + * This interface is complimentary part of the pci_set_msix_vec_count()
> > > > > > + * that will be used to configure the required number on the VF.
> > > > > > + */
> > > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
> > > > > > +{
> > > > > > +       if (!dev->is_physfn || !dev->driver ||
> > > > > > +           !dev->driver->sriov_set_msix_vec_count)
> > > > > > +               return;
> > > > > > +
> > > > > > +       dev->sriov->vf_total_msix = numb;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > > > > > +
> > > > >
> > > > > This seems broken. What validation is being done on the numb value?
> > > > > You pass it as int, and your documentation all refers to tests for >=
> > > > > 0, but isn't a signed input a possibility as well? Also "numb" doesn't
> > > > > make for a good abbreviation as it is already a word of its own. It
> > > > > might make more sense to use count or something like that rather than
> > > > > trying to abbreviate number.
> > > >
> > > > "Broken" is a nice word to describe misunderstanding.
> > >
> > > Would you prefer "lacking input validation".
> > >
> > > I see all this code in there checking for is_physfn and driver and
> > > sriov_set_msix_vec_count before allowing the setting of vf_total_msix.
> > > It just seems like a lot of validation is taking place on the wrong
> > > things if you are just going to be setting a value reporting the total
> > > number of MSI-X vectors in use for SR-IOV.
> >
> > All those checks are in place to ensure that we are not overwriting the
> > default value, which is 0.
>
> Okay, so what you really have is surplus interrupts that you are
> wanting to give out to VF devices. So when we indicate 0 here as the
> default it really means we have no additional interrupts to give out.
> Am I understanding that correctly?

The vf_total_msix is static value and shouldn't be recalculated after
every MSI-X vector number change. So 0 means that driver doesn't support
at all this feature. The operator is responsible to make proper assignment
calculations, because he is already doing it for the CPUs and netdev queues.

>
> The problem is this is very vendor specific so I am doing my best to
> understand it as it is different then the other NICs I have worked
> with.

There is nothing vendor specific here. There are two types of devices:
1. Support this feature. - The vf_total_msix will be greater than 0 for them
and their FW will do sanity checks when user overwrites their default number
that they sat in the VF creation stage.
2. Doesn't support this feature - The vf_total_msix will be 0.

It is PCI spec, so those "other NICs" that didn't implement the PCI spec
will stay with option #2. It is not different from current situation.

>
> So this value is the size of the total pool of interrupt vectors you
> have to split up between the functions, or just the spare ones you
> could add to individual VFs? Since you say "total" I am assuming it is
> the total pool which means that in order to figure out how many are
> available to be reserved we would have to run through all the VFs and
> figure out what has already been assigned, correct? If so it wouldn't
> hurt to also think about having a free and in-use count somewhere as
> well.

It is not really necessary, the VFs are created with some defaults,
because FW needs to ensure that after "echo X > sriov_numvfs" everything
is operable.

As we already discussed, FW has no other way but create VFs with same
configuration. It means that user needs to check only first VF MSI-X
number and multiply by number of VFs to get total number of already
consumed.

The reserved vectors are implemented by checks in FW to ensure that user
can't write number below certain level.

I prefer to stay "lean" as much as possible and add extra fields only
after the real need arise. Right now and for seen future, it is not needed.

>
> > >
> > > In addition this value seems like a custom purpose being pushed into
> > > the PCIe code since there isn't anything that defaults the value. It
> > > seems like at a minimum there should be something that programs a
> > > default value for both of these new fields that are being added so
> > > that you pull the maximum number of VFs when SR-IOV is enabled, the
> > > maximum number of MSI-X vectors from a single VF, and then the default
> > > value for this should be the multiple of the two which can then be
> > > overridden later.
> >
> > The default is 0, because most SR-IOV doesn't have proper support of
> > setting VF MSI-X.
>
> It wasn't designed to work this way. That is why it doesn't really work.

It can be true for everything that doesn't work.
I will rewrite my sentence above:
"The default is 0, because I don't have other than mlx5 devices at my disposal,
so leave it to other driver authors to implement their callbacks"

>
> > Regarding the calculation, it is not correct for the mlx5. We have large
> > pool of MSI-X vectors, but setting small number of them. This allows us
> > to increase that number on specific VF without need to decrease on
> > others "to free" the vectors.
>
> I think I am finally starting to grok what is going on here, but I
> really don't like the approach.
>
> Is there any reason why you couldn't have configured your VF to
> support whatever the maximum number of MSI-X vectors you wanted to use
> was, and then just internally masked off or disabled the ones that you
> couldn't allocate to the VF and communicate that to the VF via some
> sort of firmware message so it wouldn't use them? If I am not mistaken
> that is the approach that has been taken in the past for at least this
> portion of things in the Intel drivers.

I'm not proficient in Intel drivers and can't comment it, but the idea
looks very controversial and hacky to me.

There are so many issues with proposed model:
1. During driver probe, the __pci_enable_msix() is called and device MSI-X
number is used to set internal to the kernel and device configuration.
It means that vfio will try to set it to be large (our default is huge),
so we will need to provide some module parameter or sysfs to limit in VFIO
and it should be done per-VM.
2. Without limiting in VFIO during VM attach, the first VM can and will
grab everything and we are returning to square one.
3. The difference in lspci output on the hypervisor and inside VM in
regards of MSI-X vector count will be source of endless bug reports.
4. I afraid that is not how orchestration works.

This supports even more than before the need to do it properly in
pci/core and make user experience clean, easy and reliable.

>
> Then the matter is how to configure it. I'm not a big fan of adding
> sysfs to the VF to manage resources that are meant to be controlled by
> the PF. Especially when you are having to add sysfs to the PF as well
> which creates an asymmetric setup where you are having to read the PF
> to find out what resources you can move to the VF. I wonder if
> something like the Devlink Resource interface wouldn't make more sense
> in this case. Then you would just manage "vf-interrupts" or something
> like that with the resource split between each of the VFs with each VF
> uniquely identified as a separate sub resource.

I remind you that this feature is applicable to all SR-IOV devices, we have
RDMA, NVMe, crypto, FPGA and netdev VFs. The devlink is not supported
outside of netdev world and implementation there will make this feature
is far from being usable.

Right now, the configuration of PCI/core is done through sysfs, so let's
review this feature from PCI/core perspective and not from netdev where
sysfs is not widely used.

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-13 20:00           ` Alexander Duyck
@ 2021-01-14  7:16             ` Leon Romanovsky
  2021-01-14 16:51               ` Alexander Duyck
  0 siblings, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-14  7:16 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Wed, Jan 13, 2021 at 12:00:00PM -0800, Alexander Duyck wrote:
> On Tue, Jan 12, 2021 at 10:09 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, Jan 12, 2021 at 01:59:51PM -0800, Alexander Duyck wrote:
> > > On Mon, Jan 11, 2021 at 10:39 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Mon, Jan 11, 2021 at 11:30:33AM -0800, Alexander Duyck wrote:
> > > > > On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > >
> > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > >
> > > > > > Extend PCI sysfs interface with a new callback that allows configure
> > > > > > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > > > > > to optimize the performance of newly bound devices by allocating
> > > > > > the number of vectors based on the administrator knowledge of targeted VM.
> > > > > >
> > > > > > This function is applicable for SR-IOV VF because such devices allocate
> > > > > > their MSI-X table before they will run on the VMs and HW can't guess the
> > > > > > right number of vectors, so the HW allocates them statically and equally.
> > > > > >
> > > > > > The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> > > > > > for the VFs and it is writable as long as a driver is not bounded to the VF.
> > > > > >
> > > > > > The values accepted are:
> > > > > >  * > 0 - this will be number reported by the VF's MSI-X capability
> > > > > >  * < 0 - not valid
> > > > > >  * = 0 - will reset to the device default value
> > > > > >
> > > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > > ---
> > > > > >  Documentation/ABI/testing/sysfs-bus-pci | 20 ++++++++
> > > > > >  drivers/pci/iov.c                       | 62 +++++++++++++++++++++++++
> > > > > >  drivers/pci/msi.c                       | 29 ++++++++++++
> > > > > >  drivers/pci/pci-sysfs.c                 |  1 +
> > > > > >  drivers/pci/pci.h                       |  2 +
> > > > > >  include/linux/pci.h                     |  8 +++-
> > > > > >  6 files changed, 121 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > index 25c9c39770c6..05e26e5da54e 100644
> > > > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > @@ -375,3 +375,23 @@ Description:
> > > > > >                 The value comes from the PCI kernel device state and can be one
> > > > > >                 of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> > > > > >                 The file is read only.
> > > > > > +
> > > > > > +What:          /sys/bus/pci/devices/.../vf_msix_vec
> > > > >
> > > > > So the name for this doesn't seem to match existing SR-IOV naming.  It
> > > > > seems like this should probably be something like sriov_vf_msix_count
> > > > > in order to be closer to the actual naming of what is being dealt
> > > > > with.
> > > >
> > > > I'm open for suggestions. I didn't use sriov_vf_msix_count because it
> > > > seems too long for me.
> > > >
> > > > >
> > > > > > +Date:          December 2020
> > > > > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > > > > +Description:
> > > > > > +               This file is associated with the SR-IOV VFs.
> > > > > > +               It allows configuration of the number of MSI-X vectors for
> > > > > > +               the VF. This is needed to optimize performance of newly bound
> > > > > > +               devices by allocating the number of vectors based on the
> > > > > > +               administrator knowledge of targeted VM.
> > > > > > +
> > > > > > +               The values accepted are:
> > > > > > +                * > 0 - this will be number reported by the VF's MSI-X
> > > > > > +                        capability
> > > > > > +                * < 0 - not valid
> > > > > > +                * = 0 - will reset to the device default value
> > > > > > +
> > > > > > +               The file is writable if the PF is bound to a driver that
> > > > > > +               supports the ->sriov_set_msix_vec_count() callback and there
> > > > > > +               is no driver bound to the VF.
> > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > > > index 4afd4ee4f7f0..42c0df4158d1 100644
> > > > > > --- a/drivers/pci/iov.c
> > > > > > +++ b/drivers/pci/iov.c
> > > > > > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> > > > > >         return (dev->devfn + dev->sriov->offset +
> > > > > >                 dev->sriov->stride * vf_id) & 0xff;
> > > > > >  }
> > > > > > +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
> > > > > >
> > > > > >  /*
> > > > > >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> > > > > > @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
> > > > > >         .is_visible = sriov_attrs_are_visible,
> > > > > >  };
> > > > > >
> > > > > > +#ifdef CONFIG_PCI_MSI
> > > > > > +static ssize_t vf_msix_vec_show(struct device *dev,
> > > > > > +                               struct device_attribute *attr, char *buf)
> > > > > > +{
> > > > > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > > > > +       int numb = pci_msix_vec_count(pdev);
> > > > > > +       struct pci_dev *pfdev;
> > > > > > +
> > > > > > +       if (numb < 0)
> > > > > > +               return numb;
> > > > > > +
> > > > > > +       pfdev = pci_physfn(pdev);
> > > > > > +       if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
> > > > > > +               return -EOPNOTSUPP;
> > > > > > +
> > > > >
> > > > > This doesn't make sense to me. You are getting the vector count for
> > > > > the PCI device and reporting that. Are you expecting to call this on
> > > > > the PF or the VFs? It seems like this should be a PF attribute and not
> > > > > be called on the individual VFs.
> > > >
> > > > We had this discussion over v0 variant of this series.
> > > > https://lore.kernel.org/linux-pci/20210108072525.GB31158@unreal/
> > > >
> > > > This is per-VF property, but this VF is not bounded to the driver so you
> > > > need some way to convey new number to the HW, so it will update PCI value.
> > > >
> > > > You must change/update this field after VF is created, because all SR-IOV VFs
> > > > are created at the same time. The operator (administrator/orchestration
> > > > software/e.t.c) will know the right amount of MSI-X vectors right before
> > > > he will bind this VF to requested VM.
> > > >
> > > > It means that extending PF sysfs to get both VF index and count will
> > > > look very unfriendly for the users.
> > > >
> > > > The PF here is an anchor to the relevant driver.
> > >
> > > Yes, but the problem is you are attempting to do it after a driver may
> > > have already bound itself to the VF device. This setup works for the
> > > direct assigned VF case, however if the VF drivers are running on the
> > > host then this gets ugly as the driver may already be up and running.
> >
> > Please take a look on the pci_set_msix_vec_count() implementation, it
> > checks that VF is not probed yet. I outlined this requirement almost
> > in all my responses and commit messages.
> >
> > So no, it is not possible to set MSI-X vectors to already bound device.
>
> Unless you are holding the device lock you cannot guarantee that as
> Alex Williamson pointed out you can end up racing with the driver
> probe/remove.

Yeah, I overlooked it, thanks.

>
> Secondly the fact that the driver might be probed before you even get
> to make your call will cause this to return EOPNOTSUPP which doesn't
> exactly make sense since it is supported, you just cannot do it since
> the device is busy.

I'll return -EBUSY.

>
> > >
> > > Also I am not a big fan of the VF groping around looking for a PF
> > > interface as it means the interface will likely be exposed in the
> > > guest as well, but it just won't work.
> >
> > If you are referring to VF exposed to the VM, so in this case VF must be
> > bound too vfio driver, or any other driver, and won't allow MSI-X change.
> > If you are referring to PF exposed to the VM, it is very unlikely scenario
> > in real world and reserved for braves among us. Even in this case, the
> > set MSI-X won't work, because PF will be connected to the hypervisor driver
> > that doesn't support set_msix.
> >
> > So both cases are handled.
>
> I get that they are handled. However I am not a huge fan of the sysfs
> attributes for one device being dependent on another device. When you
> have to start searching for another device it just makes things messy.

This is pretty common way, nothing new here.

>
> > >
> > > > >
> > > > > If you are calling this on the VFs then it doesn't really make any
> > > > > sense anyway since the VF is not a "VF PCI dev representor" and
> > > > > shouldn't be treated as such. In my opinion if we are going to be
> > > > > doing per-port resource limiting that is something that might make
> > > > > more sense as a part of the devlink configuration for the VF since the
> > > > > actual change won't be visible to an assigned device.
> > > >
> > > > https://lore.kernel.org/linux-pci/20210112061535.GB4678@unreal/
> > >
> > > So the question I would have is if we are spawning the VFs and
> > > expecting them to have different configs or the same configuration?
> >
> > By default, they have same configuration.
> >
> > > I'm assuming in your case you are looking for a different
> > > configuration per port. Do I have that correct?
> >
> > No, per-VF as represents one device in the PCI world. For example, mlx5
> > can have more than one physical port.
>
> Sorry, I meant per virtual function, not per port.

Yes, PCI spec is clear, MSI-X vector count is per-device and in our case
it means per-VF.

>
> > >
> > > Where this gets ugly is that SR-IOV assumes a certain uniformity per
> > > VF so doing a per-VF custom limitation gets ugly pretty quick.
> >
> > I don't find any support for this "uniformity" claim in the PCI spec.
>
> I am referring to the PCI configuration space. Each VF ends up with
> some fixed amount of MMIO resources per function. So typically when
> you spawn VFs we had things setup so that all you do is say how many
> you want.
>
> > > I wonder if it would make more sense if we are going this route to just
> > > define a device-tree like schema that could be fed in to enable VFs
> > > instead of just using echo X > sriov_numvfs and then trying to fix
> > > things afterwards. Then you could define this and other features that
> > > I am sure you would need in the future via json-schema like is done in
> > > device-tree and write it once enabling the set of VFs that you need.
> >
> > Sorry, but this is overkill, it won't give us much and it doesn't fit
> > the VF usage model at all.
> >
> > Right now, all heavy users of SR-IOV are creating many VFs up to the maximum.
> > They do it with autoprobe disabled, because it is too time consuming to wait
> > till all VFs probe themselves and unbind them later.
> >
> > After that, they wait for incoming request to provision VM on VF, they set MAC
> > address, change MSI-X according to VM properties and bind that VF to new VM.
> >
> > So MSI-X change is done after VFs were created.
>
> So if I understand correctly based on your comments below you are
> dynamically changing the VF's MSI-X configuration space then?

I'm changing "Table Size" from "7.7.2.2 Message Control Register for
MSI-X (Offset 02h)" and nothing more.

If you do raw PCI read before and after, only this field will be changed.

>
> > >
> > > Given that SR-IOV isn't meant to be tweaked dynamically it seems like
> > > that would be a much better fit for most users as then you can just
> > > modify the schema and reload it which would probably be less effort
> > > than having to manually redo all the commands needed to setup the VFs
> > > using this approach if you are having to manually update each VF.
> >
> > Quite opposite it true. First users use orchestration software to manage it.
> > Second, you will need to take care of already bound device.
>
> This is the part that concerns me. It seems like this is all adding a
> bunch of orchestration need to all of this. Basically the device
> config can shift out from under us on the host if we aren't paying
> attention.

I just presented one possible use case that is for sure will be used here.
Nothing prohibits from curious user to block driver probe, or to unbind
or do any other trick to ensure that driver is not probed.

>
> > >
> > > > >
> > > > > > +       return sprintf(buf, "%d\n", numb);
> > > > > > +}
> > > > > > +
> > > > > > +static ssize_t vf_msix_vec_store(struct device *dev,
> > > > > > +                                struct device_attribute *attr, const char *buf,
> > > > > > +                                size_t count)
> > > > > > +{
> > > > > > +       struct pci_dev *vf_dev = to_pci_dev(dev);
> > > > > > +       int val, ret;
> > > > > > +
> > > > > > +       ret = kstrtoint(buf, 0, &val);
> > > > > > +       if (ret)
> > > > > > +               return ret;
> > > > > > +
> > > > > > +       ret = pci_set_msix_vec_count(vf_dev, val);
> > > > > > +       if (ret)
> > > > > > +               return ret;
> > > > > > +
> > > > > > +       return count;
> > > > > > +}
> > > > > > +static DEVICE_ATTR_RW(vf_msix_vec);
> > > > > > +#endif
> > > > > > +
> > > > > > +static struct attribute *sriov_vf_dev_attrs[] = {
> > > > > > +#ifdef CONFIG_PCI_MSI
> > > > > > +       &dev_attr_vf_msix_vec.attr,
> > > > > > +#endif
> > > > > > +       NULL,
> > > > > > +};
> > > > > > +
> > > > > > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> > > > > > +                                         struct attribute *a, int n)
> > > > > > +{
> > > > > > +       struct device *dev = kobj_to_dev(kobj);
> > > > > > +
> > > > > > +       if (dev_is_pf(dev))
> > > > > > +               return 0;
> > > > > > +
> > > > > > +       return a->mode;
> > > > > > +}
> > > > > > +
> > > > > > +const struct attribute_group sriov_vf_dev_attr_group = {
> > > > > > +       .attrs = sriov_vf_dev_attrs,
> > > > > > +       .is_visible = sriov_vf_attrs_are_visible,
> > > > > > +};
> > > > > > +
> > > > > >  int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
> > > > > >  {
> > > > > >         return 0;
> > > > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > > > > > index 3162f88fe940..20705ca94666 100644
> > > > > > --- a/drivers/pci/msi.c
> > > > > > +++ b/drivers/pci/msi.c
> > > > > > @@ -991,6 +991,35 @@ int pci_msix_vec_count(struct pci_dev *dev)
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(pci_msix_vec_count);
> > > > > >
> > > > > > +/**
> > > > > > + * pci_set_msix_vec_count - change the reported number of MSI-X vectors
> > > > > > + * This function is applicable for SR-IOV VF because such devices allocate
> > > > > > + * their MSI-X table before they will run on the VMs and HW can't guess the
> > > > > > + * right number of vectors, so the HW allocates them statically and equally.
> > > > > > + * @dev: VF device that is going to be changed
> > > > > > + * @numb: amount of MSI-X vectors
> > > > > > + **/
> > > > > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > > > > > +{
> > > > > > +       struct pci_dev *pdev = pci_physfn(dev);
> > > > > > +
> > > > > > +       if (!dev->msix_cap || !pdev->msix_cap)
> > > > > > +               return -EINVAL;
> > > > > > +
> > > > > > +       if (dev->driver || !pdev->driver ||
> > > > > > +           !pdev->driver->sriov_set_msix_vec_count)
> > > > > > +               return -EOPNOTSUPP;
> > > > > > +
> > > > > > +       if (numb < 0)
> > > > > > +               /*
> > > > > > +                * We don't support negative numbers for now,
> > > > > > +                * but maybe in the future it will make sense.
> > > > > > +                */
> > > > > > +               return -EINVAL;
> > > > > > +
> > > > > > +       return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> > > > > > +}
> > > > > > +
> > > > >
> > > > > If you are going to have a set operation for this it would make sense
> > > > > to have a get operation. Your show operation seems unbalanced since
> > > > > you are expecting to call it on the VF directly which just seems
> > > > > wrong.
> > > >
> > > > There is already get operator - pci_msix_vec_count().
> > > > The same as above, PF is an anchor for the driver. VF doesn't have
> > > > driver yet and we can't write directly to this PCI field - it is read-only.
> > >
> > > That returns the maximum. I would want to know what the value is that
> > > you wrote here.
> >
> > It returns the value from the device and not maximum.
>
> Okay, that was the part I hadn't caught onto. That you were using this
> to make read-only configuration space writable.
>
> > >
> > > Also being able to change this once the device already exists is kind
> > > of an ugly setup as I would imagine there will be cases where the
> > > device has already partitioned out the vectors that it wanted.
> >
> > It is not possible, because all VFs starts with some value. Normal FW
> > won't allow you to decrease vector count below certain limit.
> >
> > So VFs are always operable.
>
> I get that. My concern is more the fact that you are modifying read
> only bits within the configuration space that is visible to the host.

Yes, and this is why it is done through PF and without driver probes
and with driver callback. It is done to make coherent picture for the
administrator and VM user.

> I initially thought you were doing the dynamic resizing behind the
> scenes by just not enabling some of the MSI-X vectors in the table.
> However, as I understand it now you are resizing the MSI-X table
> itself which doesn't seem correct to me. The read-only portions of the
> configuration space shouldn't be changed, at least within the host. I
> just see the changing of fields that are expected to be static to be
> problematic.

This is why sriov_drivers_autoprobe exist, to configure device prior
driver initialization.

>
> If all of this is just to tweak the MSI-X table size in the guest
> /userspace wouldn't it just make more sense to add the ability for
> vfio to limit this directly and perhaps intercept the reads to the
> MSI-X control register? Then you could have vfio also take care of the
> coordination of things with the driver which would make much more
> sense to me as then you don't have to worry about a device on the host
> changing size unexpectedly.

I hope that I provided answers in parallel thread on why vfio solution
is not viable.

>
> Anyway that is just my $.02.
>
> - Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14  6:50             ` Leon Romanovsky
@ 2021-01-14 16:40               ` Alexander Duyck
  2021-01-14 16:48                 ` Jason Gunthorpe
  2021-01-14 17:26                 ` Leon Romanovsky
  0 siblings, 2 replies; 49+ messages in thread
From: Alexander Duyck @ 2021-01-14 16:40 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Wed, Jan 13, 2021 at 10:50 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Jan 13, 2021 at 02:44:45PM -0800, Alexander Duyck wrote:
> > On Tue, Jan 12, 2021 at 10:19 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Tue, Jan 12, 2021 at 01:34:50PM -0800, Alexander Duyck wrote:
> > > > On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
> > > > > > On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > >
> > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > >
> > > > > > > Some SR-IOV capable devices provide an ability to configure specific
> > > > > > > number of MSI-X vectors on their VF prior driver is probed on that VF.
> > > > > > >
> > > > > > > In order to make management easy, provide new read-only sysfs file that
> > > > > > > returns a total number of possible to configure MSI-X vectors.
> > > > > > >
> > > > > > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > > > >   = 0 - feature is not supported
> > > > > > >   > 0 - total number of MSI-X vectors to consume by the VFs
> > > > > > >
> > > > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > ---
> > > > > > >  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
> > > > > > >  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
> > > > > > >  drivers/pci/pci.h                       |  3 +++
> > > > > > >  include/linux/pci.h                     |  2 ++
> > > > > > >  4 files changed, 50 insertions(+)
> > > > > > >
> > > > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > > index 05e26e5da54e..64e9b700acc9 100644
> > > > > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > > @@ -395,3 +395,17 @@ Description:
> > > > > > >                 The file is writable if the PF is bound to a driver that
> > > > > > >                 supports the ->sriov_set_msix_vec_count() callback and there
> > > > > > >                 is no driver bound to the VF.
> > > > > > > +
> > > > > > > +What:          /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > > >
> > > > > > In this case I would drop the "vf" and just go with sriov_total_msix
> > > > > > since now you are referring to a global value instead of a per VF
> > > > > > value.
> > > > >
> > > > > This field indicates the amount of MSI-X available for VFs, it doesn't
> > > > > include PFs. The missing "_vf_" will mislead users who will believe that
> > > > > it is all MSI-X vectors available for this device. They will need to take
> > > > > into consideration amount of PF MSI-X in order to calculate the VF distribution.
> > > > >
> > > > > So I would leave "_vf_" here.
> > > >
> > > > The problem is you aren't indicating how many are available for an
> > > > individual VF though, you are indicating how many are available for
> > > > use by SR-IOV to give to the VFs. The fact that you are dealing with a
> > > > pool makes things confusing in my opinion. For example sriov_vf_device
> > > > describes the device ID that will be given to each VF.
> > >
> > > sriov_vf_device is different and is implemented accordingly to the PCI
> > > spec, 9.3.3.11 VF Device ID (Offset 1Ah)
> > > "This field contains the Device ID that should be presented for every VF
> > > to the SI."
> > >
> > > It is one ID for all VFs.
> >
> > Yes, but that is what I am getting at. It is also what the device
> > configuration will be for one VF. So when I read sriov_vf_total_msix
> > it reads as the total for a single VF, not all of the the VFs. That is
> > why I think dropping the "vf_" part of the name would make sense, as
> > what you are describing is the total number of MSI-X vectors for use
> > by SR-IOV VFs.
>
> I can change to anything as long as new name will give clear indication
> that this total number is for VFs and doesn't include SR-IOV PF MSI-X.

It is interesting that you make that distinction.

So in the case of the Intel hardware we had one pool of MSI-X
interrupts that was available for the entire port, both PF and VF.
When we enabled SR-IOV we had to repartition that pool in order to
assign interrupts to devices. So it sounds like in your case you don't
do that and instead the PF is static and the VFs are the only piece
that is flexible. Do I have that correct?

> >
> > > >
> > > > > >
> > > > > > > +Date:          January 2021
> > > > > > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > > > > > +Description:
> > > > > > > +               This file is associated with the SR-IOV PFs.
> > > > > > > +               It returns a total number of possible to configure MSI-X
> > > > > > > +               vectors on the enabled VFs.
> > > > > > > +
> > > > > > > +               The values returned are:
> > > > > > > +                * > 0 - this will be total number possible to consume by VFs,
> > > > > > > +                * = 0 - feature is not supported
> > > > > > > +
> > > > > > > +               If no SR-IOV VFs are enabled, this value will return 0.
> > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > > > > index 42c0df4158d1..0a6ddf3230fd 100644
> > > > > > > --- a/drivers/pci/iov.c
> > > > > > > +++ b/drivers/pci/iov.c
> > > > > > > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
> > > > > > >         return count;
> > > > > > >  }
> > > > > > >
> > > > > > > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > > > > > > +                                       struct device_attribute *attr,
> > > > > > > +                                       char *buf)
> > > > > > > +{
> > > > > > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > > > > > +
> > > > > > > +       return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> > > > > > > +}
> > > > > > > +
> > > > > >
> > > > > > You display it as a signed value, but unsigned values are not
> > > > > > supported, correct?
> > > > >
> > > > > Right, I made it similar to the vf_msix_set. I can change.
> > > > >
> > > > > >
> > > > > > >  static DEVICE_ATTR_RO(sriov_totalvfs);
> > > > > > >  static DEVICE_ATTR_RW(sriov_numvfs);
> > > > > > >  static DEVICE_ATTR_RO(sriov_offset);
> > > > > > >  static DEVICE_ATTR_RO(sriov_stride);
> > > > > > >  static DEVICE_ATTR_RO(sriov_vf_device);
> > > > > > >  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> > > > > > > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > > > > > >
> > > > > > >  static struct attribute *sriov_dev_attrs[] = {
> > > > > > >         &dev_attr_sriov_totalvfs.attr,
> > > > > > > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
> > > > > > >         &dev_attr_sriov_stride.attr,
> > > > > > >         &dev_attr_sriov_vf_device.attr,
> > > > > > >         &dev_attr_sriov_drivers_autoprobe.attr,
> > > > > > > +       &dev_attr_sriov_vf_total_msix.attr,
> > > > > > >         NULL,
> > > > > > >  };
> > > > > > >
> > > > > > > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
> > > > > > >                 sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > > > > >
> > > > > > >         iov->num_VFs = 0;
> > > > > > > +       iov->vf_total_msix = 0;
> > > > > > >         pci_iov_set_numvfs(dev, 0);
> > > > > > >  }
> > > > > > >
> > > > > > > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> > > > > > >
> > > > > > > +/**
> > > > > > > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > > > > > > + * @dev: the PCI PF device
> > > > > > > + * @numb: the total number of MSI-X vector to consume by the VFs
> > > > > > > + *
> > > > > > > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > > > > > > + * This interface is complimentary part of the pci_set_msix_vec_count()
> > > > > > > + * that will be used to configure the required number on the VF.
> > > > > > > + */
> > > > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
> > > > > > > +{
> > > > > > > +       if (!dev->is_physfn || !dev->driver ||
> > > > > > > +           !dev->driver->sriov_set_msix_vec_count)
> > > > > > > +               return;
> > > > > > > +
> > > > > > > +       dev->sriov->vf_total_msix = numb;
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > > > > > > +
> > > > > >
> > > > > > This seems broken. What validation is being done on the numb value?
> > > > > > You pass it as int, and your documentation all refers to tests for >=
> > > > > > 0, but isn't a signed input a possibility as well? Also "numb" doesn't
> > > > > > make for a good abbreviation as it is already a word of its own. It
> > > > > > might make more sense to use count or something like that rather than
> > > > > > trying to abbreviate number.
> > > > >
> > > > > "Broken" is a nice word to describe misunderstanding.
> > > >
> > > > Would you prefer "lacking input validation".
> > > >
> > > > I see all this code in there checking for is_physfn and driver and
> > > > sriov_set_msix_vec_count before allowing the setting of vf_total_msix.
> > > > It just seems like a lot of validation is taking place on the wrong
> > > > things if you are just going to be setting a value reporting the total
> > > > number of MSI-X vectors in use for SR-IOV.
> > >
> > > All those checks are in place to ensure that we are not overwriting the
> > > default value, which is 0.
> >
> > Okay, so what you really have is surplus interrupts that you are
> > wanting to give out to VF devices. So when we indicate 0 here as the
> > default it really means we have no additional interrupts to give out.
> > Am I understanding that correctly?
>
> The vf_total_msix is static value and shouldn't be recalculated after
> every MSI-X vector number change. So 0 means that driver doesn't support
> at all this feature. The operator is responsible to make proper assignment
> calculations, because he is already doing it for the CPUs and netdev queues.

Honestly that makes things even uglier. So basically this is a feature
where if it isn't supported it will make it look like the SR-IOV
device doesn't support MSI-X. I realize it is just the way it is
worded but it isn't very pretty.

> >
> > The problem is this is very vendor specific so I am doing my best to
> > understand it as it is different then the other NICs I have worked
> > with.
>
> There is nothing vendor specific here. There are two types of devices:
> 1. Support this feature. - The vf_total_msix will be greater than 0 for them
> and their FW will do sanity checks when user overwrites their default number
> that they sat in the VF creation stage.
> 2. Doesn't support this feature - The vf_total_msix will be 0.
>
> It is PCI spec, so those "other NICs" that didn't implement the PCI spec
> will stay with option #2. It is not different from current situation.

Where in the spec is this?

I know in the PCI spec it says that the MSI-X table size is read-only
and is not supposed to be written by system software. That is what is
being overwritten right now by your patches that has me concerned.

As far as the logic behind the approach described above. You can
define up to the MSI-X table size worth of interrupt vectors, however
nothing says the device has to make use of those vectors. So defining
a table with unused entries works within the spec since all the table
is defining is interrupts you can use, not interrupts you must use. It
basically comes down to the fact that an interrupt can be broken into
target actions which are defined in the table, and "initiators" or
"sources" which are controlled by the internal configuration of the
device and must be associated with the target action for something to
occur. You can have a fixed number of initiators that can be shared
between the targets.

> >
> > So this value is the size of the total pool of interrupt vectors you
> > have to split up between the functions, or just the spare ones you
> > could add to individual VFs? Since you say "total" I am assuming it is
> > the total pool which means that in order to figure out how many are
> > available to be reserved we would have to run through all the VFs and
> > figure out what has already been assigned, correct? If so it wouldn't
> > hurt to also think about having a free and in-use count somewhere as
> > well.
>
> It is not really necessary, the VFs are created with some defaults,
> because FW needs to ensure that after "echo X > sriov_numvfs" everything
> is operable.
>
> As we already discussed, FW has no other way but create VFs with same
> configuration. It means that user needs to check only first VF MSI-X
> number and multiply by number of VFs to get total number of already
> consumed.

That is only at the start though. Once they start modifying VF
interrupts the individual numbers could be all over the place.

> The reserved vectors are implemented by checks in FW to ensure that user
> can't write number below certain level.
>
> I prefer to stay "lean" as much as possible and add extra fields only
> after the real need arise. Right now and for seen future, it is not needed.

It isn't so much about extra fields as the availability of data. When
you have data placed in multiple locations and only a single error
return type if things go wrong it makes it kind of ugly in terms of a
user interface.

> >
> > > >
> > > > In addition this value seems like a custom purpose being pushed into
> > > > the PCIe code since there isn't anything that defaults the value. It
> > > > seems like at a minimum there should be something that programs a
> > > > default value for both of these new fields that are being added so
> > > > that you pull the maximum number of VFs when SR-IOV is enabled, the
> > > > maximum number of MSI-X vectors from a single VF, and then the default
> > > > value for this should be the multiple of the two which can then be
> > > > overridden later.
> > >
> > > The default is 0, because most SR-IOV doesn't have proper support of
> > > setting VF MSI-X.
> >
> > It wasn't designed to work this way. That is why it doesn't really work.
>
> It can be true for everything that doesn't work.
> I will rewrite my sentence above:
> "The default is 0, because I don't have other than mlx5 devices at my disposal,
> so leave it to other driver authors to implement their callbacks"

So you are saying this is a vendor specific interface then? Until we
have another party willing to sign on as this being an approach they
are going to take with their NIC I would question if this is really
the way we want to go about handling this.

I would be curious if any other vendor supports editing the VF MSI-X
table size on the fly. If so then maybe we will have to accept it as a
quirk of SR-IOV. However we really should be trying to avoid this as a
supported method if possible.

> >
> > > Regarding the calculation, it is not correct for the mlx5. We have large
> > > pool of MSI-X vectors, but setting small number of them. This allows us
> > > to increase that number on specific VF without need to decrease on
> > > others "to free" the vectors.
> >
> > I think I am finally starting to grok what is going on here, but I
> > really don't like the approach.
> >
> > Is there any reason why you couldn't have configured your VF to
> > support whatever the maximum number of MSI-X vectors you wanted to use
> > was, and then just internally masked off or disabled the ones that you
> > couldn't allocate to the VF and communicate that to the VF via some
> > sort of firmware message so it wouldn't use them? If I am not mistaken
> > that is the approach that has been taken in the past for at least this
> > portion of things in the Intel drivers.
>
> I'm not proficient in Intel drivers and can't comment it, but the idea
> looks very controversial and hacky to me.
>
> There are so many issues with proposed model:
> 1. During driver probe, the __pci_enable_msix() is called and device MSI-X
> number is used to set internal to the kernel and device configuration.
> It means that vfio will try to set it to be large (our default is huge),
> so we will need to provide some module parameter or sysfs to limit in VFIO
> and it should be done per-VM.

How would that be much different than what you have now? What I was
suggesting is that you could expose your sysfs value either as a part
of binding the VFIO driver, or as a module parameter. You said that
your orchestration was managing this per VM so I see this behaving the
same way. Your orchestration layer would be setting this value to
limit what is exposed in the VM.

The extra call to the PF may be needed in order to guarantee that the
device/firmware knows that it is supposed to ignore the interrupts
past a certain point.

> 2. Without limiting in VFIO during VM attach, the first VM can and will
> grab everything and we are returning to square one.

It depends on how you have this implemented. As I mentioned earlier
the editing of the VF configuration space is troubling since you are
essentially using the firmware to circumvent the read-only protection
that is provided by the spec in order to change the MSI-X table size
on the fly.

Where I think you and I disagree is that I really think the MSI-X
table size should be fixed at one value for the life of the VF.
Instead of changing the table size it should be the number of vectors
that are functional that should be the limit. Basically there should
be only some that are functional and some that are essentially just
dummy vectors that you can write values into that will never be used.

> 3. The difference in lspci output on the hypervisor and inside VM in
> regards of MSI-X vector count will be source of endless bug reports.

Nothing says the two have to differ other than the fact that at a
certain point the MSI-X table stops being written to the hardware and
those vectors beyond that point are internally masked and disabled.

> 4. I afraid that is not how orchestration works.
>
> This supports even more than before the need to do it properly in
> pci/core and make user experience clean, easy and reliable.

I would think that would be pretty straight forward. Basically the
hardware would ignore interrupt vectors entries past a certain point,
and the PBA bits for those entries would always read 0.

> >
> > Then the matter is how to configure it. I'm not a big fan of adding
> > sysfs to the VF to manage resources that are meant to be controlled by
> > the PF. Especially when you are having to add sysfs to the PF as well
> > which creates an asymmetric setup where you are having to read the PF
> > to find out what resources you can move to the VF. I wonder if
> > something like the Devlink Resource interface wouldn't make more sense
> > in this case. Then you would just manage "vf-interrupts" or something
> > like that with the resource split between each of the VFs with each VF
> > uniquely identified as a separate sub resource.
>
> I remind you that this feature is applicable to all SR-IOV devices, we have
> RDMA, NVMe, crypto, FPGA and netdev VFs. The devlink is not supported
> outside of netdev world and implementation there will make this feature
> is far from being usable.

To quote the documentation:
"devlink is an API to expose device information and resources not directly
related to any device class, such as chip-wide/switch-ASIC-wide configuration."

> Right now, the configuration of PCI/core is done through sysfs, so let's
> review this feature from PCI/core perspective and not from netdev where
> sysfs is not widely used.

The problem is what you are configuring is not necessarily PCI device
specific. You are configuring the firmware which operates at a level
above the PCI device. You just have it manifesting as a PCI behavior
as you are editing a read-only configuration space field.

Also as I mentioned a few times now the approach you are taking
violates the PCIe spec by essentially providing a means of making a
read-only field writable.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 16:40               ` Alexander Duyck
@ 2021-01-14 16:48                 ` Jason Gunthorpe
  2021-01-14 17:55                   ` Alexander Duyck
  2021-01-14 17:26                 ` Leon Romanovsky
  1 sibling, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2021-01-14 16:48 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:

> Where I think you and I disagree is that I really think the MSI-X
> table size should be fixed at one value for the life of the VF.
> Instead of changing the table size it should be the number of vectors
> that are functional that should be the limit. Basically there should
> be only some that are functional and some that are essentially just
> dummy vectors that you can write values into that will never be used.

Ignoring the PCI config space to learn the # of available MSI-X
vectors is big break on the how the device's programming ABI works.

Or stated another way, that isn't compatible with any existing drivers
so it is basically not a useful approach as it can't be deployed.

I don't know why you think that is better.

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-14  7:16             ` Leon Romanovsky
@ 2021-01-14 16:51               ` Alexander Duyck
  2021-01-14 17:12                 ` Leon Romanovsky
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-14 16:51 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Wed, Jan 13, 2021 at 11:16 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Jan 13, 2021 at 12:00:00PM -0800, Alexander Duyck wrote:
> > On Tue, Jan 12, 2021 at 10:09 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Tue, Jan 12, 2021 at 01:59:51PM -0800, Alexander Duyck wrote:
> > > > On Mon, Jan 11, 2021 at 10:39 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > On Mon, Jan 11, 2021 at 11:30:33AM -0800, Alexander Duyck wrote:
> > > > > > On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > >
> > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > >

<snip>

> >
> > > >
> > > > Also I am not a big fan of the VF groping around looking for a PF
> > > > interface as it means the interface will likely be exposed in the
> > > > guest as well, but it just won't work.
> > >
> > > If you are referring to VF exposed to the VM, so in this case VF must be
> > > bound too vfio driver, or any other driver, and won't allow MSI-X change.
> > > If you are referring to PF exposed to the VM, it is very unlikely scenario
> > > in real world and reserved for braves among us. Even in this case, the
> > > set MSI-X won't work, because PF will be connected to the hypervisor driver
> > > that doesn't support set_msix.
> > >
> > > So both cases are handled.
> >
> > I get that they are handled. However I am not a huge fan of the sysfs
> > attributes for one device being dependent on another device. When you
> > have to start searching for another device it just makes things messy.
>
> This is pretty common way, nothing new here.

This is how writable fields within the device are handled. I am pretty
sure this is the first sysfs entry that is providing a workaround via
a device firmware to make the field editable that wasn't intended to
be.

So if in the future I define a device that has an MMIO register that
allows me to edit configuration space should I just tie it into the
same framework? That is kind of where I am going with my objection to
this. It just seems like you are adding a backdoor to allow editing
read-only configuration options.

> >
> > > >
> > > > > >
> > > > > > If you are calling this on the VFs then it doesn't really make any
> > > > > > sense anyway since the VF is not a "VF PCI dev representor" and
> > > > > > shouldn't be treated as such. In my opinion if we are going to be
> > > > > > doing per-port resource limiting that is something that might make
> > > > > > more sense as a part of the devlink configuration for the VF since the
> > > > > > actual change won't be visible to an assigned device.
> > > > >
> > > > > https://lore.kernel.org/linux-pci/20210112061535.GB4678@unreal/
> > > >
> > > > So the question I would have is if we are spawning the VFs and
> > > > expecting them to have different configs or the same configuration?
> > >
> > > By default, they have same configuration.
> > >
> > > > I'm assuming in your case you are looking for a different
> > > > configuration per port. Do I have that correct?
> > >
> > > No, per-VF as represents one device in the PCI world. For example, mlx5
> > > can have more than one physical port.
> >
> > Sorry, I meant per virtual function, not per port.
>
> Yes, PCI spec is clear, MSI-X vector count is per-device and in our case
> it means per-VF.

I think you overlooked the part about it being "read-only". It isn't
really meant to be changed and that is what this patch set is
providing.

> >
> > > >
> > > > Where this gets ugly is that SR-IOV assumes a certain uniformity per
> > > > VF so doing a per-VF custom limitation gets ugly pretty quick.
> > >
> > > I don't find any support for this "uniformity" claim in the PCI spec.
> >
> > I am referring to the PCI configuration space. Each VF ends up with
> > some fixed amount of MMIO resources per function. So typically when
> > you spawn VFs we had things setup so that all you do is say how many
> > you want.
> >
> > > > I wonder if it would make more sense if we are going this route to just
> > > > define a device-tree like schema that could be fed in to enable VFs
> > > > instead of just using echo X > sriov_numvfs and then trying to fix
> > > > things afterwards. Then you could define this and other features that
> > > > I am sure you would need in the future via json-schema like is done in
> > > > device-tree and write it once enabling the set of VFs that you need.
> > >
> > > Sorry, but this is overkill, it won't give us much and it doesn't fit
> > > the VF usage model at all.
> > >
> > > Right now, all heavy users of SR-IOV are creating many VFs up to the maximum.
> > > They do it with autoprobe disabled, because it is too time consuming to wait
> > > till all VFs probe themselves and unbind them later.
> > >
> > > After that, they wait for incoming request to provision VM on VF, they set MAC
> > > address, change MSI-X according to VM properties and bind that VF to new VM.
> > >
> > > So MSI-X change is done after VFs were created.
> >
> > So if I understand correctly based on your comments below you are
> > dynamically changing the VF's MSI-X configuration space then?
>
> I'm changing "Table Size" from "7.7.2.2 Message Control Register for
> MSI-X (Offset 02h)" and nothing more.
>
> If you do raw PCI read before and after, only this field will be changed.

I would hope there is much more going on. Otherwise the VF hardware
will be exploitable by a malicious driver in the guest since you could
read/write to registers beyond the table and see some result. I am
assuming the firmware doesn't allow triggering of any interrupts
beyond the ones defined as being in the table.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-14 16:51               ` Alexander Duyck
@ 2021-01-14 17:12                 ` Leon Romanovsky
  0 siblings, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-14 17:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 08:51:22AM -0800, Alexander Duyck wrote:
> On Wed, Jan 13, 2021 at 11:16 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Jan 13, 2021 at 12:00:00PM -0800, Alexander Duyck wrote:
> > > On Tue, Jan 12, 2021 at 10:09 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Tue, Jan 12, 2021 at 01:59:51PM -0800, Alexander Duyck wrote:
> > > > > On Mon, Jan 11, 2021 at 10:39 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > >
> > > > > > On Mon, Jan 11, 2021 at 11:30:33AM -0800, Alexander Duyck wrote:
> > > > > > > On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > > >
> > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > >
>
> <snip>
>
> > >
> > > > >
> > > > > Also I am not a big fan of the VF groping around looking for a PF
> > > > > interface as it means the interface will likely be exposed in the
> > > > > guest as well, but it just won't work.
> > > >
> > > > If you are referring to VF exposed to the VM, so in this case VF must be
> > > > bound too vfio driver, or any other driver, and won't allow MSI-X change.
> > > > If you are referring to PF exposed to the VM, it is very unlikely scenario
> > > > in real world and reserved for braves among us. Even in this case, the
> > > > set MSI-X won't work, because PF will be connected to the hypervisor driver
> > > > that doesn't support set_msix.
> > > >
> > > > So both cases are handled.
> > >
> > > I get that they are handled. However I am not a huge fan of the sysfs
> > > attributes for one device being dependent on another device. When you
> > > have to start searching for another device it just makes things messy.
> >
> > This is pretty common way, nothing new here.
>
> This is how writable fields within the device are handled. I am pretty
> sure this is the first sysfs entry that is providing a workaround via
> a device firmware to make the field editable that wasn't intended to
> be.
>
> So if in the future I define a device that has an MMIO register that
> allows me to edit configuration space should I just tie it into the
> same framework? That is kind of where I am going with my objection to
> this. It just seems like you are adding a backdoor to allow editing
> read-only configuration options.

I have no objections as long as your new sysfs will make sense and will
be acceptable by PCI community.

>
> > >
> > > > >
> > > > > > >
> > > > > > > If you are calling this on the VFs then it doesn't really make any
> > > > > > > sense anyway since the VF is not a "VF PCI dev representor" and
> > > > > > > shouldn't be treated as such. In my opinion if we are going to be
> > > > > > > doing per-port resource limiting that is something that might make
> > > > > > > more sense as a part of the devlink configuration for the VF since the
> > > > > > > actual change won't be visible to an assigned device.
> > > > > >
> > > > > > https://lore.kernel.org/linux-pci/20210112061535.GB4678@unreal/
> > > > >
> > > > > So the question I would have is if we are spawning the VFs and
> > > > > expecting them to have different configs or the same configuration?
> > > >
> > > > By default, they have same configuration.
> > > >
> > > > > I'm assuming in your case you are looking for a different
> > > > > configuration per port. Do I have that correct?
> > > >
> > > > No, per-VF as represents one device in the PCI world. For example, mlx5
> > > > can have more than one physical port.
> > >
> > > Sorry, I meant per virtual function, not per port.
> >
> > Yes, PCI spec is clear, MSI-X vector count is per-device and in our case
> > it means per-VF.
>
> I think you overlooked the part about it being "read-only". It isn't
> really meant to be changed and that is what this patch set is
> providing.

We doesn't allow writes to this field, but setting "hint" to the HW on
the proper resource provisioning.

>
> > >
> > > > >
> > > > > Where this gets ugly is that SR-IOV assumes a certain uniformity per
> > > > > VF so doing a per-VF custom limitation gets ugly pretty quick.
> > > >
> > > > I don't find any support for this "uniformity" claim in the PCI spec.
> > >
> > > I am referring to the PCI configuration space. Each VF ends up with
> > > some fixed amount of MMIO resources per function. So typically when
> > > you spawn VFs we had things setup so that all you do is say how many
> > > you want.
> > >
> > > > > I wonder if it would make more sense if we are going this route to just
> > > > > define a device-tree like schema that could be fed in to enable VFs
> > > > > instead of just using echo X > sriov_numvfs and then trying to fix
> > > > > things afterwards. Then you could define this and other features that
> > > > > I am sure you would need in the future via json-schema like is done in
> > > > > device-tree and write it once enabling the set of VFs that you need.
> > > >
> > > > Sorry, but this is overkill, it won't give us much and it doesn't fit
> > > > the VF usage model at all.
> > > >
> > > > Right now, all heavy users of SR-IOV are creating many VFs up to the maximum.
> > > > They do it with autoprobe disabled, because it is too time consuming to wait
> > > > till all VFs probe themselves and unbind them later.
> > > >
> > > > After that, they wait for incoming request to provision VM on VF, they set MAC
> > > > address, change MSI-X according to VM properties and bind that VF to new VM.
> > > >
> > > > So MSI-X change is done after VFs were created.
> > >
> > > So if I understand correctly based on your comments below you are
> > > dynamically changing the VF's MSI-X configuration space then?
> >
> > I'm changing "Table Size" from "7.7.2.2 Message Control Register for
> > MSI-X (Offset 02h)" and nothing more.
> >
> > If you do raw PCI read before and after, only this field will be changed.
>
> I would hope there is much more going on. Otherwise the VF hardware
> will be exploitable by a malicious driver in the guest since you could
> read/write to registers beyond the table and see some result. I am
> assuming the firmware doesn't allow triggering of any interrupts
> beyond the ones defined as being in the table.

You are talking about internal FW implementation, which is of course
secure. I'm talking about PCI config space.

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 16:40               ` Alexander Duyck
  2021-01-14 16:48                 ` Jason Gunthorpe
@ 2021-01-14 17:26                 ` Leon Romanovsky
  1 sibling, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-14 17:26 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:
> On Wed, Jan 13, 2021 at 10:50 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Jan 13, 2021 at 02:44:45PM -0800, Alexander Duyck wrote:
> > > On Tue, Jan 12, 2021 at 10:19 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Tue, Jan 12, 2021 at 01:34:50PM -0800, Alexander Duyck wrote:
> > > > > On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > >
> > > > > > On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
> > > > > > > On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > > >
> > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > >
> > > > > > > > Some SR-IOV capable devices provide an ability to configure specific
> > > > > > > > number of MSI-X vectors on their VF prior driver is probed on that VF.
> > > > > > > >
> > > > > > > > In order to make management easy, provide new read-only sysfs file that
> > > > > > > > returns a total number of possible to configure MSI-X vectors.
> > > > > > > >
> > > > > > > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > > > > >   = 0 - feature is not supported
> > > > > > > >   > 0 - total number of MSI-X vectors to consume by the VFs
> > > > > > > >
> > > > > > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > > ---
> > > > > > > >  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
> > > > > > > >  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
> > > > > > > >  drivers/pci/pci.h                       |  3 +++
> > > > > > > >  include/linux/pci.h                     |  2 ++
> > > > > > > >  4 files changed, 50 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > > > index 05e26e5da54e..64e9b700acc9 100644
> > > > > > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > > > > > @@ -395,3 +395,17 @@ Description:
> > > > > > > >                 The file is writable if the PF is bound to a driver that
> > > > > > > >                 supports the ->sriov_set_msix_vec_count() callback and there
> > > > > > > >                 is no driver bound to the VF.
> > > > > > > > +
> > > > > > > > +What:          /sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > > > >
> > > > > > > In this case I would drop the "vf" and just go with sriov_total_msix
> > > > > > > since now you are referring to a global value instead of a per VF
> > > > > > > value.
> > > > > >
> > > > > > This field indicates the amount of MSI-X available for VFs, it doesn't
> > > > > > include PFs. The missing "_vf_" will mislead users who will believe that
> > > > > > it is all MSI-X vectors available for this device. They will need to take
> > > > > > into consideration amount of PF MSI-X in order to calculate the VF distribution.
> > > > > >
> > > > > > So I would leave "_vf_" here.
> > > > >
> > > > > The problem is you aren't indicating how many are available for an
> > > > > individual VF though, you are indicating how many are available for
> > > > > use by SR-IOV to give to the VFs. The fact that you are dealing with a
> > > > > pool makes things confusing in my opinion. For example sriov_vf_device
> > > > > describes the device ID that will be given to each VF.
> > > >
> > > > sriov_vf_device is different and is implemented accordingly to the PCI
> > > > spec, 9.3.3.11 VF Device ID (Offset 1Ah)
> > > > "This field contains the Device ID that should be presented for every VF
> > > > to the SI."
> > > >
> > > > It is one ID for all VFs.
> > >
> > > Yes, but that is what I am getting at. It is also what the device
> > > configuration will be for one VF. So when I read sriov_vf_total_msix
> > > it reads as the total for a single VF, not all of the the VFs. That is
> > > why I think dropping the "vf_" part of the name would make sense, as
> > > what you are describing is the total number of MSI-X vectors for use
> > > by SR-IOV VFs.
> >
> > I can change to anything as long as new name will give clear indication
> > that this total number is for VFs and doesn't include SR-IOV PF MSI-X.
>
> It is interesting that you make that distinction.
>
> So in the case of the Intel hardware we had one pool of MSI-X
> interrupts that was available for the entire port, both PF and VF.
> When we enabled SR-IOV we had to repartition that pool in order to
> assign interrupts to devices. So it sounds like in your case you don't
> do that and instead the PF is static and the VFs are the only piece
> that is flexible. Do I have that correct?

It is partially correct. The mlx5 devices have ability to change MSI-X
vectors of PF too, but to do so, you will need driver reload and much
more complex user interface. So we (SW) leave it as static.

>
> > >
> > > > >
> > > > > > >
> > > > > > > > +Date:          January 2021
> > > > > > > > +Contact:       Leon Romanovsky <leonro@nvidia.com>
> > > > > > > > +Description:
> > > > > > > > +               This file is associated with the SR-IOV PFs.
> > > > > > > > +               It returns a total number of possible to configure MSI-X
> > > > > > > > +               vectors on the enabled VFs.
> > > > > > > > +
> > > > > > > > +               The values returned are:
> > > > > > > > +                * > 0 - this will be total number possible to consume by VFs,
> > > > > > > > +                * = 0 - feature is not supported
> > > > > > > > +
> > > > > > > > +               If no SR-IOV VFs are enabled, this value will return 0.
> > > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > > > > > index 42c0df4158d1..0a6ddf3230fd 100644
> > > > > > > > --- a/drivers/pci/iov.c
> > > > > > > > +++ b/drivers/pci/iov.c
> > > > > > > > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
> > > > > > > >         return count;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > > > > > > > +                                       struct device_attribute *attr,
> > > > > > > > +                                       char *buf)
> > > > > > > > +{
> > > > > > > > +       struct pci_dev *pdev = to_pci_dev(dev);
> > > > > > > > +
> > > > > > > > +       return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> > > > > > > > +}
> > > > > > > > +
> > > > > > >
> > > > > > > You display it as a signed value, but unsigned values are not
> > > > > > > supported, correct?
> > > > > >
> > > > > > Right, I made it similar to the vf_msix_set. I can change.
> > > > > >
> > > > > > >
> > > > > > > >  static DEVICE_ATTR_RO(sriov_totalvfs);
> > > > > > > >  static DEVICE_ATTR_RW(sriov_numvfs);
> > > > > > > >  static DEVICE_ATTR_RO(sriov_offset);
> > > > > > > >  static DEVICE_ATTR_RO(sriov_stride);
> > > > > > > >  static DEVICE_ATTR_RO(sriov_vf_device);
> > > > > > > >  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> > > > > > > > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > > > > > > >
> > > > > > > >  static struct attribute *sriov_dev_attrs[] = {
> > > > > > > >         &dev_attr_sriov_totalvfs.attr,
> > > > > > > > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
> > > > > > > >         &dev_attr_sriov_stride.attr,
> > > > > > > >         &dev_attr_sriov_vf_device.attr,
> > > > > > > >         &dev_attr_sriov_drivers_autoprobe.attr,
> > > > > > > > +       &dev_attr_sriov_vf_total_msix.attr,
> > > > > > > >         NULL,
> > > > > > > >  };
> > > > > > > >
> > > > > > > > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
> > > > > > > >                 sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > > > > > >
> > > > > > > >         iov->num_VFs = 0;
> > > > > > > > +       iov->vf_total_msix = 0;
> > > > > > > >         pci_iov_set_numvfs(dev, 0);
> > > > > > > >  }
> > > > > > > >
> > > > > > > > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> > > > > > > >
> > > > > > > > +/**
> > > > > > > > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > > > > > > > + * @dev: the PCI PF device
> > > > > > > > + * @numb: the total number of MSI-X vector to consume by the VFs
> > > > > > > > + *
> > > > > > > > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > > > > > > > + * This interface is complimentary part of the pci_set_msix_vec_count()
> > > > > > > > + * that will be used to configure the required number on the VF.
> > > > > > > > + */
> > > > > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, int numb)
> > > > > > > > +{
> > > > > > > > +       if (!dev->is_physfn || !dev->driver ||
> > > > > > > > +           !dev->driver->sriov_set_msix_vec_count)
> > > > > > > > +               return;
> > > > > > > > +
> > > > > > > > +       dev->sriov->vf_total_msix = numb;
> > > > > > > > +}
> > > > > > > > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > > > > > > > +
> > > > > > >
> > > > > > > This seems broken. What validation is being done on the numb value?
> > > > > > > You pass it as int, and your documentation all refers to tests for >=
> > > > > > > 0, but isn't a signed input a possibility as well? Also "numb" doesn't
> > > > > > > make for a good abbreviation as it is already a word of its own. It
> > > > > > > might make more sense to use count or something like that rather than
> > > > > > > trying to abbreviate number.
> > > > > >
> > > > > > "Broken" is a nice word to describe misunderstanding.
> > > > >
> > > > > Would you prefer "lacking input validation".
> > > > >
> > > > > I see all this code in there checking for is_physfn and driver and
> > > > > sriov_set_msix_vec_count before allowing the setting of vf_total_msix.
> > > > > It just seems like a lot of validation is taking place on the wrong
> > > > > things if you are just going to be setting a value reporting the total
> > > > > number of MSI-X vectors in use for SR-IOV.
> > > >
> > > > All those checks are in place to ensure that we are not overwriting the
> > > > default value, which is 0.
> > >
> > > Okay, so what you really have is surplus interrupts that you are
> > > wanting to give out to VF devices. So when we indicate 0 here as the
> > > default it really means we have no additional interrupts to give out.
> > > Am I understanding that correctly?
> >
> > The vf_total_msix is static value and shouldn't be recalculated after
> > every MSI-X vector number change. So 0 means that driver doesn't support
> > at all this feature. The operator is responsible to make proper assignment
> > calculations, because he is already doing it for the CPUs and netdev queues.
>
> Honestly that makes things even uglier. So basically this is a feature
> where if it isn't supported it will make it look like the SR-IOV
> device doesn't support MSI-X. I realize it is just the way it is
> worded but it isn't very pretty.

I'm not native English speaker, we can work together later on the
Documentation to make it more clear.

>
> > >
> > > The problem is this is very vendor specific so I am doing my best to
> > > understand it as it is different then the other NICs I have worked
> > > with.
> >
> > There is nothing vendor specific here. There are two types of devices:
> > 1. Support this feature. - The vf_total_msix will be greater than 0 for them
> > and their FW will do sanity checks when user overwrites their default number
> > that they sat in the VF creation stage.
> > 2. Doesn't support this feature - The vf_total_msix will be 0.
> >
> > It is PCI spec, so those "other NICs" that didn't implement the PCI spec
> > will stay with option #2. It is not different from current situation.
>
> Where in the spec is this?
>
> I know in the PCI spec it says that the MSI-X table size is read-only
> and is not supposed to be written by system software. That is what is
> being overwritten right now by your patches that has me concerned.

My patches are not over-writting anything, they are asking from FW to
set more appropriate value. The field stays read-only.

<...>

> >
> > I remind you that this feature is applicable to all SR-IOV devices, we have
> > RDMA, NVMe, crypto, FPGA and netdev VFs. The devlink is not supported
> > outside of netdev world and implementation there will make this feature
> > is far from being usable.
>
> To quote the documentation:
> "devlink is an API to expose device information and resources not directly
> related to any device class, such as chip-wide/switch-ASIC-wide configuration."

There is a great world outside of netdev and it doesn't include devlink. :)

>
> > Right now, the configuration of PCI/core is done through sysfs, so let's
> > review this feature from PCI/core perspective and not from netdev where
> > sysfs is not widely used.
>
> The problem is what you are configuring is not necessarily PCI device
> specific. You are configuring the firmware which operates at a level
> above the PCI device. You just have it manifesting as a PCI behavior
> as you are editing a read-only configuration space field.

In our devices, PCI config space is managed by FW and it represents HW.

>
> Also as I mentioned a few times now the approach you are taking
> violates the PCIe spec by essentially providing a means of making a
> read-only field writable.

AS you said, we see the same picture differently.

Thansk

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 16:48                 ` Jason Gunthorpe
@ 2021-01-14 17:55                   ` Alexander Duyck
  2021-01-14 18:29                     ` Jason Gunthorpe
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-14 17:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 8:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:
>
> > Where I think you and I disagree is that I really think the MSI-X
> > table size should be fixed at one value for the life of the VF.
> > Instead of changing the table size it should be the number of vectors
> > that are functional that should be the limit. Basically there should
> > be only some that are functional and some that are essentially just
> > dummy vectors that you can write values into that will never be used.
>
> Ignoring the PCI config space to learn the # of available MSI-X
> vectors is big break on the how the device's programming ABI works.
>
> Or stated another way, that isn't compatible with any existing drivers
> so it is basically not a useful approach as it can't be deployed.
>
> I don't know why you think that is better.
>
> Jason

First off, this is technically violating the PCIe spec section 7.7.2.2
because this is the device driver modifying the Message Control
register for a device, even if it is the PF firmware modifying the VF.
The table size is something that should be set and fixed at device
creation and not changed.

The MSI-X table is essentially just an MMIO resource, and I believe it
should not be resized, just as you wouldn't expect any MMIO BAR to be
dynamically resized. Many drivers don't make use of the full MSI-X
table nor do they bother reading the size. We just populate a subset
of the table based on the number of interrupt causes we will need to
associate to interrupt handlers. You can check for yourself. There are
only a handful of drivers such as vfio that ever bother reading at the
offset "PCI_MSIX_FLAGS". Normally it is the number of interrupt causes
that determine how many we need, not the size of the table. In
addition the OS may restrict us further since there are only so many
MSI-X interrupts that are supported per system/socket.

As far as the programming ABI, having support for some number of MSI-X
vectors isn't the same as having that number of MSI-X vectors. The
MSI-X vector table entry is nothing more than a spot to write an
address/data pair with the ability to mask the value to prevent it
from being triggered. The driver/OS will associate some handler to
that address/data pair. Populating an entry doesn't guarantee the
interrupt will ever be used. The programming model for the device is
what defines what trigger will be associated with it and when/how it
will be used.

What I see this patch doing is trying to push driver PF policy onto
the VF PCIe device configuration space dynamically. Having some
limited number of interrupt causes should really be what is limiting
things here. I see that being mostly a thing between the firmware and
the VF in terms of configuration and not something that necessarily
has to be pushed down onto the PCIe configuration space itself.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 17:55                   ` Alexander Duyck
@ 2021-01-14 18:29                     ` Jason Gunthorpe
  2021-01-14 19:24                       ` Alexander Duyck
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2021-01-14 18:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 09:55:24AM -0800, Alexander Duyck wrote:
> On Thu, Jan 14, 2021 at 8:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:
> >
> > > Where I think you and I disagree is that I really think the MSI-X
> > > table size should be fixed at one value for the life of the VF.
> > > Instead of changing the table size it should be the number of vectors
> > > that are functional that should be the limit. Basically there should
> > > be only some that are functional and some that are essentially just
> > > dummy vectors that you can write values into that will never be used.
> >
> > Ignoring the PCI config space to learn the # of available MSI-X
> > vectors is big break on the how the device's programming ABI works.
> >
> > Or stated another way, that isn't compatible with any existing drivers
> > so it is basically not a useful approach as it can't be deployed.
> >
> > I don't know why you think that is better.
> >
> > Jason
> 
> First off, this is technically violating the PCIe spec section 7.7.2.2
> because this is the device driver modifying the Message Control
> register for a device, even if it is the PF firmware modifying the VF.
> The table size is something that should be set and fixed at device
> creation and not changed.

The word "violating" is rather an over-reaction, at worst this is an
extension.

> The MSI-X table is essentially just an MMIO resource, and I believe it
> should not be resized, just as you wouldn't expect any MMIO BAR to be
> dynamically resized. 

Resizing the BAR is already defined see commit 276b738deb5b ("PCI:
Add resizable BAR infrastructure")

As you say BAR and MSI vector table are not so different, it seems
perfectly in line with current PCI sig thinking to allow resizing the
MSI as well

> Many drivers don't make use of the full MSI-X table nor do they
> bother reading the size. We just populate a subset of the table
> based on the number of interrupt causes we will need to associate to
> interrupt handlers. 

This isn't about "many drivers" this is about what mlx5 does in all
the various OS drivers it has, and mlx5 has a sophisticated use of
MSI-X.

> What I see this patch doing is trying to push driver PF policy onto
> the VF PCIe device configuration space dynamically.

Huh? This is using the PF to dynamically reconfigure a child VF beyond
what the PCI spec defined. This is done safely under Linux because no
driver is bound when it is reconfigured, and any stale config data is
flushed out of any OS caches.

This is also why there is not a strong desire to standardize an ECN at
PCI-sig, the rules for how resizing can work are complicated and OS
specific.

> Having some limited number of interrupt causes should really be what
> is limiting things here. 

MSI inherently requires dedicated on-die resources to implement, so
every device has a maximum # of MSI vectors it can currently
expose. This is some consequence of various PCI rules and applies to
all devices.

To make effective use of this limited pool requires a hard restriction
enforced by the secure domain (hypervisor and FW) onto every
user. Every driver attached to the function needs to be aware of the
hard enforced limit by the secure domain to operate properly. It has
nothing to do with "limited number of interrupt causes".

The standards based way to communicate that limit is the MSI table cap
size.

To complain that changing the MSI table cap size dynamically is
non-standard then offer up a completely non-standard way to operate
MSI instead seems to miss the entire point.

The important standard is to keep the PCI config space acting per-spec
so all the various consumers can work as-is. The extension is to only
modify the rare hypervisor to support a dynamic MSI resizing extension
for VFs.

As far as applicability, any device working at high scale with MSI and
VMs is going to need this. Dynamically assigning the limited MSI HW is
really required to support the universe of VM configurations people
want. eg generally I would expect a VM to receive the number of MSI
vectors equal to the number of CPUs the VM gets.

> I see that being mostly a thing between the firmware and the VF in
> terms of configuration and not something that necessarily has to be
> pushed down onto the PCIe configuration space itself.

If mlx5 drivers had been designed long ago to never use standard based
MSI and instead did something internal with FW you might have a point,
but they weren't. All the mlx5 drivers use standards based MSI and
expect the config space to be correct.

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 18:29                     ` Jason Gunthorpe
@ 2021-01-14 19:24                       ` Alexander Duyck
  2021-01-14 19:56                         ` Leon Romanovsky
  2021-01-14 20:08                         ` Jason Gunthorpe
  0 siblings, 2 replies; 49+ messages in thread
From: Alexander Duyck @ 2021-01-14 19:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 10:29 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Jan 14, 2021 at 09:55:24AM -0800, Alexander Duyck wrote:
> > On Thu, Jan 14, 2021 at 8:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:
> > >
> > > > Where I think you and I disagree is that I really think the MSI-X
> > > > table size should be fixed at one value for the life of the VF.
> > > > Instead of changing the table size it should be the number of vectors
> > > > that are functional that should be the limit. Basically there should
> > > > be only some that are functional and some that are essentially just
> > > > dummy vectors that you can write values into that will never be used.
> > >
> > > Ignoring the PCI config space to learn the # of available MSI-X
> > > vectors is big break on the how the device's programming ABI works.
> > >
> > > Or stated another way, that isn't compatible with any existing drivers
> > > so it is basically not a useful approach as it can't be deployed.
> > >
> > > I don't know why you think that is better.
> > >
> > > Jason
> >
> > First off, this is technically violating the PCIe spec section 7.7.2.2
> > because this is the device driver modifying the Message Control
> > register for a device, even if it is the PF firmware modifying the VF.
> > The table size is something that should be set and fixed at device
> > creation and not changed.
>
> The word "violating" is rather an over-reaction, at worst this is an
> extension.
>
> > The MSI-X table is essentially just an MMIO resource, and I believe it
> > should not be resized, just as you wouldn't expect any MMIO BAR to be
> > dynamically resized.
>
> Resizing the BAR is already defined see commit 276b738deb5b ("PCI:
> Add resizable BAR infrastructure")
>
> As you say BAR and MSI vector table are not so different, it seems
> perfectly in line with current PCI sig thinking to allow resizing the
> MSI as well

The resizing of a BAR has an extended capability that goes with it and
isn't something that the device can just do on a whim. This patch set
is not based on implementing some ECN for resizable MSI-X tables. I
view it as arbitrarily rewriting the table size for a device after it
is created.

In addition Leon still hasn't answered my question on preventing the
VF driver from altering entries beyond the ones listed in the table.
My concern is that this may just be glossing over things and
introducing potential issues in the process if a VF can access
resources that don't belong to it.

> > Many drivers don't make use of the full MSI-X table nor do they
> > bother reading the size. We just populate a subset of the table
> > based on the number of interrupt causes we will need to associate to
> > interrupt handlers.
>
> This isn't about "many drivers" this is about what mlx5 does in all
> the various OS drivers it has, and mlx5 has a sophisticated use of
> MSI-X.

Can you please cite an example for me? The problem here is you are
claiming things without any proof. I feel like the requirement for
changing the VF MSI-X table size is coming from something outside of
Linux and without having any info on that I cannot really understand
the issue this is trying to resolve.

From what I can tell, the mlx5 Linux driver never reads the MSI-X
flags register so it isn't reading the MSI-X size either.

> > What I see this patch doing is trying to push driver PF policy onto
> > the VF PCIe device configuration space dynamically.
>
> Huh? This is using the PF to dynamically reconfigure a child VF beyond
> what the PCI spec defined. This is done safely under Linux because no
> driver is bound when it is reconfigured, and any stale config data is
> flushed out of any OS caches.
>
> This is also why there is not a strong desire to standardize an ECN at
> PCI-sig, the rules for how resizing can work are complicated and OS
> specific.

At a minimum I really think we need to go through and have a clear
definition on when updating the MSI-X table size is okay and when it
is not. I am not sure just saying to not update it when a driver isn't
attached is enough to guarantee all that.

On top of that the interface as defined here is rather ugly. It is
just providing a sysfs front end for a vendor proprietary
circumvention of the fact that the MSI-X table size is read-only. If
we are going to do that we might as well allow any vendor that has a
backdoor to their PCIe config go through and edit it.

> > Having some limited number of interrupt causes should really be what
> > is limiting things here.
>
> MSI inherently requires dedicated on-die resources to implement, so
> every device has a maximum # of MSI vectors it can currently
> expose. This is some consequence of various PCI rules and applies to
> all devices.
>
> To make effective use of this limited pool requires a hard restriction
> enforced by the secure domain (hypervisor and FW) onto every
> user. Every driver attached to the function needs to be aware of the
> hard enforced limit by the secure domain to operate properly. It has
> nothing to do with "limited number of interrupt causes".

What we are talking about is the MSI-X table size. Not the number of
MSI-X vectors being requested by the device driver. Those are normally
two seperate things.

I can only assume you have some out-of-tree issue or some other OS
that is the problem. If you can describe the issue in more detail for
me we have something to work with. Otherwise the request so far seems
unreasonable to me.

> The standards based way to communicate that limit is the MSI table cap
> size.

This only defines the maximum number of entries, not how many have to be used.

> To complain that changing the MSI table cap size dynamically is
> non-standard then offer up a completely non-standard way to operate
> MSI instead seems to miss the entire point.

I'm not offering up a non-standard way to do this. Just think about
it. If I have a device that defines an MSI-X table size of 2048 but
makes use of only one vector how would that be any different than what
I am suggesting where you size your VF to whatever the maximum is you
need but only make use of some fixed number from the hardware.

> The important standard is to keep the PCI config space acting per-spec
> so all the various consumers can work as-is. The extension is to only
> modify the rare hypervisor to support a dynamic MSI resizing extension
> for VFs.

I will repeat what I said before. Why can't this be handled via the
vfio interface? You just need to add the ability to specify the upper
limit on vdev->msix_size so that it is a number between 1 and whatever
your max table size is. It sounds like something that probably belongs
in the vfio_pci_ioctl somewhere. Then if you have an OS running in a
guest that cannot help itself and will allocate an interrupt for every
vector, you can simply modify that value so that it puts a cap on the
total number of vectors it will try to allocate in the guest.

> As far as applicability, any device working at high scale with MSI and
> VMs is going to need this. Dynamically assigning the limited MSI HW is
> really required to support the universe of VM configurations people
> want. eg generally I would expect a VM to receive the number of MSI
> vectors equal to the number of CPUs the VM gets.

Again, you are pushing VM requirements on to PCI. That really seems
like the realm of vfio rather than PCI. Especially since your answer
to the problem is to update a value that is only really being read by
vfio on the host. It really just seems like it would make more sense
to maybe make this part of the vfio ioctl call so you could limit the
use of the MSI-X table in the guest.

> > I see that being mostly a thing between the firmware and the VF in
> > terms of configuration and not something that necessarily has to be
> > pushed down onto the PCIe configuration space itself.
>
> If mlx5 drivers had been designed long ago to never use standard based
> MSI and instead did something internal with FW you might have a point,
> but they weren't. All the mlx5 drivers use standards based MSI and
> expect the config space to be correct.

Again, from what I can tell you are updating a field that isn't read
by the mlx5 driver. It is read/written by the OS and the field itself
is only supposed to be updated by the OS according to the PCIe spec.
While it is all well and good that the firmware can circumvent this
and modify the MMIO space I really feel like it shouldn't be doing
that.

In my mind it sounds a lot like this is something that really should
have been configured in the VFIO driver as the problems you have
described all seem to be issues either with some unknown userspace app
or OSes other than Linux.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 19:24                       ` Alexander Duyck
@ 2021-01-14 19:56                         ` Leon Romanovsky
  2021-01-14 20:08                         ` Jason Gunthorpe
  1 sibling, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-14 19:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jason Gunthorpe, Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 11:24:12AM -0800, Alexander Duyck wrote:
> On Thu, Jan 14, 2021 at 10:29 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Jan 14, 2021 at 09:55:24AM -0800, Alexander Duyck wrote:
> > > On Thu, Jan 14, 2021 at 8:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Thu, Jan 14, 2021 at 08:40:14AM -0800, Alexander Duyck wrote:
> > > >
> > > > > Where I think you and I disagree is that I really think the MSI-X
> > > > > table size should be fixed at one value for the life of the VF.
> > > > > Instead of changing the table size it should be the number of vectors
> > > > > that are functional that should be the limit. Basically there should
> > > > > be only some that are functional and some that are essentially just
> > > > > dummy vectors that you can write values into that will never be used.
> > > >
> > > > Ignoring the PCI config space to learn the # of available MSI-X
> > > > vectors is big break on the how the device's programming ABI works.
> > > >
> > > > Or stated another way, that isn't compatible with any existing drivers
> > > > so it is basically not a useful approach as it can't be deployed.
> > > >
> > > > I don't know why you think that is better.
> > > >
> > > > Jason
> > >
> > > First off, this is technically violating the PCIe spec section 7.7.2.2
> > > because this is the device driver modifying the Message Control
> > > register for a device, even if it is the PF firmware modifying the VF.
> > > The table size is something that should be set and fixed at device
> > > creation and not changed.
> >
> > The word "violating" is rather an over-reaction, at worst this is an
> > extension.
> >
> > > The MSI-X table is essentially just an MMIO resource, and I believe it
> > > should not be resized, just as you wouldn't expect any MMIO BAR to be
> > > dynamically resized.
> >
> > Resizing the BAR is already defined see commit 276b738deb5b ("PCI:
> > Add resizable BAR infrastructure")
> >
> > As you say BAR and MSI vector table are not so different, it seems
> > perfectly in line with current PCI sig thinking to allow resizing the
> > MSI as well
>
> The resizing of a BAR has an extended capability that goes with it and
> isn't something that the device can just do on a whim. This patch set
> is not based on implementing some ECN for resizable MSI-X tables. I
> view it as arbitrarily rewriting the table size for a device after it
> is created.
>
> In addition Leon still hasn't answered my question on preventing the
> VF driver from altering entries beyond the ones listed in the table.
> My concern is that this may just be glossing over things and
> introducing potential issues in the process if a VF can access
> resources that don't belong to it.
>
> > > Many drivers don't make use of the full MSI-X table nor do they
> > > bother reading the size. We just populate a subset of the table
> > > based on the number of interrupt causes we will need to associate to
> > > interrupt handlers.
> >
> > This isn't about "many drivers" this is about what mlx5 does in all
> > the various OS drivers it has, and mlx5 has a sophisticated use of
> > MSI-X.
>
> Can you please cite an example for me? The problem here is you are
> claiming things without any proof. I feel like the requirement for
> changing the VF MSI-X table size is coming from something outside of
> Linux and without having any info on that I cannot really understand
> the issue this is trying to resolve.
>
> From what I can tell, the mlx5 Linux driver never reads the MSI-X
> flags register so it isn't reading the MSI-X size either.
>
> > > What I see this patch doing is trying to push driver PF policy onto
> > > the VF PCIe device configuration space dynamically.
> >
> > Huh? This is using the PF to dynamically reconfigure a child VF beyond
> > what the PCI spec defined. This is done safely under Linux because no
> > driver is bound when it is reconfigured, and any stale config data is
> > flushed out of any OS caches.
> >
> > This is also why there is not a strong desire to standardize an ECN at
> > PCI-sig, the rules for how resizing can work are complicated and OS
> > specific.
>
> At a minimum I really think we need to go through and have a clear
> definition on when updating the MSI-X table size is okay and when it
> is not. I am not sure just saying to not update it when a driver isn't
> attached is enough to guarantee all that.
>
> On top of that the interface as defined here is rather ugly. It is
> just providing a sysfs front end for a vendor proprietary
> circumvention of the fact that the MSI-X table size is read-only. If
> we are going to do that we might as well allow any vendor that has a
> backdoor to their PCIe config go through and edit it.
>
> > > Having some limited number of interrupt causes should really be what
> > > is limiting things here.
> >
> > MSI inherently requires dedicated on-die resources to implement, so
> > every device has a maximum # of MSI vectors it can currently
> > expose. This is some consequence of various PCI rules and applies to
> > all devices.
> >
> > To make effective use of this limited pool requires a hard restriction
> > enforced by the secure domain (hypervisor and FW) onto every
> > user. Every driver attached to the function needs to be aware of the
> > hard enforced limit by the secure domain to operate properly. It has
> > nothing to do with "limited number of interrupt causes".
>
> What we are talking about is the MSI-X table size. Not the number of
> MSI-X vectors being requested by the device driver. Those are normally
> two seperate things.
>
> I can only assume you have some out-of-tree issue or some other OS
> that is the problem. If you can describe the issue in more detail for
> me we have something to work with. Otherwise the request so far seems
> unreasonable to me.
>
> > The standards based way to communicate that limit is the MSI table cap
> > size.
>
> This only defines the maximum number of entries, not how many have to be used.
>
> > To complain that changing the MSI table cap size dynamically is
> > non-standard then offer up a completely non-standard way to operate
> > MSI instead seems to miss the entire point.
>
> I'm not offering up a non-standard way to do this. Just think about
> it. If I have a device that defines an MSI-X table size of 2048 but
> makes use of only one vector how would that be any different than what
> I am suggesting where you size your VF to whatever the maximum is you
> need but only make use of some fixed number from the hardware.
>
> > The important standard is to keep the PCI config space acting per-spec
> > so all the various consumers can work as-is. The extension is to only
> > modify the rare hypervisor to support a dynamic MSI resizing extension
> > for VFs.
>
> I will repeat what I said before. Why can't this be handled via the
> vfio interface? You just need to add the ability to specify the upper
> limit on vdev->msix_size so that it is a number between 1 and whatever
> your max table size is. It sounds like something that probably belongs
> in the vfio_pci_ioctl somewhere. Then if you have an OS running in a
> guest that cannot help itself and will allocate an interrupt for every
> vector, you can simply modify that value so that it puts a cap on the
> total number of vectors it will try to allocate in the guest.
>
> > As far as applicability, any device working at high scale with MSI and
> > VMs is going to need this. Dynamically assigning the limited MSI HW is
> > really required to support the universe of VM configurations people
> > want. eg generally I would expect a VM to receive the number of MSI
> > vectors equal to the number of CPUs the VM gets.
>
> Again, you are pushing VM requirements on to PCI. That really seems
> like the realm of vfio rather than PCI. Especially since your answer
> to the problem is to update a value that is only really being read by
> vfio on the host. It really just seems like it would make more sense
> to maybe make this part of the vfio ioctl call so you could limit the
> use of the MSI-X table in the guest.
>
> > > I see that being mostly a thing between the firmware and the VF in
> > > terms of configuration and not something that necessarily has to be
> > > pushed down onto the PCIe configuration space itself.
> >
> > If mlx5 drivers had been designed long ago to never use standard based
> > MSI and instead did something internal with FW you might have a point,
> > but they weren't. All the mlx5 drivers use standards based MSI and
> > expect the config space to be correct.
>
> Again, from what I can tell you are updating a field that isn't read
> by the mlx5 driver. It is read/written by the OS and the field itself
> is only supposed to be updated by the OS according to the PCIe spec.
> While it is all well and good that the firmware can circumvent this
> and modify the MMIO space I really feel like it shouldn't be doing
> that.
>
> In my mind it sounds a lot like this is something that really should
> have been configured in the VFIO driver as the problems you have
> described all seem to be issues either with some unknown userspace app
> or OSes other than Linux.

Let me summarize:
1. Our FW treats this new MSI-X table size value in the same manner as "default" one.
If user tries to access outside of this table, it will be denied by HW.
2. This feature is for Linux and OSes based Linux. High performance
devices needs that that number of CPUs == number of channels == number of MSI-X vectors
to achieve maximum performance.
3. Device should be operable after SR-IOV ennoblement, it means lspci
over newly created VF should present real and working MSI-X table size.
It can't set to all new VFs some arbitrary max value.
4. This is not for "decreasing" number of vectors, but for increasing.
5. The field was read-only and it stays read-only.

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 19:24                       ` Alexander Duyck
  2021-01-14 19:56                         ` Leon Romanovsky
@ 2021-01-14 20:08                         ` Jason Gunthorpe
  2021-01-14 21:43                           ` Alexander Duyck
  1 sibling, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2021-01-14 20:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 11:24:12AM -0800, Alexander Duyck wrote:

> > As you say BAR and MSI vector table are not so different, it seems
> > perfectly in line with current PCI sig thinking to allow resizing the
> > MSI as well
> 
> The resizing of a BAR has an extended capability that goes with it and
> isn't something that the device can just do on a whim. This patch set
> is not based on implementing some ECN for resizable MSI-X tables. I
> view it as arbitrarily rewriting the table size for a device after it
> is created.

The only difference is resizing the BAR is backed by an ECN, and this
is an extension. The device does not "do it on a whim" the OS tells it
when to change the size, exactly like for BAR resizing.

> In addition Leon still hasn't answered my question on preventing the
> VF driver from altering entries beyond the ones listed in the table.

Of course this is blocked, the FW completely revokes the HW resource
backing the vectors.

> From what I can tell, the mlx5 Linux driver never reads the MSI-X
> flags register so it isn't reading the MSI-X size either.

I don't know why you say that. All Linux drivers call into something
like pci_alloc_irq_vectors() requesting a maximum # of vectors and
that call returns the actual allocated. Drivers can request more
vectors than the system provides, which is what mlx5 does.

Under the call chain of pci_alloc_irq_vectors() it calls
pci_msix_vec_count() which does

	pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
	return msix_table_size(control);

And eventually uses that to return the result to the driver.

So, yes, it reads the config space and ensures it doesn't allocate
more vectors than that.

Every driver using MSI does this in Linux.

Adjusting config space *directly* limits the number of vectors the
driver allocates.

You should be able to find the call chain in mlx5 based on the above
guidance.

> At a minimum I really think we need to go through and have a clear
> definition on when updating the MSI-X table size is okay and when it
> is not. I am not sure just saying to not update it when a driver isn't
> attached is enough to guarantee all that.

If you know of a real issue then please state it, other don't fear
monger "maybe" issues that don't exist.

> What we are talking about is the MSI-X table size. Not the number of
> MSI-X vectors being requested by the device driver. Those are normally
> two seperate things.

Yes, table size is what is critical. The number of entries in that BAR
memory is what needs to be controlled.

> > The standards based way to communicate that limit is the MSI table cap
> > size.
> 
> This only defines the maximum number of entries, not how many have to be used.

A driver can't use entries beyond the cap. We are not trying to
reclaim vectors that are available but not used by the OS.

> I'm not offering up a non-standard way to do this. Just think about
> it. If I have a device that defines an MSI-X table size of 2048 but
> makes use of only one vector how would that be any different than what
> I am suggesting where you size your VF to whatever the maximum is you
> need but only make use of some fixed number from the hardware.

That is completely different, in the hypervisor there is no idea how
many vectors a guest OS will create. The FW is told to only permit 1
vector. How is the guest to know this if we don't update the config
space *as the standard requires* ?

> I will repeat what I said before. Why can't this be handled via the
> vfio interface? 

1) The FW has to be told of the limit and everything has to be in sync
   If the FW is out of sync with the config space then everything
   breaks if the user makes even a small mistake - for instance
   forgetting to use the ioctl to override vfio. This is needlessly
   frail and complicated.

2) VFIO needs to know how to tell the FW the limit so it can override
   the config space with emulation. This is all device specific and I
   don't see that adding an extension to vfio is any better than
   adding one here.

3) VFIO doesn't cover any other driver that binds to the VF, so
   this "solution" leaves the normal mlx5_core functionally broken on
   VFs which is a major regression.

Overall the entire idea to have the config space not reflect the
actual current device configuration seems completely wrong to me - why
do this? For what gain? It breaks everything.

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 20:08                         ` Jason Gunthorpe
@ 2021-01-14 21:43                           ` Alexander Duyck
  2021-01-14 23:28                             ` Alex Williamson
  2021-01-15 13:51                             ` Jason Gunthorpe
  0 siblings, 2 replies; 49+ messages in thread
From: Alexander Duyck @ 2021-01-14 21:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 12:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Jan 14, 2021 at 11:24:12AM -0800, Alexander Duyck wrote:
>
> > > As you say BAR and MSI vector table are not so different, it seems
> > > perfectly in line with current PCI sig thinking to allow resizing the
> > > MSI as well
> >
> > The resizing of a BAR has an extended capability that goes with it and
> > isn't something that the device can just do on a whim. This patch set
> > is not based on implementing some ECN for resizable MSI-X tables. I
> > view it as arbitrarily rewriting the table size for a device after it
> > is created.
>
> The only difference is resizing the BAR is backed by an ECN, and this
> is an extension. The device does not "do it on a whim" the OS tells it
> when to change the size, exactly like for BAR resizing.
>
> > In addition Leon still hasn't answered my question on preventing the
> > VF driver from altering entries beyond the ones listed in the table.
>
> Of course this is blocked, the FW completely revokes the HW resource
> backing the vectors.

One of the troubles with this is that I am having to take your word
for it. The worst case scenario in my mind is that this is just was
Leon described it earlier and the firmware interface is doing nothing
more than altering the table size in the MSI-X config space so that
the value can be picked up by VFIO and advertised to the guest. In
such a situation you end up opening up a backdoor as now there are
vectors that could be configured by userspace since the protections
provided by VFIO are disabled as you could be reporting a size that is
smaller than the actual number or vectors.

These are the kind of things I am worried about with this interface.
It just seems like this is altering the VF PCIe device config to
address an issue that would be better handled by the vfio interface
ath VM creation time. At least if this was left to vfio it could
prevent the VM from being able to access the unused entries and just
limit the guest to the ones that we want to have the VM access.
Instead we are just being expected to trust the firmware for security
from the VF should it decide to try and be malicious.

> > From what I can tell, the mlx5 Linux driver never reads the MSI-X
> > flags register so it isn't reading the MSI-X size either.
>
> I don't know why you say that. All Linux drivers call into something
> like pci_alloc_irq_vectors() requesting a maximum # of vectors and
> that call returns the actual allocated. Drivers can request more
> vectors than the system provides, which is what mlx5 does.
>
> Under the call chain of pci_alloc_irq_vectors() it calls
> pci_msix_vec_count() which does
>
>         pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
>         return msix_table_size(control);
>
> And eventually uses that to return the result to the driver.
>
> So, yes, it reads the config space and ensures it doesn't allocate
> more vectors than that.
>
> Every driver using MSI does this in Linux.
>
> Adjusting config space *directly* limits the number of vectors the
> driver allocates.
>
> You should be able to find the call chain in mlx5 based on the above
> guidance.

Yes, but at the same time you also end up passing a max_vecs into the
function as you have multiple limitations that come into account from
both the driver, the system, and the table. The MSI-X table size is
just one piece. Specifically the bit in the code for the mlx5 driver
is:

nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
      MLX5_IRQ_VEC_COMP_BASE;
nvec = min_t(int, nvec, num_eqs);

So saying that the MSI-X table size is what defines the number of
interrupts you can use is only partially true. What it defines is the
aperture available in MMIO to define the possible addresses and values
to be written to trigger the interrupts. The device itself plays a
large role in defining the number of interrupts ultimately requested.

> > At a minimum I really think we need to go through and have a clear
> > definition on when updating the MSI-X table size is okay and when it
> > is not. I am not sure just saying to not update it when a driver isn't
> > attached is enough to guarantee all that.
>
> If you know of a real issue then please state it, other don't fear
> monger "maybe" issues that don't exist.

Well I don't have visibility into your firmware so I am not sure what
is going on in response to this command so forgive me when I do a bit
of fear mongering when somebody tells me that all this patch set does
is modify the VF configuration space.

As I have said my main concern is somebody really screwing something
like this up and creating a security vulnerability where they do
something exactly like updating just the MSI-X table size without
protecting the MMIO region that contains the remaining MSI-X table
entries. This is why I would be much more comfortable with something
like a vfio ioctl that says that while the device supports the number
reported in the MSI-X table size the vfio will only present some
subset of those entries to the guest. With that I could at least
review the code and verify that it is providing the expected
protections.

> > What we are talking about is the MSI-X table size. Not the number of
> > MSI-X vectors being requested by the device driver. Those are normally
> > two seperate things.
>
> Yes, table size is what is critical. The number of entries in that BAR
> memory is what needs to be controlled.

That is where we disagree. My past experience is that in the device
you have to be able to associate an interrupt cause to an interrupt
vector. Normally as a part of that the device itself will place some
limit on how many causes and vectors you can associate before you even
get to the MSI-X table. The MSI-X table size is usually a formality
that defines the upper limit on the number of entries the device might
request. The reason why most drivers don't bother asking for it or
reading it is because it is defined early as a part of the definition
of the device itself.

Going back to my earlier example I am not going to size a MSI-X table
at 2048 for a device that only has a few interrupt sources. Odds are I
would size it such that I will have enough entries in the table to
cover all my interrupt sources. Usually the limiting factor for an
MSI-X request is the system itself as too many devices requesting a
large number of interrupts may end up eating up all the vectors
available for a given CPU.

> > > The standards based way to communicate that limit is the MSI table cap
> > > size.
> >
> > This only defines the maximum number of entries, not how many have to be used.
>
> A driver can't use entries beyond the cap. We are not trying to
> reclaim vectors that are available but not used by the OS.

The MSI-X table consists of a MMIO region in one of the BARs on the
device. It is easily possible to code something up in a driver that
would go in and be able to access the region. Most sensible devices
place it in a separate BAR but even then you have to worry about them
possibly sharing the memory region internally among several devices.

The big thing I see as an issue with all this is that arbitrarily
reducing the size of the MSI-X table doesn't have any actual effect on
the MMIO resources. They are still there. So a bad firmware doing
something like reducing the table size without also restricting access
to the resources in the BAR potentially opens up a security hole as
the MSI-X vector is really nothing more than a pre-programmed PCIe
write waiting for something to trigger it. Odds are an IOMMU would
block it, but still not necessarily a good thing.

As such my preference would be to leave the MSI-X table size static at
the size of possible vectors that could be modified, and instead have
the firmware guarantee that writing to those registers has no effect.
Then when you do something like a direct assignment vfio_pci will also
guard that region of MMIO instead of only guarding a portion of it.

> > I'm not offering up a non-standard way to do this. Just think about
> > it. If I have a device that defines an MSI-X table size of 2048 but
> > makes use of only one vector how would that be any different than what
> > I am suggesting where you size your VF to whatever the maximum is you
> > need but only make use of some fixed number from the hardware.
>
> That is completely different, in the hypervisor there is no idea how
> many vectors a guest OS will create. The FW is told to only permit 1
> vector. How is the guest to know this if we don't update the config
> space *as the standard requires* ?

Doesn't the guest driver talk to the firmware? Last I knew it had to
request additional resources such as queues and those come from the
firmware don't they?

> > I will repeat what I said before. Why can't this be handled via the
> > vfio interface?
>
> 1) The FW has to be told of the limit and everything has to be in sync
>    If the FW is out of sync with the config space then everything
>    breaks if the user makes even a small mistake - for instance
>    forgetting to use the ioctl to override vfio. This is needlessly
>    frail and complicated.

That is also the way I feel about the sysfs solution.

I just see VFIO making the call to the device to notify it that it
only needs X number of vectors instead of having the VF sysfs doing
it.

> 2) VFIO needs to know how to tell the FW the limit so it can override
>    the config space with emulation. This is all device specific and I
>    don't see that adding an extension to vfio is any better than
>    adding one here.

So it depends on the setup. In my suggestion I was suggesting VFIO
defines the maximum, not the minimum. So the guest would automatically
know exactly how many it would have as the table size would be
specified inside the guest.

> 3) VFIO doesn't cover any other driver that binds to the VF, so
>    this "solution" leaves the normal mlx5_core functionally broken on
>    VFs which is a major regression.
>
> Overall the entire idea to have the config space not reflect the
> actual current device configuration seems completely wrong to me - why
> do this? For what gain? It breaks everything.

Your configuration is admittedly foreign to me as well. So if I am
understanding correctly your limiting factor isn't the number of
interrupt causes in the platform, but the firmware deciding to provide
too few interrupt vectors in the MSI-X table. I'm just kind of
surprised the firmware itself isn't providing some sort of messaging
on the number of interrupt vectors that a given device has since I
assume that it is already providing you with information on the number
of queues and such since that isn't provided by any other mechanism.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 21:43                           ` Alexander Duyck
@ 2021-01-14 23:28                             ` Alex Williamson
  2021-01-15  1:56                               ` Alexander Duyck
  2021-01-15 13:51                             ` Jason Gunthorpe
  1 sibling, 1 reply; 49+ messages in thread
From: Alex Williamson @ 2021-01-14 23:28 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jason Gunthorpe, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Thu, 14 Jan 2021 13:43:57 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Thu, Jan 14, 2021 at 12:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Jan 14, 2021 at 11:24:12AM -0800, Alexander Duyck wrote:
> >  
> > > > As you say BAR and MSI vector table are not so different, it seems
> > > > perfectly in line with current PCI sig thinking to allow resizing the
> > > > MSI as well  
> > >
> > > The resizing of a BAR has an extended capability that goes with it and
> > > isn't something that the device can just do on a whim. This patch set
> > > is not based on implementing some ECN for resizable MSI-X tables. I
> > > view it as arbitrarily rewriting the table size for a device after it
> > > is created.  
> >
> > The only difference is resizing the BAR is backed by an ECN, and this
> > is an extension. The device does not "do it on a whim" the OS tells it
> > when to change the size, exactly like for BAR resizing.
> >  
> > > In addition Leon still hasn't answered my question on preventing the
> > > VF driver from altering entries beyond the ones listed in the table.  
> >
> > Of course this is blocked, the FW completely revokes the HW resource
> > backing the vectors.  
> 
> One of the troubles with this is that I am having to take your word
> for it. The worst case scenario in my mind is that this is just was
> Leon described it earlier and the firmware interface is doing nothing
> more than altering the table size in the MSI-X config space so that
> the value can be picked up by VFIO and advertised to the guest. In
> such a situation you end up opening up a backdoor as now there are
> vectors that could be configured by userspace since the protections
> provided by VFIO are disabled as you could be reporting a size that is
> smaller than the actual number or vectors.

This is why we have interrupt remappers.  We gave up trying to actually
prevent vfio users from having access to write the MSI-X vector table,
there are too many backdoors and page-size issues that this is
impractical.  Instead we rely on interrupt remappers or equivalent
IOMMU hardware to make sure that a device can only signal the interrupts
we've programmed, which occurs via vfio ioctls making use of the MSI-X
capability information.  So if vfio thinks the limit is lower than
actually implemented in the vector table, the user shouldn't be able to
do anything with the extra vectors anyway.

Regarding the read-only nature of the MSI-X capability, would it help
if this interface posted a value which was enabled on FLR of the VF?
I think that would be legal relative to the spec, so really we're
talking about the semantics of whether we really need those a device
reset to change the value report in a read-only field.

> These are the kind of things I am worried about with this interface.
> It just seems like this is altering the VF PCIe device config to
> address an issue that would be better handled by the vfio interface
> ath VM creation time. At least if this was left to vfio it could
> prevent the VM from being able to access the unused entries and just
> limit the guest to the ones that we want to have the VM access.
> Instead we are just being expected to trust the firmware for security
> from the VF should it decide to try and be malicious.

It is possible for vfio to fake the MSI-X capability and limit what a
user can access, but I don't think that's what is being done here.
There was a previous comment that the goal is actually to expand the
number of MSI-X vectors for some devices.  vfio can't do that.  That
requires interaction and coordination with a PF driver managing a pool
of resources where we might require not only device ownership, but host
administrative privileges to prevent a user exhausting the pool.  That's
not ioctl material for vfio.

> > > From what I can tell, the mlx5 Linux driver never reads the MSI-X
> > > flags register so it isn't reading the MSI-X size either.  
> >
> > I don't know why you say that. All Linux drivers call into something
> > like pci_alloc_irq_vectors() requesting a maximum # of vectors and
> > that call returns the actual allocated. Drivers can request more
> > vectors than the system provides, which is what mlx5 does.
> >
> > Under the call chain of pci_alloc_irq_vectors() it calls
> > pci_msix_vec_count() which does
> >
> >         pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
> >         return msix_table_size(control);
> >
> > And eventually uses that to return the result to the driver.
> >
> > So, yes, it reads the config space and ensures it doesn't allocate
> > more vectors than that.
> >
> > Every driver using MSI does this in Linux.
> >
> > Adjusting config space *directly* limits the number of vectors the
> > driver allocates.
> >
> > You should be able to find the call chain in mlx5 based on the above
> > guidance.  
> 
> Yes, but at the same time you also end up passing a max_vecs into the
> function as you have multiple limitations that come into account from
> both the driver, the system, and the table. The MSI-X table size is
> just one piece. Specifically the bit in the code for the mlx5 driver
> is:
> 
> nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>       MLX5_IRQ_VEC_COMP_BASE;
> nvec = min_t(int, nvec, num_eqs);
> 
> So saying that the MSI-X table size is what defines the number of
> interrupts you can use is only partially true. What it defines is the
> aperture available in MMIO to define the possible addresses and values
> to be written to trigger the interrupts. The device itself plays a
> large role in defining the number of interrupts ultimately requested.
> 
> > > At a minimum I really think we need to go through and have a clear
> > > definition on when updating the MSI-X table size is okay and when it
> > > is not. I am not sure just saying to not update it when a driver isn't
> > > attached is enough to guarantee all that.  
> >
> > If you know of a real issue then please state it, other don't fear
> > monger "maybe" issues that don't exist.  
> 
> Well I don't have visibility into your firmware so I am not sure what
> is going on in response to this command so forgive me when I do a bit
> of fear mongering when somebody tells me that all this patch set does
> is modify the VF configuration space.
> 
> As I have said my main concern is somebody really screwing something
> like this up and creating a security vulnerability where they do
> something exactly like updating just the MSI-X table size without
> protecting the MMIO region that contains the remaining MSI-X table
> entries. This is why I would be much more comfortable with something
> like a vfio ioctl that says that while the device supports the number
> reported in the MSI-X table size the vfio will only present some
> subset of those entries to the guest. With that I could at least
> review the code and verify that it is providing the expected
> protections.

We put a lot of trust in VF firmware otherwise, perhaps too much, but I
don't see how this is outside of how we already trust VFs.  MSIs are
just a DMA write, users don't need to use the vector table to make
devices perform arbitrary DMA writes.  If we trust VF firmware to
signal MSIs with the VF RID, then the IOMMU should provide protection.

> > > What we are talking about is the MSI-X table size. Not the number of
> > > MSI-X vectors being requested by the device driver. Those are normally
> > > two seperate things.  
> >
> > Yes, table size is what is critical. The number of entries in that BAR
> > memory is what needs to be controlled.  
> 
> That is where we disagree. My past experience is that in the device
> you have to be able to associate an interrupt cause to an interrupt
> vector. Normally as a part of that the device itself will place some
> limit on how many causes and vectors you can associate before you even
> get to the MSI-X table. The MSI-X table size is usually a formality
> that defines the upper limit on the number of entries the device might
> request. The reason why most drivers don't bother asking for it or
> reading it is because it is defined early as a part of the definition
> of the device itself.
> 
> Going back to my earlier example I am not going to size a MSI-X table
> at 2048 for a device that only has a few interrupt sources. Odds are I
> would size it such that I will have enough entries in the table to
> cover all my interrupt sources. Usually the limiting factor for an
> MSI-X request is the system itself as too many devices requesting a
> large number of interrupts may end up eating up all the vectors
> available for a given CPU.
> 
> > > > The standards based way to communicate that limit is the MSI table cap
> > > > size.  
> > >
> > > This only defines the maximum number of entries, not how many have to be used.  
> >
> > A driver can't use entries beyond the cap. We are not trying to
> > reclaim vectors that are available but not used by the OS.  
> 
> The MSI-X table consists of a MMIO region in one of the BARs on the
> device. It is easily possible to code something up in a driver that
> would go in and be able to access the region. Most sensible devices
> place it in a separate BAR but even then you have to worry about them
> possibly sharing the memory region internally among several devices.
> 
> The big thing I see as an issue with all this is that arbitrarily
> reducing the size of the MSI-X table doesn't have any actual effect on
> the MMIO resources. They are still there. So a bad firmware doing
> something like reducing the table size without also restricting access
> to the resources in the BAR potentially opens up a security hole as
> the MSI-X vector is really nothing more than a pre-programmed PCIe
> write waiting for something to trigger it. Odds are an IOMMU would
> block it, but still not necessarily a good thing.

Exactly, it's nothing more than a DMA write, which a malicious user can
trigger the device to perform in other ways if not through the vector
table.  We rely on the IOMMU for protection.

> As such my preference would be to leave the MSI-X table size static at
> the size of possible vectors that could be modified, and instead have
> the firmware guarantee that writing to those registers has no effect.
> Then when you do something like a direct assignment vfio_pci will also
> guard that region of MMIO instead of only guarding a portion of it.
> 
> > > I'm not offering up a non-standard way to do this. Just think about
> > > it. If I have a device that defines an MSI-X table size of 2048 but
> > > makes use of only one vector how would that be any different than what
> > > I am suggesting where you size your VF to whatever the maximum is you
> > > need but only make use of some fixed number from the hardware.  
> >
> > That is completely different, in the hypervisor there is no idea how
> > many vectors a guest OS will create. The FW is told to only permit 1
> > vector. How is the guest to know this if we don't update the config
> > space *as the standard requires* ?  
> 
> Doesn't the guest driver talk to the firmware? Last I knew it had to
> request additional resources such as queues and those come from the
> firmware don't they?
> 
> > > I will repeat what I said before. Why can't this be handled via the
> > > vfio interface?  
> >
> > 1) The FW has to be told of the limit and everything has to be in sync
> >    If the FW is out of sync with the config space then everything
> >    breaks if the user makes even a small mistake - for instance
> >    forgetting to use the ioctl to override vfio. This is needlessly
> >    frail and complicated.  
> 
> That is also the way I feel about the sysfs solution.
> 
> I just see VFIO making the call to the device to notify it that it
> only needs X number of vectors instead of having the VF sysfs doing
> it.

Where does this occur?  As above, the device owner shouldn't
necessarily have privileges to exhaust the entire PF of vectors.  Are
we really going to go to the trouble of creating cgroups for MSI-X
vectors?  Reducing vectors through emulation is easy, but that's only a
partial solution to the goal here I think.  Thanks,

Alex

> > 2) VFIO needs to know how to tell the FW the limit so it can override
> >    the config space with emulation. This is all device specific and I
> >    don't see that adding an extension to vfio is any better than
> >    adding one here.  
> 
> So it depends on the setup. In my suggestion I was suggesting VFIO
> defines the maximum, not the minimum. So the guest would automatically
> know exactly how many it would have as the table size would be
> specified inside the guest.
> 
> > 3) VFIO doesn't cover any other driver that binds to the VF, so
> >    this "solution" leaves the normal mlx5_core functionally broken on
> >    VFs which is a major regression.
> >
> > Overall the entire idea to have the config space not reflect the
> > actual current device configuration seems completely wrong to me - why
> > do this? For what gain? It breaks everything.  
> 
> Your configuration is admittedly foreign to me as well. So if I am
> understanding correctly your limiting factor isn't the number of
> interrupt causes in the platform, but the firmware deciding to provide
> too few interrupt vectors in the MSI-X table. I'm just kind of
> surprised the firmware itself isn't providing some sort of messaging
> on the number of interrupt vectors that a given device has since I
> assume that it is already providing you with information on the number
> of queues and such since that isn't provided by any other mechanism.
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 23:28                             ` Alex Williamson
@ 2021-01-15  1:56                               ` Alexander Duyck
  2021-01-15 14:06                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-15  1:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Thu, Jan 14, 2021 at 3:28 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Thu, 14 Jan 2021 13:43:57 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
> > On Thu, Jan 14, 2021 at 12:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Jan 14, 2021 at 11:24:12AM -0800, Alexander Duyck wrote:
> > >
> > > > > As you say BAR and MSI vector table are not so different, it seems
> > > > > perfectly in line with current PCI sig thinking to allow resizing the
> > > > > MSI as well
> > > >
> > > > The resizing of a BAR has an extended capability that goes with it and
> > > > isn't something that the device can just do on a whim. This patch set
> > > > is not based on implementing some ECN for resizable MSI-X tables. I
> > > > view it as arbitrarily rewriting the table size for a device after it
> > > > is created.
> > >
> > > The only difference is resizing the BAR is backed by an ECN, and this
> > > is an extension. The device does not "do it on a whim" the OS tells it
> > > when to change the size, exactly like for BAR resizing.
> > >
> > > > In addition Leon still hasn't answered my question on preventing the
> > > > VF driver from altering entries beyond the ones listed in the table.
> > >
> > > Of course this is blocked, the FW completely revokes the HW resource
> > > backing the vectors.
> >
> > One of the troubles with this is that I am having to take your word
> > for it. The worst case scenario in my mind is that this is just was
> > Leon described it earlier and the firmware interface is doing nothing
> > more than altering the table size in the MSI-X config space so that
> > the value can be picked up by VFIO and advertised to the guest. In
> > such a situation you end up opening up a backdoor as now there are
> > vectors that could be configured by userspace since the protections
> > provided by VFIO are disabled as you could be reporting a size that is
> > smaller than the actual number or vectors.
>
> This is why we have interrupt remappers.  We gave up trying to actually
> prevent vfio users from having access to write the MSI-X vector table,
> there are too many backdoors and page-size issues that this is
> impractical.  Instead we rely on interrupt remappers or equivalent
> IOMMU hardware to make sure that a device can only signal the interrupts
> we've programmed, which occurs via vfio ioctls making use of the MSI-X
> capability information.  So if vfio thinks the limit is lower than
> actually implemented in the vector table, the user shouldn't be able to
> do anything with the extra vectors anyway.
>
> Regarding the read-only nature of the MSI-X capability, would it help
> if this interface posted a value which was enabled on FLR of the VF?
> I think that would be legal relative to the spec, so really we're
> talking about the semantics of whether we really need those a device
> reset to change the value report in a read-only field.

I realized this kind of got derailed when we started arguing about the
MSI-X table size. I am okay with the table size changing, however I am
still not a fan of the VF driver having to probe the PF driver to do
so. I really think it would work much better the other way around with
the PF having to probe the VF.

That said, it only works at the driver level. So if the firmware is
the one that is having to do this it also occured to me that if this
update happened on FLR that would probably be preferred. Specifically
if it were tied in with the devlink resources interface I think it
would be a good fit for this issue as you could have a global pool of
vf-interrupt resources that are distributed though individual
subresources where each subresource would indicate an individual VF.
In addition the resources functionality for devlink has the concept of
a posted update where changes do not apply immediately if they cannot
be updated until some event occurs such as an FLR.

Since the mlx5 already supports devlink I don't see any reason why the
driver couldn't be extended to also support the devlink resource
interface and apply it to interrupts.

> > These are the kind of things I am worried about with this interface.
> > It just seems like this is altering the VF PCIe device config to
> > address an issue that would be better handled by the vfio interface
> > ath VM creation time. At least if this was left to vfio it could
> > prevent the VM from being able to access the unused entries and just
> > limit the guest to the ones that we want to have the VM access.
> > Instead we are just being expected to trust the firmware for security
> > from the VF should it decide to try and be malicious.
>
> It is possible for vfio to fake the MSI-X capability and limit what a
> user can access, but I don't think that's what is being done here.

Yeah, I am assuming that is what is being done here. However some of
the things Leon said gave me pause when he made a comment earlier that
it was just updating the MSI-X config. That may not have been what he
meant, but what worries me is an interface like this sets up things
for something like that in the future where someone decides to
short-cut the interface.

> There was a previous comment that the goal is actually to expand the
> number of MSI-X vectors for some devices.  vfio can't do that.  That
> requires interaction and coordination with a PF driver managing a pool
> of resources where we might require not only device ownership, but host
> administrative privileges to prevent a user exhausting the pool.  That's
> not ioctl material for vfio.

In this case the PF driver isn't really managing that, it is the
firmware. Basically it is an entity outside the kernel as the PF
doesn't even have control over its own resources, it is the firmware
that decides what it gets. Thus why the mlx5 driver is apparently at
the mercy of the MSI-X table size to determine how many interrupts it
can actually have.. :)

> > > > From what I can tell, the mlx5 Linux driver never reads the MSI-X
> > > > flags register so it isn't reading the MSI-X size either.
> > >
> > > I don't know why you say that. All Linux drivers call into something
> > > like pci_alloc_irq_vectors() requesting a maximum # of vectors and
> > > that call returns the actual allocated. Drivers can request more
> > > vectors than the system provides, which is what mlx5 does.
> > >
> > > Under the call chain of pci_alloc_irq_vectors() it calls
> > > pci_msix_vec_count() which does
> > >
> > >         pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
> > >         return msix_table_size(control);
> > >
> > > And eventually uses that to return the result to the driver.
> > >
> > > So, yes, it reads the config space and ensures it doesn't allocate
> > > more vectors than that.
> > >
> > > Every driver using MSI does this in Linux.
> > >
> > > Adjusting config space *directly* limits the number of vectors the
> > > driver allocates.
> > >
> > > You should be able to find the call chain in mlx5 based on the above
> > > guidance.
> >
> > Yes, but at the same time you also end up passing a max_vecs into the
> > function as you have multiple limitations that come into account from
> > both the driver, the system, and the table. The MSI-X table size is
> > just one piece. Specifically the bit in the code for the mlx5 driver
> > is:
> >
> > nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
> >       MLX5_IRQ_VEC_COMP_BASE;
> > nvec = min_t(int, nvec, num_eqs);
> >
> > So saying that the MSI-X table size is what defines the number of
> > interrupts you can use is only partially true. What it defines is the
> > aperture available in MMIO to define the possible addresses and values
> > to be written to trigger the interrupts. The device itself plays a
> > large role in defining the number of interrupts ultimately requested.
> >
> > > > At a minimum I really think we need to go through and have a clear
> > > > definition on when updating the MSI-X table size is okay and when it
> > > > is not. I am not sure just saying to not update it when a driver isn't
> > > > attached is enough to guarantee all that.
> > >
> > > If you know of a real issue then please state it, other don't fear
> > > monger "maybe" issues that don't exist.
> >
> > Well I don't have visibility into your firmware so I am not sure what
> > is going on in response to this command so forgive me when I do a bit
> > of fear mongering when somebody tells me that all this patch set does
> > is modify the VF configuration space.
> >
> > As I have said my main concern is somebody really screwing something
> > like this up and creating a security vulnerability where they do
> > something exactly like updating just the MSI-X table size without
> > protecting the MMIO region that contains the remaining MSI-X table
> > entries. This is why I would be much more comfortable with something
> > like a vfio ioctl that says that while the device supports the number
> > reported in the MSI-X table size the vfio will only present some
> > subset of those entries to the guest. With that I could at least
> > review the code and verify that it is providing the expected
> > protections.
>
> We put a lot of trust in VF firmware otherwise, perhaps too much, but I
> don't see how this is outside of how we already trust VFs.  MSIs are
> just a DMA write, users don't need to use the vector table to make
> devices perform arbitrary DMA writes.  If we trust VF firmware to
> signal MSIs with the VF RID, then the IOMMU should provide protection.

Agreed. The IOMMU should take care of this. I'm not a fan of the
dynamic table resizing, however I can live with it. The one part that
is a hard no for me is the VF sysfs file. I would much rather see this
managed through the Devlink Resource interface.
https://www.kernel.org/doc/html/latest/networking/devlink/devlink-resource.html

That combined with the update taking place on FLR would likely be a
perfect fit in my eyes.

> > > > What we are talking about is the MSI-X table size. Not the number of
> > > > MSI-X vectors being requested by the device driver. Those are normally
> > > > two seperate things.
> > >
> > > Yes, table size is what is critical. The number of entries in that BAR
> > > memory is what needs to be controlled.
> >
> > That is where we disagree. My past experience is that in the device
> > you have to be able to associate an interrupt cause to an interrupt
> > vector. Normally as a part of that the device itself will place some
> > limit on how many causes and vectors you can associate before you even
> > get to the MSI-X table. The MSI-X table size is usually a formality
> > that defines the upper limit on the number of entries the device might
> > request. The reason why most drivers don't bother asking for it or
> > reading it is because it is defined early as a part of the definition
> > of the device itself.
> >
> > Going back to my earlier example I am not going to size a MSI-X table
> > at 2048 for a device that only has a few interrupt sources. Odds are I
> > would size it such that I will have enough entries in the table to
> > cover all my interrupt sources. Usually the limiting factor for an
> > MSI-X request is the system itself as too many devices requesting a
> > large number of interrupts may end up eating up all the vectors
> > available for a given CPU.
> >
> > > > > The standards based way to communicate that limit is the MSI table cap
> > > > > size.
> > > >
> > > > This only defines the maximum number of entries, not how many have to be used.
> > >
> > > A driver can't use entries beyond the cap. We are not trying to
> > > reclaim vectors that are available but not used by the OS.
> >
> > The MSI-X table consists of a MMIO region in one of the BARs on the
> > device. It is easily possible to code something up in a driver that
> > would go in and be able to access the region. Most sensible devices
> > place it in a separate BAR but even then you have to worry about them
> > possibly sharing the memory region internally among several devices.
> >
> > The big thing I see as an issue with all this is that arbitrarily
> > reducing the size of the MSI-X table doesn't have any actual effect on
> > the MMIO resources. They are still there. So a bad firmware doing
> > something like reducing the table size without also restricting access
> > to the resources in the BAR potentially opens up a security hole as
> > the MSI-X vector is really nothing more than a pre-programmed PCIe
> > write waiting for something to trigger it. Odds are an IOMMU would
> > block it, but still not necessarily a good thing.
>
> Exactly, it's nothing more than a DMA write, which a malicious user can
> trigger the device to perform in other ways if not through the vector
> table.  We rely on the IOMMU for protection.

I'm just not a fan of opening up more vectors, but as I said above I
would be willing to concede this part of the argument since it seems
like the mlx5 programming model requires the MSI-X table size to match
the limit.

> > As such my preference would be to leave the MSI-X table size static at
> > the size of possible vectors that could be modified, and instead have
> > the firmware guarantee that writing to those registers has no effect.
> > Then when you do something like a direct assignment vfio_pci will also
> > guard that region of MMIO instead of only guarding a portion of it.
> >
> > > > I'm not offering up a non-standard way to do this. Just think about
> > > > it. If I have a device that defines an MSI-X table size of 2048 but
> > > > makes use of only one vector how would that be any different than what
> > > > I am suggesting where you size your VF to whatever the maximum is you
> > > > need but only make use of some fixed number from the hardware.
> > >
> > > That is completely different, in the hypervisor there is no idea how
> > > many vectors a guest OS will create. The FW is told to only permit 1
> > > vector. How is the guest to know this if we don't update the config
> > > space *as the standard requires* ?
> >
> > Doesn't the guest driver talk to the firmware? Last I knew it had to
> > request additional resources such as queues and those come from the
> > firmware don't they?
> >
> > > > I will repeat what I said before. Why can't this be handled via the
> > > > vfio interface?
> > >
> > > 1) The FW has to be told of the limit and everything has to be in sync
> > >    If the FW is out of sync with the config space then everything
> > >    breaks if the user makes even a small mistake - for instance
> > >    forgetting to use the ioctl to override vfio. This is needlessly
> > >    frail and complicated.
> >
> > That is also the way I feel about the sysfs solution.
> >
> > I just see VFIO making the call to the device to notify it that it
> > only needs X number of vectors instead of having the VF sysfs doing
> > it.
>
> Where does this occur?  As above, the device owner shouldn't
> necessarily have privileges to exhaust the entire PF of vectors.  Are
> we really going to go to the trouble of creating cgroups for MSI-X
> vectors?  Reducing vectors through emulation is easy, but that's only a
> partial solution to the goal here I think.  Thanks,
>
> Alex

In my mind this would have been in response to having an ioctl call
come down indicating that we want to limit the number of MSI-X vectors
to some value. Keep in mind that also in this model I was suggesting
that the device just advertise whatever the maximum MSI-X table size
could be for the VF. So we would only ever be shrinking the device,
not growing it.

I was just kind of thinking that something like this might have uses
outside of the mlx5 case as I seem to recall, and not my memory could
be faulty, that there were some OSes such as Windows that would do the
kind of setup where it would allocate as many vectors as a device
could handle when the CPU count got high. In such a VM it might be
useful to have an option for the hypervisor to override the MSI-X
table size and tell the OS that it cannot allocate asw many vectors.

Anyway I was just kind of wandering into solution space on this trying
to come up with an alternate solution. As I said the ideal solution
for me at this point would be to see the Devlink Resource interface be
used to assign the interrupts, and then apply the changes on FLR.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-14 21:43                           ` Alexander Duyck
  2021-01-14 23:28                             ` Alex Williamson
@ 2021-01-15 13:51                             ` Jason Gunthorpe
  1 sibling, 0 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2021-01-15 13:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson

On Thu, Jan 14, 2021 at 01:43:57PM -0800, Alexander Duyck wrote:

> > > In addition Leon still hasn't answered my question on preventing the
> > > VF driver from altering entries beyond the ones listed in the table.
> >
> > Of course this is blocked, the FW completely revokes the HW resource
> > backing the vectors.
> 
> One of the troubles with this is that I am having to take your word
> for it.

This is a Linux patch review, not a security review of a HW
implementation. There are million ways to screw up a PCI device
implementation and in SRIOV the PCI device HW implementation forms
part of the trust base of the hypervisor.

If the HW API can be implemented securely and the Linux code is
appropriate is the only question here.

In this case mlx5 HW is implemented correctly and securely, if you
don't belive then you are free not to use it.

> What it defines is the aperture available in MMIO to define the
> possible addresses and values to be written to trigger the
> interrupts. The device itself plays a large role in defining the
> number of interrupts ultimately requested.

Again you are confused about what is going on here - this is about
reconfiguring the HW so that MSI vector entries exist or not - it has
absoultely nothing to do with the driver. We are not optimizing for
the case where the driver does not use MSI vectors the VF has
available.

> > > At a minimum I really think we need to go through and have a clear
> > > definition on when updating the MSI-X table size is okay and when it
> > > is not. I am not sure just saying to not update it when a driver isn't
> > > attached is enough to guarantee all that.
> >
> > If you know of a real issue then please state it, other don't fear
> > monger "maybe" issues that don't exist.
>
> Well I don't have visibility into your firmware so I am not sure what
> is going on in response to this command so forgive me when I do a bit
> of fear mongering when somebody tells me that all this patch set does
> is modify the VF configuration space.

You were not talking about the FW, "is okay and when it is not" is a
*Linux* question.

> > > What we are talking about is the MSI-X table size. Not the number of
> > > MSI-X vectors being requested by the device driver. Those are normally
> > > two seperate things.
> >
> > Yes, table size is what is critical. The number of entries in that BAR
> > memory is what needs to be controlled.
> 
> That is where we disagree. 

Huh? You are disagreeing this is how the mlx5 device works?

> Normally as a part of that the device itself will place some
> limit on how many causes and vectors you can associate before you even
> get to the MSI-X table.

For mlx5 this cause limit is huge. With IMS it can even be higher than
the 2K MSI-X limit. Remember on an x86 system you get 256 interrupt
vectors per CPU *and* per vCPU, so with interrupt remapping there can
be huge numbers of interrupts required.

Your "normally" is for simplistic fixed function HW devices not
intended for use at this scale.

> The MSI-X table size is usually a formality that defines the upper
> limit on the number of entries the device might request.

It is not a formality. PCI rules require *actual physical HW* to be
dedicated to the MSI vector entries.

Think of it like this - the device has a large global MSI-X table of
say 2K entires. This is the actual physical HW SRAM backing MSI
entires required by PCIe.

The HW will map the MSI-X table BAR space in every PF/VF to a slice of
that global table. If the PCI Cap says 8 entries then the MSI-X page has
only 8 entries, everything else is /dev/null.

Global MSI entries cannot be shared - the total of all PF/VFs cap
field must not be more than 2K.

One application requires 2K MSI-X on a single function because it uses
VDPA devices and VT-d interrupt remapping

Another application requires 16 MSI-X on 128 VFs because it is using
SRIOV with VMs having 16 vCPUs.

The HW is configured differently in both cases. It is not something
that can be faked with VFIO!

> > That is completely different, in the hypervisor there is no idea how
> > many vectors a guest OS will create. The FW is told to only permit 1
> > vector. How is the guest to know this if we don't update the config
> > space *as the standard requires* ?
> 
> Doesn't the guest driver talk to the firmware? Last I knew it had to
> request additional resources such as queues and those come from the
> firmware don't they?

That is not how things work. Because VFIO has to be involved in
setting up interrupt remapping through its MSI emulation we don't get
to use a dynamic FW only path as something like IMS might imagine.

That would be so much better, but lots of things are not ready for
that.

> > 1) The FW has to be told of the limit and everything has to be in sync
> >    If the FW is out of sync with the config space then everything
> >    breaks if the user makes even a small mistake - for instance
> >    forgetting to use the ioctl to override vfio. This is needlessly
> >    frail and complicated.
> 
> That is also the way I feel about the sysfs solution.

Huh? The sysfs is much safer. If the write() succeeds I can't think of
any way the system would be left broken? Why do you think it is frail?

> I'm just kind of surprised the firmware itself isn't providing some
> sort of messaging on the number of interrupt vectors that a given

It does, it is the PCI cap, just because you keep saying it isn't used
doesn't make that true :)

> device has since I assume that it is already providing you with
> information on the number of queues and such since that isn't
> provided by any other mechanism.

Queues are effectively unlimited.

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-15  1:56                               ` Alexander Duyck
@ 2021-01-15 14:06                                 ` Jason Gunthorpe
  2021-01-15 15:53                                   ` Leon Romanovsky
  2021-01-16  4:32                                   ` Alexander Duyck
  0 siblings, 2 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2021-01-15 14:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alex Williamson, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:

> That said, it only works at the driver level. So if the firmware is
> the one that is having to do this it also occured to me that if this
> update happened on FLR that would probably be preferred. 

FLR is not free, I'd prefer not to require it just for some
philosophical reason.

> Since the mlx5 already supports devlink I don't see any reason why the
> driver couldn't be extended to also support the devlink resource
> interface and apply it to interrupts.

So you are OK with the PF changing the VF as long as it is devlink not
sysfs? Seems rather arbitary?

Leon knows best, but if I recall devlink becomes wonky when the VF
driver doesn't provide a devlink instance. How does it do reload of a
VF then?

I think you end up with essentially the same logic as presented here
with sysfs.

> > It is possible for vfio to fake the MSI-X capability and limit what a
> > user can access, but I don't think that's what is being done here.
> 
> Yeah, I am assuming that is what is being done here. 

Just to be really clear, that assumption is wrong

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-15 14:06                                 ` Jason Gunthorpe
@ 2021-01-15 15:53                                   ` Leon Romanovsky
  2021-01-16  1:48                                     ` Alexander Duyck
  2021-01-16  4:32                                   ` Alexander Duyck
  1 sibling, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-15 15:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Duyck, Alex Williamson, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
>
> > That said, it only works at the driver level. So if the firmware is
> > the one that is having to do this it also occured to me that if this
> > update happened on FLR that would probably be preferred.
>
> FLR is not free, I'd prefer not to require it just for some
> philosophical reason.
>
> > Since the mlx5 already supports devlink I don't see any reason why the
> > driver couldn't be extended to also support the devlink resource
> > interface and apply it to interrupts.
>
> So you are OK with the PF changing the VF as long as it is devlink not
> sysfs? Seems rather arbitary?
>
> Leon knows best, but if I recall devlink becomes wonky when the VF
> driver doesn't provide a devlink instance. How does it do reload of a
> VF then?
>
> I think you end up with essentially the same logic as presented here
> with sysfs.

The reasons why I decided to go with sysfs are:
1. This MSI-X table size change is applicable to ALL devices in the world,
and not only netdev.
2. This is purely PCI field and apply equally with same logic to all
subsystems and not to netdev only.
3. The sysfs interface is the standard way of configuring PCI/core, not
devlink.
4. This is how orchestration software provisioning VFs already. It fits
real world usage of SR-IOV, not the artificial one that is proposed during
the discussion.

So the idea to use devlink just because mlx5 supports it, sound really
wrong to me. If it was other driver from another subsystem without
devlink support, the request to use devlink won't never come.

Thanks

>
> > > It is possible for vfio to fake the MSI-X capability and limit what a
> > > user can access, but I don't think that's what is being done here.
> >
> > Yeah, I am assuming that is what is being done here.
>
> Just to be really clear, that assumption is wrong
>
> Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-15 15:53                                   ` Leon Romanovsky
@ 2021-01-16  1:48                                     ` Alexander Duyck
  2021-01-16  8:20                                       ` Leon Romanovsky
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-16  1:48 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Alex Williamson, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
> > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
> >
> > > That said, it only works at the driver level. So if the firmware is
> > > the one that is having to do this it also occured to me that if this
> > > update happened on FLR that would probably be preferred.
> >
> > FLR is not free, I'd prefer not to require it just for some
> > philosophical reason.
> >
> > > Since the mlx5 already supports devlink I don't see any reason why the
> > > driver couldn't be extended to also support the devlink resource
> > > interface and apply it to interrupts.
> >
> > So you are OK with the PF changing the VF as long as it is devlink not
> > sysfs? Seems rather arbitary?
> >
> > Leon knows best, but if I recall devlink becomes wonky when the VF
> > driver doesn't provide a devlink instance. How does it do reload of a
> > VF then?
> >
> > I think you end up with essentially the same logic as presented here
> > with sysfs.
>
> The reasons why I decided to go with sysfs are:
> 1. This MSI-X table size change is applicable to ALL devices in the world,
> and not only netdev.

In the PCI world MSI-X table size is a read only value. That is why I
am pushing back on this as a PCI interface.

> 2. This is purely PCI field and apply equally with same logic to all
> subsystems and not to netdev only.

Again, calling this "purely PCI" is the sort of wording that has me
concerned. I would prefer it if we avoid that wording. There is much
more to this than just modifying the table size field. The firmware is
having to shift resources between devices and this potentially has an
effect on the entire part, not just one VF.

> 3. The sysfs interface is the standard way of configuring PCI/core, not
> devlink.

This isn't PCI core that is being configured. It is the firmware for
the device. You are working with resources that are shared between
multiple functions.

> 4. This is how orchestration software provisioning VFs already. It fits
> real world usage of SR-IOV, not the artificial one that is proposed during
> the discussion.

What do you mean this is how they are doing it already? Do you have
something out-of-tree and that is why you are fighting to keep the
sysfs? If so that isn't a valid argument.

> So the idea to use devlink just because mlx5 supports it, sound really
> wrong to me. If it was other driver from another subsystem without
> devlink support, the request to use devlink won't never come.
>
> Thanks

I am suggesting the devlink resources interface because it would be a
VERY good fit for something like this. By the definition of it:
``devlink`` provides the ability for drivers to register resources, which
can allow administrators to see the device restrictions for a given
resource, as well as how much of the given resource is currently
in use. Additionally, these resources can optionally have configurable size.
This could enable the administrator to limit the number of resources that
are used.

Even looking over the example usage I don't see there being much to
prevent you from applying it to this issue. In addition it has the
idea of handling changes that cannot be immediately applied already
included. Your current solution doesn't have a good way of handling
that and instead just aborts with an error.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-15 14:06                                 ` Jason Gunthorpe
  2021-01-15 15:53                                   ` Leon Romanovsky
@ 2021-01-16  4:32                                   ` Alexander Duyck
  2021-01-18 15:47                                     ` Jason Gunthorpe
  1 sibling, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-16  4:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Fri, Jan 15, 2021 at 6:06 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
>
> > That said, it only works at the driver level. So if the firmware is
> > the one that is having to do this it also occured to me that if this
> > update happened on FLR that would probably be preferred.
>
> FLR is not free, I'd prefer not to require it just for some
> philosophical reason.

It wasn't so much a philosophical thing as the fact that it can sort
of take the place as a reload. Essentially with an FLR we are
rewriting the configuration so if the driver were involved it would be
a good time to pull in the MSI-X table update. However looking over
the mlx5 code I don't see any handling of FLR in there so I am
assuming that is handled by the firmware.

> > Since the mlx5 already supports devlink I don't see any reason why the
> > driver couldn't be extended to also support the devlink resource
> > interface and apply it to interrupts.
>
> So you are OK with the PF changing the VF as long as it is devlink not
> sysfs? Seems rather arbitary?

It is about the setup of things. The sysfs existing in the VF is kind
of ugly since it is a child device calling up to the parent and
telling it how it is supposed to be configured. I'm sure in theory we
could probably even have the VF request something like that itself
through some sort of mailbox and cut out the middle-man but that would
be even uglier.

If you take a look at the usage of pci_physfn it is usually in spots
where the PF is being looked for in order to find the policy that is
supposed to be applied to the VF. This is one of the first few cases
where it is being used to set the policy for the VF.

> Leon knows best, but if I recall devlink becomes wonky when the VF
> driver doesn't provide a devlink instance. How does it do reload of a
> VF then?

In my mind it was the PF driver providing a devlink instance for the
VF if a driver isn't loaded. Then if the mlx5 driver was loaded on the
VF you would replace that instance with one supported by the VF itself
in order to coordinate with the VF driver. That way if the mlx5 driver
is loaded on the VF you can still change the settings instead of being
blocked by your own driver.

As far as a reload the non-driver loaded case would probably look a
lot like how things are handled now with the taking of the device
lock, verifying no driver is loaded, notifying the firmware, and
releasing the lock when it is complete. If the mlx5 driver is loaded
on the VF it could be a more complete setup that would probably look
more like your standard driver reinit.

> I think you end up with essentially the same logic as presented here
> with sysfs.

It is similar, however the big difference is how the control is setup.
With the VF sysfs file running things it feels sort of like the tail
wagging the dog. You are having to go through and verify that this is
a VF, that the PF is present, that the PF supports this operation and
so on. If the PF is in charge of managing the configuration it should
be the one registering the interfaces, not the VF. That is my view on
this anyway as I feel it simplifies this quite a bit as the interface
won't be there if it isn't supported.:

> > > It is possible for vfio to fake the MSI-X capability and limit what a
> > > user can access, but I don't think that's what is being done here.
> >
> > Yeah, I am assuming that is what is being done here.
>
> Just to be really clear, that assumption is wrong

I misspoke and meant to agree with Alex's comment. If you are saying I
was wrong, then yes, I was wrong. I meant that I was assuming you were
resizing the actual table in the MMIO region where the MSI-X table and
PBA bits are present.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-16  1:48                                     ` Alexander Duyck
@ 2021-01-16  8:20                                       ` Leon Romanovsky
  2021-01-18  3:16                                         ` Alexander Duyck
  0 siblings, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-16  8:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jason Gunthorpe, Alex Williamson, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote:
> On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
> > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
> > >
> > > > That said, it only works at the driver level. So if the firmware is
> > > > the one that is having to do this it also occured to me that if this
> > > > update happened on FLR that would probably be preferred.
> > >
> > > FLR is not free, I'd prefer not to require it just for some
> > > philosophical reason.
> > >
> > > > Since the mlx5 already supports devlink I don't see any reason why the
> > > > driver couldn't be extended to also support the devlink resource
> > > > interface and apply it to interrupts.
> > >
> > > So you are OK with the PF changing the VF as long as it is devlink not
> > > sysfs? Seems rather arbitary?
> > >
> > > Leon knows best, but if I recall devlink becomes wonky when the VF
> > > driver doesn't provide a devlink instance. How does it do reload of a
> > > VF then?
> > >
> > > I think you end up with essentially the same logic as presented here
> > > with sysfs.
> >
> > The reasons why I decided to go with sysfs are:
> > 1. This MSI-X table size change is applicable to ALL devices in the world,
> > and not only netdev.
>
> In the PCI world MSI-X table size is a read only value. That is why I
> am pushing back on this as a PCI interface.

And it stays read-only.

>
> > 2. This is purely PCI field and apply equally with same logic to all
> > subsystems and not to netdev only.
>
> Again, calling this "purely PCI" is the sort of wording that has me
> concerned. I would prefer it if we avoid that wording. There is much
> more to this than just modifying the table size field. The firmware is
> having to shift resources between devices and this potentially has an
> effect on the entire part, not just one VF.

It is internal to HW implementation, dumb device can solve it differently.

>
> > 3. The sysfs interface is the standard way of configuring PCI/core, not
> > devlink.
>
> This isn't PCI core that is being configured. It is the firmware for
> the device. You are working with resources that are shared between
> multiple functions.

I'm ensuring that "lspci -vv .." will work correctly after such change.
It is PCI core responsibility.

>
> > 4. This is how orchestration software provisioning VFs already. It fits
> > real world usage of SR-IOV, not the artificial one that is proposed during
> > the discussion.
>
> What do you mean this is how they are doing it already? Do you have
> something out-of-tree and that is why you are fighting to keep the
> sysfs? If so that isn't a valid argument.

I have Kubernetes and OpenStack, indeed they are not part of the kernel tree.
They already use sriov_driver_autoprobe sysfs knob to disable autobind
before even starting. They configure MACs and bind VFs through sysfs/netlink
already. For them, the read/write of sysfs that is going to be bound to
the already created VM with known CPU properties, fits perfectly.

>
> > So the idea to use devlink just because mlx5 supports it, sound really
> > wrong to me. If it was other driver from another subsystem without
> > devlink support, the request to use devlink won't never come.
> >
> > Thanks
>
> I am suggesting the devlink resources interface because it would be a
> VERY good fit for something like this. By the definition of it:
> ``devlink`` provides the ability for drivers to register resources, which
> can allow administrators to see the device restrictions for a given
> resource, as well as how much of the given resource is currently
> in use. Additionally, these resources can optionally have configurable size.
> This could enable the administrator to limit the number of resources that
> are used.

It is not resource, but HW objects. The devlink doesn't even see the VFs
as long as they are not bound to the drivers.

This is an example:

[root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_drivers_autoprobe
[root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
[ 2370.579711] mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3)
[root@vm ~]# echo 2 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
[ 2377.663666] mlx5_core 0000:01:00.0: E-Switch: Enable: mode(LEGACY), nvfs(2), active vports(3)
[ 2377.777010] pci 0000:01:00.1: [15b3:101c] type 00 class 0x020000
[ 2377.784903] pci 0000:01:00.2: [15b3:101c] type 00 class 0x020000
[root@vm ~]# devlink dev
pci/0000:01:00.0
[root@vm ~]# lspci |grep nox
01:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
01:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.2 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]

So despite us having 2 VFs ready to be given to VMs, administrator doesn't
see them as devices.

>
> Even looking over the example usage I don't see there being much to
> prevent you from applying it to this issue. In addition it has the
> idea of handling changes that cannot be immediately applied already
> included. Your current solution doesn't have a good way of handling
> that and instead just aborts with an error.

Yes, because it is HW resource that should be applied immediately to
make sure that it is honored, before it is committed to the users.

It is very tempting to use devlink everywhere, but it is really wrong
tool for this scenario.

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-16  8:20                                       ` Leon Romanovsky
@ 2021-01-18  3:16                                         ` Alexander Duyck
  2021-01-18  7:20                                           ` Leon Romanovsky
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-18  3:16 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Alex Williamson, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote:
> > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
> > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
> > > >
> > > > > That said, it only works at the driver level. So if the firmware is
> > > > > the one that is having to do this it also occured to me that if this
> > > > > update happened on FLR that would probably be preferred.
> > > >
> > > > FLR is not free, I'd prefer not to require it just for some
> > > > philosophical reason.
> > > >
> > > > > Since the mlx5 already supports devlink I don't see any reason why the
> > > > > driver couldn't be extended to also support the devlink resource
> > > > > interface and apply it to interrupts.
> > > >
> > > > So you are OK with the PF changing the VF as long as it is devlink not
> > > > sysfs? Seems rather arbitary?
> > > >
> > > > Leon knows best, but if I recall devlink becomes wonky when the VF
> > > > driver doesn't provide a devlink instance. How does it do reload of a
> > > > VF then?
> > > >
> > > > I think you end up with essentially the same logic as presented here
> > > > with sysfs.
> > >
> > > The reasons why I decided to go with sysfs are:
> > > 1. This MSI-X table size change is applicable to ALL devices in the world,
> > > and not only netdev.
> >
> > In the PCI world MSI-X table size is a read only value. That is why I
> > am pushing back on this as a PCI interface.
>
> And it stays read-only.

Only if you come at it directly. What this is adding is a back door
that is visible as a part of the VF sysfs.

> >
> > > 2. This is purely PCI field and apply equally with same logic to all
> > > subsystems and not to netdev only.
> >
> > Again, calling this "purely PCI" is the sort of wording that has me
> > concerned. I would prefer it if we avoid that wording. There is much
> > more to this than just modifying the table size field. The firmware is
> > having to shift resources between devices and this potentially has an
> > effect on the entire part, not just one VF.
>
> It is internal to HW implementation, dumb device can solve it differently.

That is my point. I am worried about "dumb devices" that may follow. I
would like to see the steps that should be taken to prevent these sort
of things called out specifically. Basically this isn't just modifying
the PCIe config space, it is actually resizing the PBA and MSI-X
table.

> >
> > > 3. The sysfs interface is the standard way of configuring PCI/core, not
> > > devlink.
> >
> > This isn't PCI core that is being configured. It is the firmware for
> > the device. You are working with resources that are shared between
> > multiple functions.
>
> I'm ensuring that "lspci -vv .." will work correctly after such change.
> It is PCI core responsibility.

The current code doesn't work on anything with a driver loaded on it.
In addition the messaging provided is fairly minimal which results in
an interface that will be difficult to understand when it doesn't
work. In addition there is currently only one piece of hardware that
works with this interface which is the mlx5. My concern is this is
adding overhead to all VFs that will not be used by most SR-IOV
capable devices. In my view it would make much more sense to have a
top-down approach instead of bottom-up where the PF is registering
interfaces for the VFs.

If you want yet another compromise I would be much happier with the PF
registering the sysfs interfaces on the VFs rather than the VFs
registering the interface and hoping the PF supports it. At least with
that you are guaranteed the PF will respond to the interface when it
is registered.

> >
> > > 4. This is how orchestration software provisioning VFs already. It fits
> > > real world usage of SR-IOV, not the artificial one that is proposed during
> > > the discussion.
> >
> > What do you mean this is how they are doing it already? Do you have
> > something out-of-tree and that is why you are fighting to keep the
> > sysfs? If so that isn't a valid argument.
>
> I have Kubernetes and OpenStack, indeed they are not part of the kernel tree.
> They already use sriov_driver_autoprobe sysfs knob to disable autobind
> before even starting. They configure MACs and bind VFs through sysfs/netlink
> already. For them, the read/write of sysfs that is going to be bound to
> the already created VM with known CPU properties, fits perfectly.

By that argument the same could be said about netlink. What I don't
get is why it is okay to configure the MAC through netlink but
suddenly when we are talking about interrupts it is out of the
question. As far as the binding that is the driver interface which is
more or less grandfathered in anyway as there aren't too many ways to
deal with them as there isn't an alternate interface for the drivers
to define support.

> >
> > > So the idea to use devlink just because mlx5 supports it, sound really
> > > wrong to me. If it was other driver from another subsystem without
> > > devlink support, the request to use devlink won't never come.
> > >
> > > Thanks
> >
> > I am suggesting the devlink resources interface because it would be a
> > VERY good fit for something like this. By the definition of it:
> > ``devlink`` provides the ability for drivers to register resources, which
> > can allow administrators to see the device restrictions for a given
> > resource, as well as how much of the given resource is currently
> > in use. Additionally, these resources can optionally have configurable size.
> > This could enable the administrator to limit the number of resources that
> > are used.
>
> It is not resource, but HW objects. The devlink doesn't even see the VFs
> as long as they are not bound to the drivers.
>
> This is an example:
>
> [root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_drivers_autoprobe
> [root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
> [ 2370.579711] mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3)
> [root@vm ~]# echo 2 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
> [ 2377.663666] mlx5_core 0000:01:00.0: E-Switch: Enable: mode(LEGACY), nvfs(2), active vports(3)
> [ 2377.777010] pci 0000:01:00.1: [15b3:101c] type 00 class 0x020000
> [ 2377.784903] pci 0000:01:00.2: [15b3:101c] type 00 class 0x020000
> [root@vm ~]# devlink dev
> pci/0000:01:00.0
> [root@vm ~]# lspci |grep nox
> 01:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> 01:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
> 01:00.2 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
>
> So despite us having 2 VFs ready to be given to VMs, administrator doesn't
> see them as devices.

The MSI-X vectors are a resource assigned to hardware objects. It just
depends on how you want to look at things. Right now you have the VFs
register an interface on behalf of the PF. I am arguing it would be
better to have the PF register an interface on behalf of the VFs.
Ultimately the PF is responsible for creating the VFs in the first
place. I don't see it as that much of a leap to have the
mlx5_sriov_enable call register interfaces for the VFs so that you can
configure the MSI-X vectors from the PF, and then tear them down
before it frees the VFs. Having the VFs do the work seems error prone
since it is assuming the interfaces are there on the PF when in all
cases but one (mlx5) it currently isn't.

> >
> > Even looking over the example usage I don't see there being much to
> > prevent you from applying it to this issue. In addition it has the
> > idea of handling changes that cannot be immediately applied already
> > included. Your current solution doesn't have a good way of handling
> > that and instead just aborts with an error.
>
> Yes, because it is HW resource that should be applied immediately to
> make sure that it is honored, before it is committed to the users.

The problem is you cannot do that at all if the driver is already
loaded. One advantage of using something like devlink is that you
could potentially have the VF driver help to coordinate things so you
could have the case where the VF has the mlx5 driver loaded work
correctly where you would update the MSI-X vector count and then
trigger the driver reload via devlink.

> It is very tempting to use devlink everywhere, but it is really wrong
> tool for this scenario.

We can agree to disagree there. I am not a fan of sysfs being applied
everywhere either. The problem is it is an easy goto when someone is
looking for a quick and dirty solution and often leads to more
problems later as it usually misses critical path locking issues and
the like.

Especially when it is making a subordinate interface look like the
MSI-X table size is somehow writable. I would much rather the creation
of any interface controlled more directly by the PF, or at a minimum
have the PF registering the interfaces rather than leaving this up to
the VF in the hopes that the PF provides the functionality needed to
service the request.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-18  3:16                                         ` Alexander Duyck
@ 2021-01-18  7:20                                           ` Leon Romanovsky
  2021-01-18 13:28                                             ` Leon Romanovsky
  0 siblings, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-18  7:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jason Gunthorpe, Alex Williamson, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Sun, Jan 17, 2021 at 07:16:30PM -0800, Alexander Duyck wrote:
> On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote:
> > > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
> > > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
> > > > >
> > > > > > That said, it only works at the driver level. So if the firmware is
> > > > > > the one that is having to do this it also occured to me that if this
> > > > > > update happened on FLR that would probably be preferred.
> > > > >
> > > > > FLR is not free, I'd prefer not to require it just for some
> > > > > philosophical reason.
> > > > >
> > > > > > Since the mlx5 already supports devlink I don't see any reason why the
> > > > > > driver couldn't be extended to also support the devlink resource
> > > > > > interface and apply it to interrupts.
> > > > >
> > > > > So you are OK with the PF changing the VF as long as it is devlink not
> > > > > sysfs? Seems rather arbitary?
> > > > >
> > > > > Leon knows best, but if I recall devlink becomes wonky when the VF
> > > > > driver doesn't provide a devlink instance. How does it do reload of a
> > > > > VF then?
> > > > >
> > > > > I think you end up with essentially the same logic as presented here
> > > > > with sysfs.
> > > >
> > > > The reasons why I decided to go with sysfs are:
> > > > 1. This MSI-X table size change is applicable to ALL devices in the world,
> > > > and not only netdev.
> > >
> > > In the PCI world MSI-X table size is a read only value. That is why I
> > > am pushing back on this as a PCI interface.
> >
> > And it stays read-only.
>
> Only if you come at it directly. What this is adding is a back door
> that is visible as a part of the VF sysfs.
>
> > >
> > > > 2. This is purely PCI field and apply equally with same logic to all
> > > > subsystems and not to netdev only.
> > >
> > > Again, calling this "purely PCI" is the sort of wording that has me
> > > concerned. I would prefer it if we avoid that wording. There is much
> > > more to this than just modifying the table size field. The firmware is
> > > having to shift resources between devices and this potentially has an
> > > effect on the entire part, not just one VF.
> >
> > It is internal to HW implementation, dumb device can solve it differently.
>
> That is my point. I am worried about "dumb devices" that may follow. I
> would like to see the steps that should be taken to prevent these sort
> of things called out specifically. Basically this isn't just modifying
> the PCIe config space, it is actually resizing the PBA and MSI-X
> table.

Exactly the last line the dumb device can implement differently. The
request is simple - configure MSI-X table size to be the new size.

>
> > >
> > > > 3. The sysfs interface is the standard way of configuring PCI/core, not
> > > > devlink.
> > >
> > > This isn't PCI core that is being configured. It is the firmware for
> > > the device. You are working with resources that are shared between
> > > multiple functions.
> >
> > I'm ensuring that "lspci -vv .." will work correctly after such change.
> > It is PCI core responsibility.
>
> The current code doesn't work on anything with a driver loaded on it.

The problem that no one care about this case, because in opposite to
other devices that usually operates in the hypervisor and probed during
the boot, the VFs are used differently. They run in VMs, probed there
and (usually) not needed in hypervisor.

The driver reload would make sense if PF MSI-X table was changed.

> In addition the messaging provided is fairly minimal which results in
> an interface that will be difficult to understand when it doesn't
> work.

I'm fond of simple interfaces: 0, EBUSY and EINVAL are common way
to inform user. We must remember that this interface is for low-level
PCI property and is needed for expert users who needs to squeeze maximum
for their VMs out of expensive high speed network card that supports SR-IOV.

According to the ebay, the CX6 card costs between 1000 and 1700 USDs,
not really home equipment.

> In addition there is currently only one piece of hardware that
> works with this interface which is the mlx5.

It is not different from any other feature, someone should be first.
This has very clear purpose, scoped well and understandable when and
why it is needed.

Kernel is full of devices and features that exist in one device only.

> My concern is this is
> adding overhead to all VFs that will not be used by most SR-IOV
> capable devices. In my view it would make much more sense to have a
> top-down approach instead of bottom-up where the PF is registering
> interfaces for the VFs.
>
> If you want yet another compromise I would be much happier with the PF
> registering the sysfs interfaces on the VFs rather than the VFs
> registering the interface and hoping the PF supports it. At least with
> that you are guaranteed the PF will respond to the interface when it
> is registered.

Thanks a lot, I appreciate it, will take a look now.

>
> > >
> > > > 4. This is how orchestration software provisioning VFs already. It fits
> > > > real world usage of SR-IOV, not the artificial one that is proposed during
> > > > the discussion.
> > >
> > > What do you mean this is how they are doing it already? Do you have
> > > something out-of-tree and that is why you are fighting to keep the
> > > sysfs? If so that isn't a valid argument.
> >
> > I have Kubernetes and OpenStack, indeed they are not part of the kernel tree.
> > They already use sriov_driver_autoprobe sysfs knob to disable autobind
> > before even starting. They configure MACs and bind VFs through sysfs/netlink
> > already. For them, the read/write of sysfs that is going to be bound to
> > the already created VM with known CPU properties, fits perfectly.
>
> By that argument the same could be said about netlink. What I don't
> get is why it is okay to configure the MAC through netlink but
> suddenly when we are talking about interrupts it is out of the
> question.

They belong to different subsystems, while MAC is applicable to the
netdev (both PF and VFs), MSI-X is applicable to all devices.

I'm not arguing about netlink vs. sysfs, just saying that devlink doesn't
fit here.

> As far as the binding that is the driver interface which is
> more or less grandfathered in anyway as there aren't too many ways to
> deal with them as there isn't an alternate interface for the drivers
> to define support.
>
> > >
> > > > So the idea to use devlink just because mlx5 supports it, sound really
> > > > wrong to me. If it was other driver from another subsystem without
> > > > devlink support, the request to use devlink won't never come.
> > > >
> > > > Thanks
> > >
> > > I am suggesting the devlink resources interface because it would be a
> > > VERY good fit for something like this. By the definition of it:
> > > ``devlink`` provides the ability for drivers to register resources, which
> > > can allow administrators to see the device restrictions for a given
> > > resource, as well as how much of the given resource is currently
> > > in use. Additionally, these resources can optionally have configurable size.
> > > This could enable the administrator to limit the number of resources that
> > > are used.
> >
> > It is not resource, but HW objects. The devlink doesn't even see the VFs
> > as long as they are not bound to the drivers.
> >
> > This is an example:
> >
> > [root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_drivers_autoprobe
> > [root@vm ~]# echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
> > [ 2370.579711] mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3)
> > [root@vm ~]# echo 2 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
> > [ 2377.663666] mlx5_core 0000:01:00.0: E-Switch: Enable: mode(LEGACY), nvfs(2), active vports(3)
> > [ 2377.777010] pci 0000:01:00.1: [15b3:101c] type 00 class 0x020000
> > [ 2377.784903] pci 0000:01:00.2: [15b3:101c] type 00 class 0x020000
> > [root@vm ~]# devlink dev
> > pci/0000:01:00.0
> > [root@vm ~]# lspci |grep nox
> > 01:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> > 01:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
> > 01:00.2 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
> >
> > So despite us having 2 VFs ready to be given to VMs, administrator doesn't
> > see them as devices.
>
> The MSI-X vectors are a resource assigned to hardware objects. It just
> depends on how you want to look at things. Right now you have the VFs
> register an interface on behalf of the PF. I am arguing it would be
> better to have the PF register an interface on behalf of the VFs.
> Ultimately the PF is responsible for creating the VFs in the first
> place. I don't see it as that much of a leap to have the
> mlx5_sriov_enable call register interfaces for the VFs so that you can
> configure the MSI-X vectors from the PF, and then tear them down
> before it frees the VFs. Having the VFs do the work seems error prone
> since it is assuming the interfaces are there on the PF when in all
> cases but one (mlx5) it currently isn't.

I'm not sure that I understood your last sentence correctly. If VF device
is not on hypervisor, it will mean the device is already probed and
change of MSI-X table is prohibited. I don't know how you can configure
VF devices to be passthrough to the VMs without SR-IOV enable call first.

I would say that all devices start their life at the same place where PF
is located.

>
> > >
> > > Even looking over the example usage I don't see there being much to
> > > prevent you from applying it to this issue. In addition it has the
> > > idea of handling changes that cannot be immediately applied already
> > > included. Your current solution doesn't have a good way of handling
> > > that and instead just aborts with an error.
> >
> > Yes, because it is HW resource that should be applied immediately to
> > make sure that it is honored, before it is committed to the users.
>
> The problem is you cannot do that at all if the driver is already
> loaded. One advantage of using something like devlink is that you
> could potentially have the VF driver help to coordinate things so you
> could have the case where the VF has the mlx5 driver loaded work
> correctly where you would update the MSI-X vector count and then
> trigger the driver reload via devlink.

The thing is that it is not needed for VFs at all.

>
> > It is very tempting to use devlink everywhere, but it is really wrong
> > tool for this scenario.
>
> We can agree to disagree there. I am not a fan of sysfs being applied
> everywhere either. The problem is it is an easy goto when someone is
> looking for a quick and dirty solution and often leads to more
> problems later as it usually misses critical path locking issues and
> the like.

It is fun that you mentioned that devlink as an example of good locking scheme.
Without going into to much details, right now Parav and myself are trying to fix
devlink locking around reload functionality. It was close to DOA for me when I
worked on auxiliary bus patches.

So no, devlink is not better. It is another (good) tool that needs more love
and care to be real PCI configuration utility. At lest, it should step out of
netdev shadow.

The block subsystem built whole stack around sysfs and they doesn't seem
upset about it.

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-18  7:20                                           ` Leon Romanovsky
@ 2021-01-18 13:28                                             ` Leon Romanovsky
  2021-01-18 18:21                                               ` Alexander Duyck
  0 siblings, 1 reply; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-18 13:28 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jason Gunthorpe, Alex Williamson, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Mon, Jan 18, 2021 at 09:20:08AM +0200, Leon Romanovsky wrote:
> On Sun, Jan 17, 2021 at 07:16:30PM -0800, Alexander Duyck wrote:
> > On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote:
> > > > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
> > > > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:

<...>

> > If you want yet another compromise I would be much happier with the PF
> > registering the sysfs interfaces on the VFs rather than the VFs
> > registering the interface and hoping the PF supports it. At least with
> > that you are guaranteed the PF will respond to the interface when it
> > is registered.
>
> Thanks a lot, I appreciate it, will take a look now.

I found only two solutions to implement it in this way.
Option 1.
Allow multi entry write to some new sysfs knob that will receive BDF (or another VF
identification) and vector count. Something like this:

 echo "0000:01:00.2 123" > sriov_vf_msix_count

From one side, that solution is unlikely to be welcomed by Greg KH and from another,
it will require a lot of boilerplate code to make it safe and correct.

Option 2.
Create directory under PF device with files writable and organized by VF numbers.
It is doable, but will cause to code bloat with no gain at all. Cleaner than now,
it won't be.

Why the current approach with one file per-proper VF device is not good enough?

Thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-16  4:32                                   ` Alexander Duyck
@ 2021-01-18 15:47                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2021-01-18 15:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alex Williamson, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Fri, Jan 15, 2021 at 08:32:19PM -0800, Alexander Duyck wrote:
> On Fri, Jan 15, 2021 at 6:06 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
> >
> > > That said, it only works at the driver level. So if the firmware is
> > > the one that is having to do this it also occured to me that if this
> > > update happened on FLR that would probably be preferred.
> >
> > FLR is not free, I'd prefer not to require it just for some
> > philosophical reason.
> 
> It wasn't so much a philosophical thing as the fact that it can sort
> of take the place as a reload. 

Asserting no driver is present and doing some SW-only "FLR" is pretty
much the same thing.

We can't issue FLR unless no driver is present anyhow, so really all
this does is add a useless step. If some HW needs FLR then it can do
it in here, but I don't see a value to inject it when not needed. 

Yes, if we were PCI-SIG we'd probably insist that a FLR be done, but
we are not PCI-SIG, this is just Linux, and asserting there are no
users of the MSI is sufficient.

> However looking over the mlx5 code I don't see any handling of FLR
> in there so I am assuming that is handled by the firmware.

The device does the device side of the FLR, the mlx5 driver should
trigger FLR during error recovery flows.

> It is about the setup of things. The sysfs existing in the VF is kind
> of ugly since it is a child device calling up to the parent and
> telling it how it is supposed to be configured. 

Well, the logical place to put that sysfs file is under the VF,
otherwise it becomes ugly in a different way. I agree it would be
nicer if the file only existed when the right driver is loaded, and
there was a better way to get from the PF to VF.

> I'm sure in theory we could probably even have the VF request
> something like that itself through some sort of mailbox and cut out
> the middle-man but that would be even uglier.

No, not ever. The VF is in a security domain that can't make those
kinds of changes to itself.

> In my mind it was the PF driver providing a devlink instance for the
> VF if a driver isn't loaded.

I think hacking up devlink to provide dummy devlink objects for VFs
that otherwise wouldn't exist and then ensuring handover to/from real
drivers that might want those objects natively, just for the sake of
using devlink to instead of the existing PCI sysfs is major overkill.

If we are even thinking of moving PCI to devlink I'd want to see
devlink taken out of net and a whole devlink PCI subsystem
infrastructure created to manage all this sanely.

Hacking a subystem into devlink on the side with some small niche
feature is not the way to approach such fundamental things.

I also don't know if PCI will get much value from netlinkification, or
if devlink is even the right netlink representation for PCI in the
first place.

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-10 15:07 ` [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors Leon Romanovsky
  2021-01-11 19:30   ` Alexander Duyck
@ 2021-01-18 17:03   ` Greg KH
  2021-01-19  5:38     ` Leon Romanovsky
  1 sibling, 1 reply; 49+ messages in thread
From: Greg KH @ 2021-01-18 17:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson

On Sun, Jan 10, 2021 at 05:07:24PM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Some SR-IOV capable devices provide an ability to configure specific
> number of MSI-X vectors on their VF prior driver is probed on that VF.
> 
> In order to make management easy, provide new read-only sysfs file that
> returns a total number of possible to configure MSI-X vectors.
> 
> cat /sys/bus/pci/devices/.../sriov_vf_total_msix
>   = 0 - feature is not supported
>   > 0 - total number of MSI-X vectors to consume by the VFs
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
>  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
>  drivers/pci/pci.h                       |  3 +++
>  include/linux/pci.h                     |  2 ++
>  4 files changed, 50 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 05e26e5da54e..64e9b700acc9 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -395,3 +395,17 @@ Description:
>  		The file is writable if the PF is bound to a driver that
>  		supports the ->sriov_set_msix_vec_count() callback and there
>  		is no driver bound to the VF.
> +
> +What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
> +Date:		January 2021
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with the SR-IOV PFs.
> +		It returns a total number of possible to configure MSI-X
> +		vectors on the enabled VFs.
> +
> +		The values returned are:
> +		 * > 0 - this will be total number possible to consume by VFs,
> +		 * = 0 - feature is not supported
> +
> +		If no SR-IOV VFs are enabled, this value will return 0.
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 42c0df4158d1..0a6ddf3230fd 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
>  	return count;
>  }
> 
> +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> +					struct device_attribute *attr,
> +					char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);

Nit, please use sysfs_emit() for new sysfs files.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-18 13:28                                             ` Leon Romanovsky
@ 2021-01-18 18:21                                               ` Alexander Duyck
  2021-01-19  5:43                                                 ` Leon Romanovsky
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Duyck @ 2021-01-18 18:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Alex Williamson, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Mon, Jan 18, 2021 at 5:28 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Jan 18, 2021 at 09:20:08AM +0200, Leon Romanovsky wrote:
> > On Sun, Jan 17, 2021 at 07:16:30PM -0800, Alexander Duyck wrote:
> > > On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote:
> > > > > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > >
> > > > > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
> > > > > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
>
> <...>
>
> > > If you want yet another compromise I would be much happier with the PF
> > > registering the sysfs interfaces on the VFs rather than the VFs
> > > registering the interface and hoping the PF supports it. At least with
> > > that you are guaranteed the PF will respond to the interface when it
> > > is registered.
> >
> > Thanks a lot, I appreciate it, will take a look now.
>
> I found only two solutions to implement it in this way.
> Option 1.
> Allow multi entry write to some new sysfs knob that will receive BDF (or another VF
> identification) and vector count. Something like this:
>
>  echo "0000:01:00.2 123" > sriov_vf_msix_count
>
> From one side, that solution is unlikely to be welcomed by Greg KH and from another,
> it will require a lot of boilerplate code to make it safe and correct.

You are overthinking this. I didn't say the sysfs had to be in the PF
directory itself. My request was that the PF is what placed the sysfs
file in the directory since indirectly it is responsible for spawning
the VF anyway it shouldn't be too much of a lift to have the PF place
sysfs files in the VF hierarchy.

The main piece I am not a fan of is the fact that the VF is blindly
registering an interface and presenting it without knowing if it even
works.

The secondary issue that I see as important, but I am willing to
compromise on is that the interface makes it appear as though the VF
configuration space is writable via this sysfs file. My preference
would be to somehow make it transparent that the PF is providing this
functionality. I thought it might be easier to do with devlink rather
than with sysfs which is why I have been preferring devlink. However
based on your pushback I am willing to give up on that, but I think we
still need to restructure how the sysfs is being managed.

> Option 2.
> Create directory under PF device with files writable and organized by VF numbers.
> It is doable, but will cause to code bloat with no gain at all. Cleaner than now,
> it won't be.
>
> Why the current approach with one file per-proper VF device is not good enough?

Because it is muddying the waters in terms of what is control taking
place from the VF versus the PF. In my mind the ideal solution if you
insist on going with the VF sysfs route would be to look at spawning a
directory inside the VF sysfs specifically for all of the instances
that will be PF management controls. At least that would give some
hint that this is a backdoor control and not actually interacting with
the VF PCI device directly. Then if in the future you have to add more
to this you have a spot already laid out and the controls won't be
mistaken for standard PCI controls as they are PF management controls.

In addition you could probably even create a directory on the PF with
the new control you had added for getting the master count as well as
look at adding symlinks to the VF files so that you could manage all
of the resources in one spot. That would result in the controls being
nicely organized and easy to use.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-18 17:03   ` Greg KH
@ 2021-01-19  5:38     ` Leon Romanovsky
  0 siblings, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-19  5:38 UTC (permalink / raw)
  To: Greg KH
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, netdev, Don Dutile, Alex Williamson

On Mon, Jan 18, 2021 at 06:03:22PM +0100, Greg KH wrote:
> On Sun, Jan 10, 2021 at 05:07:24PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Some SR-IOV capable devices provide an ability to configure specific
> > number of MSI-X vectors on their VF prior driver is probed on that VF.
> >
> > In order to make management easy, provide new read-only sysfs file that
> > returns a total number of possible to configure MSI-X vectors.
> >
> > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> >   = 0 - feature is not supported
> >   > 0 - total number of MSI-X vectors to consume by the VFs
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-pci | 14 +++++++++++
> >  drivers/pci/iov.c                       | 31 +++++++++++++++++++++++++
> >  drivers/pci/pci.h                       |  3 +++
> >  include/linux/pci.h                     |  2 ++
> >  4 files changed, 50 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > index 05e26e5da54e..64e9b700acc9 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > @@ -395,3 +395,17 @@ Description:
> >  		The file is writable if the PF is bound to a driver that
> >  		supports the ->sriov_set_msix_vec_count() callback and there
> >  		is no driver bound to the VF.
> > +
> > +What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
> > +Date:		January 2021
> > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > +Description:
> > +		This file is associated with the SR-IOV PFs.
> > +		It returns a total number of possible to configure MSI-X
> > +		vectors on the enabled VFs.
> > +
> > +		The values returned are:
> > +		 * > 0 - this will be total number possible to consume by VFs,
> > +		 * = 0 - feature is not supported
> > +
> > +		If no SR-IOV VFs are enabled, this value will return 0.
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 42c0df4158d1..0a6ddf3230fd 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
> >  	return count;
> >  }
> >
> > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > +					struct device_attribute *attr,
> > +					char *buf)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
>
> Nit, please use sysfs_emit() for new sysfs files.

I'll do, thanks.

>
> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
  2021-01-18 18:21                                               ` Alexander Duyck
@ 2021-01-19  5:43                                                 ` Leon Romanovsky
  0 siblings, 0 replies; 49+ messages in thread
From: Leon Romanovsky @ 2021-01-19  5:43 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jason Gunthorpe, Alex Williamson, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile

On Mon, Jan 18, 2021 at 10:21:03AM -0800, Alexander Duyck wrote:
> On Mon, Jan 18, 2021 at 5:28 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Mon, Jan 18, 2021 at 09:20:08AM +0200, Leon Romanovsky wrote:
> > > On Sun, Jan 17, 2021 at 07:16:30PM -0800, Alexander Duyck wrote:
> > > > On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote:
> > > > > > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > >
> > > > > > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote:
> > > > > > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote:
> >
> > <...>
> >
> > > > If you want yet another compromise I would be much happier with the PF
> > > > registering the sysfs interfaces on the VFs rather than the VFs
> > > > registering the interface and hoping the PF supports it. At least with
> > > > that you are guaranteed the PF will respond to the interface when it
> > > > is registered.
> > >
> > > Thanks a lot, I appreciate it, will take a look now.
> >
> > I found only two solutions to implement it in this way.
> > Option 1.
> > Allow multi entry write to some new sysfs knob that will receive BDF (or another VF
> > identification) and vector count. Something like this:
> >
> >  echo "0000:01:00.2 123" > sriov_vf_msix_count
> >
> > From one side, that solution is unlikely to be welcomed by Greg KH and from another,
> > it will require a lot of boilerplate code to make it safe and correct.
>
> You are overthinking this. I didn't say the sysfs had to be in the PF
> directory itself. My request was that the PF is what placed the sysfs
> file in the directory since indirectly it is responsible for spawning
> the VF anyway it shouldn't be too much of a lift to have the PF place
> sysfs files in the VF hierarchy.
>
> The main piece I am not a fan of is the fact that the VF is blindly
> registering an interface and presenting it without knowing if it even
> works.
>
> The secondary issue that I see as important, but I am willing to
> compromise on is that the interface makes it appear as though the VF
> configuration space is writable via this sysfs file. My preference
> would be to somehow make it transparent that the PF is providing this
> functionality. I thought it might be easier to do with devlink rather
> than with sysfs which is why I have been preferring devlink. However
> based on your pushback I am willing to give up on that, but I think we
> still need to restructure how the sysfs is being managed.
>
> > Option 2.
> > Create directory under PF device with files writable and organized by VF numbers.
> > It is doable, but will cause to code bloat with no gain at all. Cleaner than now,
> > it won't be.
> >
> > Why the current approach with one file per-proper VF device is not good enough?
>
> Because it is muddying the waters in terms of what is control taking
> place from the VF versus the PF. In my mind the ideal solution if you
> insist on going with the VF sysfs route would be to look at spawning a
> directory inside the VF sysfs specifically for all of the instances
> that will be PF management controls. At least that would give some
> hint that this is a backdoor control and not actually interacting with
> the VF PCI device directly. Then if in the future you have to add more
> to this you have a spot already laid out and the controls won't be
> mistaken for standard PCI controls as they are PF management controls.
>
> In addition you could probably even create a directory on the PF with
> the new control you had added for getting the master count as well as
> look at adding symlinks to the VF files so that you could manage all
> of the resources in one spot. That would result in the controls being
> nicely organized and easy to use.

Thanks, for you inputs.

I'll try offline different variants and will post v4 soon.

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2021-01-19  6:22 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-10 15:07 [PATCH mlx5-next v1 0/5] Dynamically assign MSI-X vectors count Leon Romanovsky
2021-01-10 15:07 ` [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
2021-01-11 19:30   ` Alexander Duyck
2021-01-12  3:25     ` Don Dutile
2021-01-12  6:15       ` Leon Romanovsky
2021-01-12  6:39     ` Leon Romanovsky
2021-01-12 21:59       ` Alexander Duyck
2021-01-13  6:09         ` Leon Romanovsky
2021-01-13 20:00           ` Alexander Duyck
2021-01-14  7:16             ` Leon Romanovsky
2021-01-14 16:51               ` Alexander Duyck
2021-01-14 17:12                 ` Leon Romanovsky
2021-01-13 17:50   ` Alex Williamson
2021-01-13 19:49     ` Leon Romanovsky
2021-01-10 15:07 ` [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors Leon Romanovsky
2021-01-11 19:30   ` Alexander Duyck
2021-01-12  6:56     ` Leon Romanovsky
2021-01-12 21:34       ` Alexander Duyck
2021-01-13  6:19         ` Leon Romanovsky
2021-01-13 22:44           ` Alexander Duyck
2021-01-14  6:50             ` Leon Romanovsky
2021-01-14 16:40               ` Alexander Duyck
2021-01-14 16:48                 ` Jason Gunthorpe
2021-01-14 17:55                   ` Alexander Duyck
2021-01-14 18:29                     ` Jason Gunthorpe
2021-01-14 19:24                       ` Alexander Duyck
2021-01-14 19:56                         ` Leon Romanovsky
2021-01-14 20:08                         ` Jason Gunthorpe
2021-01-14 21:43                           ` Alexander Duyck
2021-01-14 23:28                             ` Alex Williamson
2021-01-15  1:56                               ` Alexander Duyck
2021-01-15 14:06                                 ` Jason Gunthorpe
2021-01-15 15:53                                   ` Leon Romanovsky
2021-01-16  1:48                                     ` Alexander Duyck
2021-01-16  8:20                                       ` Leon Romanovsky
2021-01-18  3:16                                         ` Alexander Duyck
2021-01-18  7:20                                           ` Leon Romanovsky
2021-01-18 13:28                                             ` Leon Romanovsky
2021-01-18 18:21                                               ` Alexander Duyck
2021-01-19  5:43                                                 ` Leon Romanovsky
2021-01-16  4:32                                   ` Alexander Duyck
2021-01-18 15:47                                     ` Jason Gunthorpe
2021-01-15 13:51                             ` Jason Gunthorpe
2021-01-14 17:26                 ` Leon Romanovsky
2021-01-18 17:03   ` Greg KH
2021-01-19  5:38     ` Leon Romanovsky
2021-01-10 15:07 ` [PATCH mlx5-next v1 3/5] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
2021-01-10 15:07 ` [PATCH mlx5-next v1 4/5] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
2021-01-10 15:07 ` [PATCH mlx5-next v1 5/5] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).