netdev.vger.kernel.org archive mirror
* [PATCH mlx5-next v5 0/4] Dynamically assign MSI-X vectors count
@ 2021-01-26  8:57 Leon Romanovsky
  2021-01-26  8:57 ` [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Leon Romanovsky @ 2021-01-26  8:57 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

Changelog
v5:
 * Patch 1:
  * Added forgotten "inline" keyword when declaring empty functions.
v4: https://lore.kernel.org/linux-pci/20210124131119.558563-1-leon@kernel.org
 * Used sysfs_emit() instead of sprintf() in new sysfs entries.
 * Changed EXPORT_SYMBOL to be EXPORT_SYMBOL_GPL for pci_iov_virtfn_devfn().
 * Rewrote the sysfs registration code to be driven by a PF that wants to
   enable the VF overlay, instead of creating it for all SR-IOV devices.
 * Grouped all such functionality under new "vfs_overlay" folder.
 * Combined two PCI patches into one.
v3: https://lore.kernel.org/linux-pci/20210117081548.1278992-1-leon@kernel.org
 * Renamed pci_set_msix_vec_count to be pci_vf_set_msix_vec_count.
 * Added VF msix_cap check to hide sysfs entry if device doesn't support msix.
 * Changed "-" to be ":" in the mlx5 patch to silence CI warnings about missing
   kdoc description.
 * Split the error print in the mlx5 driver differently to avoid a checkpatch
   warning.
v2: https://lore.kernel.org/linux-pci/20210114103140.866141-1-leon@kernel.org
 * Patch 1:
  * Renamed vf_msix_vec sysfs knob to be sriov_vf_msix_count
  * Added PF and VF device locks during set MSI-X call to protect from parallel
    driver bind/unbind operations.
  * Removed extra checks when reading sriov_vf_msix, because users will
    be able to distinguish between supported/not supported by looking at
    the sriov_vf_total_msix count.
  * Changed all occurrences of "numb" to be "count"
  * Changed returned error from EOPNOTSUPP to be EBUSY if user tries to set
    MSI-X count after driver already bound to the VF.
  * Added extra comment in pci_set_msix_vec_count() to emphasize that driver
    should not be bound.
 * Patch 2:
  * Changed vf_total_msix from int to be u32 and updated function signatures
    accordingly.
  * Improved patch title
v1: https://lore.kernel.org/linux-pci/20210110150727.1965295-1-leon@kernel.org
 * Improved wording and commit messages of first PCI patch
 * Added extra PCI patch to provide total number of MSI-X vectors
 * Prohibited read of vf_msix_vec sysfs file if driver doesn't support write
 * Removed extra function definition in pci.h
v0: https://lore.kernel.org/linux-pci/20210103082440.34994-1-leon@kernel.org

--------------------------------------------------------------------
Hi,

The number of MSI-X vectors is a PCI property visible through lspci; the
field is read-only and is configured by the device.

Statically assigning the same number of MSI-X vectors to every VF does not
allow a newly created VF to be fully utilized, because at creation time the
device cannot know the future load and configuration of the VM where that
VF will be used.

VFs are created on the hypervisor and passed through to VMs that have
different properties (for example, the number of CPUs).

To overcome this inefficiency in the distribution of MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors, before the VF is initialized and bound to a driver.

Before this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

Configuration script:
1. Start fresh
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
modprobe -q -r mlx5_ib mlx5_core
2. Ensure that driver doesn't run and it is safe to change MSI-X
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_drivers_autoprobe
3. Load driver for the PF
modprobe mlx5_core
4. Configure one of the VFs with new number
echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
echo 21 > /sys/bus/pci/devices/0000\:08\:00.2/vfs_overlay/sriov_vf_msix_count

After this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=21 Masked-
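
For completeness, two more illustrative commands on top of the script above
(same PF at 0000:08:00.0 and VF at 0000:08:00.2 as in the example):

Read the total MSI-X vector pool the PF exposes for its VFs:
cat /sys/bus/pci/devices/0000\:08\:00.0/vfs_overlay/sriov_vf_total_msix

Reset the VF back to the device default vector count:
echo 0 > /sys/bus/pci/devices/0000\:08\:00.2/vfs_overlay/sriov_vf_msix_count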

Thanks

Leon Romanovsky (4):
  PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  net/mlx5: Add dynamic MSI-X capabilities bits
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Allow users to configure the number of MSI-X vectors

 Documentation/ABI/testing/sysfs-bus-pci       |  32 ++++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  16 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   6 +
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  72 +++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  59 +++++-
 drivers/pci/iov.c                             | 180 ++++++++++++++++++
 drivers/pci/msi.c                             |  47 +++++
 drivers/pci/pci.h                             |   4 +
 include/linux/mlx5/mlx5_ifc.h                 |  11 +-
 include/linux/pci.h                           |  10 +
 10 files changed, 434 insertions(+), 3 deletions(-)

--
2.29.2



* [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-26  8:57 [PATCH mlx5-next v5 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-01-26  8:57 ` Leon Romanovsky
  2021-02-02  7:02   ` Leon Romanovsky
  2021-02-02 18:06   ` Bjorn Helgaas
  2021-01-26  8:57 ` [PATCH mlx5-next v5 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 16+ messages in thread
From: Leon Romanovsky @ 2021-01-26  8:57 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

Extend the PCI sysfs interface with a new callback that allows configuration
of the number of MSI-X vectors for a specific SR-IOV VF. This is needed to
optimize the performance of newly bound devices by allocating the number of
vectors based on the administrator's knowledge of the targeted VM.

This is applicable to SR-IOV VFs because such devices allocate their MSI-X
table before they are passed through to the VMs, and the HW can't guess the
right number of vectors, so the HW allocates them statically and equally.

1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
file will be seen for the VFs and is writable as long as a driver is not
bound to the VF.

The values accepted are:
 * > 0 - this will be the number reported by the VF's MSI-X capability
 * < 0 - not valid
 * = 0 - will reset to the device default value

2) In order to make management easy, provide a new read-only sysfs file that
returns the total number of MSI-X vectors available for configuration.

cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
  = 0 - feature is not supported
  > 0 - total number of MSI-X vectors available for consumption by the VFs
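
A minimal sketch of how a PF driver is expected to wire this up (hypothetical
foo_* names; the actual mlx5 wiring is added in patches 3 and 4 of this
series):

static int foo_sriov_set_msix_vec_count(struct pci_dev *vf, int count)
{
	/* Program the VF's new MSI-X table size in a device-specific way;
	 * count == 0 means "restore the device default".
	 */
	return foo_hw_set_vf_msix_count(pci_physfn(vf), vf, count);
}

static int foo_sriov_configure(struct pci_dev *pf, int num_vfs)
{
	int ret;

	/* Advertise the total MSI-X pool shared by the VFs ... */
	pci_sriov_set_vf_total_msix(pf, foo_hw_total_vf_msix(pf));

	ret = pci_enable_sriov(pf, num_vfs);
	if (ret)
		return ret;

	/* ... and expose the "vfs_overlay" sysfs files for the PF and VFs */
	ret = pci_enable_vfs_overlay(pf);
	if (ret) {
		pci_disable_sriov(pf);
		return ret;
	}
	return num_vfs;
}

static struct pci_driver foo_driver = {
	/* ... */
	.sriov_configure	  = foo_sriov_configure,
	.sriov_set_msix_vec_count = foo_sriov_set_msix_vec_count,
};

The mirror image on the disable path is pci_disable_vfs_overlay() followed
by pci_disable_sriov().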

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/ABI/testing/sysfs-bus-pci |  32 +++++
 drivers/pci/iov.c                       | 180 ++++++++++++++++++++++++
 drivers/pci/msi.c                       |  47 +++++++
 drivers/pci/pci.h                       |   4 +
 include/linux/pci.h                     |  10 ++
 5 files changed, 273 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 25c9c39770c6..4d206ade5331 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -375,3 +375,35 @@ Description:
 		The value comes from the PCI kernel device state and can be one
 		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
 		The file is read only.
+
+What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
+Date:		January 2021
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with the SR-IOV VFs.
+		It allows configuration of the number of MSI-X vectors for
+		the VF. This is needed to optimize performance of newly bound
+		devices by allocating the number of vectors based on the
+		administrator knowledge of targeted VM.
+
+		The values accepted are:
+		 * > 0 - this will be number reported by the VF's MSI-X
+			 capability
+		 * < 0 - not valid
+		 * = 0 - will reset to the device default value
+
+		The file is writable if the PF is bound to a driver that
+		set sriov_vf_total_msix > 0 and there is no driver bound
+		to the VF.
+
+What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
+Date:		January 2021
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with the SR-IOV PFs.
+		It returns a total number of possible to configure MSI-X
+		vectors on the enabled VFs.
+
+		The values returned are:
+		 * > 0 - this will be total number possible to consume by VFs,
+		 * = 0 - feature is not supported
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4afd4ee4f7f0..3e95f835eba5 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * vf_id) & 0xff;
 }
+EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);

 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
@@ -157,6 +158,166 @@ int pci_iov_sysfs_link(struct pci_dev *dev,
 	return rc;
 }

+#ifdef CONFIG_PCI_MSI
+static ssize_t sriov_vf_msix_count_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int count = pci_msix_vec_count(pdev);
+
+	if (count < 0)
+		return count;
+
+	return sysfs_emit(buf, "%d\n", count);
+}
+
+static ssize_t sriov_vf_msix_count_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct pci_dev *vf_dev = to_pci_dev(dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	ret = pci_vf_set_msix_vec_count(vf_dev, val);
+	if (ret)
+		return ret;
+
+	return count;
+}
+static DEVICE_ATTR_RW(sriov_vf_msix_count);
+
+static ssize_t sriov_vf_total_msix_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sysfs_emit(buf, "%u\n", pdev->sriov->vf_total_msix);
+}
+static DEVICE_ATTR_RO(sriov_vf_total_msix);
+#endif
+
+static struct attribute *sriov_pf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_total_msix.attr,
+#endif
+	NULL,
+};
+
+static struct attribute *sriov_vf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_msix_count.attr,
+#endif
+	NULL,
+};
+
+static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (!pdev->msix_cap || !dev_is_pf(dev))
+		return 0;
+
+	return a->mode;
+}
+
+static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (!pdev->msix_cap || dev_is_pf(dev))
+		return 0;
+
+	return a->mode;
+}
+
+static const struct attribute_group sriov_pf_dev_attr_group = {
+	.attrs = sriov_pf_dev_attrs,
+	.is_visible = sriov_pf_attrs_are_visible,
+	.name = "vfs_overlay",
+};
+
+static const struct attribute_group sriov_vf_dev_attr_group = {
+	.attrs = sriov_vf_dev_attrs,
+	.is_visible = sriov_vf_attrs_are_visible,
+	.name = "vfs_overlay",
+};
+
+int pci_enable_vfs_overlay(struct pci_dev *dev)
+{
+	struct pci_dev *virtfn;
+	int id, ret;
+
+	if (!dev->is_physfn || !dev->sriov->num_VFs)
+		return 0;
+
+	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
+	if (ret)
+		return ret;
+
+	for (id = 0; id < dev->sriov->num_VFs; id++) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		ret = sysfs_create_group(&virtfn->dev.kobj,
+					 &sriov_vf_dev_attr_group);
+		if (ret)
+			goto out;
+	}
+	return 0;
+
+out:
+	while (id--) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
+	}
+	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
+
+void pci_disable_vfs_overlay(struct pci_dev *dev)
+{
+	struct pci_dev *virtfn;
+	int id;
+
+	if (!dev->is_physfn || !dev->sriov->num_VFs)
+		return;
+
+	id = dev->sriov->num_VFs;
+	while (id--) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
+	}
+	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
+}
+EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
+
 int pci_iov_add_virtfn(struct pci_dev *dev, int id)
 {
 	int i;
@@ -596,6 +757,7 @@ static void sriov_disable(struct pci_dev *dev)
 		sysfs_remove_link(&dev->dev.kobj, "dep_link");

 	iov->num_VFs = 0;
+	iov->vf_total_msix = 0;
 	pci_iov_set_numvfs(dev, 0);
 }

@@ -1054,6 +1216,24 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);

+/**
+ * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
+ * @dev: the PCI PF device
+ * @count: the total number of MSI-X vector to consume by the VFs
+ *
+ * Sets the number of MSI-X vectors that is possible to consume by the VFs.
+ * This interface is complimentary part of the pci_vf_set_msix_vec_count()
+ * that will be used to configure the required number on the VF.
+ */
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
+{
+	if (!dev->is_physfn)
+		return;
+
+	dev->sriov->vf_total_msix = count;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
+
 /**
  * pci_sriov_configure_simple - helper to configure SR-IOV
  * @dev: the PCI device
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 3162f88fe940..1022fe9e6efd 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -991,6 +991,53 @@ int pci_msix_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msix_vec_count);

+/**
+ * pci_vf_set_msix_vec_count - change the reported number of MSI-X vectors
+ * This function is applicable for SR-IOV VF because such devices allocate
+ * their MSI-X table before they will run on the VMs and HW can't guess the
+ * right number of vectors, so the HW allocates them statically and equally.
+ * @dev: VF device that is going to be changed
+ * @count: amount of MSI-X vectors
+ **/
+int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
+{
+	struct pci_dev *pdev = pci_physfn(dev);
+	int ret;
+
+	if (count < 0)
+		/*
+		 * We don't support negative numbers for now,
+		 * but maybe in the future it will make sense.
+		 */
+		return -EINVAL;
+
+	device_lock(&pdev->dev);
+	if (!pdev->driver) {
+		ret = -EOPNOTSUPP;
+		goto err_pdev;
+	}
+
+	device_lock(&dev->dev);
+	if (dev->driver) {
+		/*
+		 * Driver already probed this VF and configured itself
+		 * based on previously configured (or default) MSI-X vector
+		 * count. It is too late to change this field for this
+		 * specific VF.
+		 */
+		ret = -EBUSY;
+		goto err_dev;
+	}
+
+	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);
+
+err_dev:
+	device_unlock(&dev->dev);
+err_pdev:
+	device_unlock(&pdev->dev);
+	return ret;
+}
+
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
 			     int nvec, struct irq_affinity *affd, int flags)
 {
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5c59365092fa..2bd6560d91e2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;

 #ifdef CONFIG_PCI_MSI
 void pci_no_msi(void);
+int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count);
 #else
 static inline void pci_no_msi(void) { }
 #endif
@@ -326,6 +327,9 @@ struct pci_sriov {
 	u16		subsystem_device; /* VF subsystem device */
 	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
+	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
+					 * can consume
+					 */
 };

 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b32126d26997..24d118ad6e7b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -856,6 +856,8 @@ struct module;
  *		e.g. drivers/net/e100.c.
  * @sriov_configure: Optional driver callback to allow configuration of
  *		number of VFs to enable via sysfs "sriov_numvfs" file.
+ * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
+ *              exposed by the sysfs "vf_msix_vec" entry.
  * @err_handler: See Documentation/PCI/pci-error-recovery.rst
  * @groups:	Sysfs attribute groups.
  * @driver:	Driver model structure.
@@ -871,6 +873,7 @@ struct pci_driver {
 	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
 	void (*shutdown)(struct pci_dev *dev);
 	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
+	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
 	const struct pci_error_handlers *err_handler;
 	const struct attribute_group **groups;
 	struct device_driver	driver;
@@ -2059,6 +2062,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);

+int pci_enable_vfs_overlay(struct pci_dev *dev);
+void pci_disable_vfs_overlay(struct pci_dev *dev);
+
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);

@@ -2072,6 +2078,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
 int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
 resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count);

 /* Arch may override these (weak) */
 int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
@@ -2100,6 +2107,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id)
 }
 static inline void pci_iov_remove_virtfn(struct pci_dev *dev,
 					 int id) { }
+static inline int pci_enable_vfs_overlay(struct pci_dev *dev) { return 0; }
+static inline void pci_disable_vfs_overlay(struct pci_dev *dev) {}
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
 static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
 static inline int pci_vfs_assigned(struct pci_dev *dev)
@@ -2112,6 +2121,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
 { return 0; }
 static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
+static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
 #endif

 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
--
2.29.2



* [PATCH mlx5-next v5 2/4] net/mlx5: Add dynamic MSI-X capabilities bits
  2021-01-26  8:57 [PATCH mlx5-next v5 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-01-26  8:57 ` [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
@ 2021-01-26  8:57 ` Leon Romanovsky
  2021-01-26  8:57 ` [PATCH mlx5-next v5 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-01-26  8:57 ` [PATCH mlx5-next v5 4/4] net/mlx5: Allow users to configure the number of MSI-X vectors Leon Romanovsky
  3 siblings, 0 replies; 16+ messages in thread
From: Leon Romanovsky @ 2021-01-26  8:57 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

These new fields declare the number of MSI-X vectors that can be
allocated to a VF through PF configuration.

The value must be in the range defined by min_dynamic_vf_msix_table_size
and max_dynamic_vf_msix_table_size.

The driver should continue to query its MSI-X table size through the PCI
configuration header.
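
For illustration only, these fields are consumed later in the series with
the existing capability helpers (see mlx5_get_default_msix_vec_count() in
the next patch):

	total = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
	min   = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
	max   = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);

while the per-VF value itself is written through dynamic_msix_table_size in
a SET_HCA_CAP command targeting the VF's function id.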

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index b96f99f1198e..31e6eac67f51 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1657,7 +1657,16 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8	   reserved_at_6e0[0x10];
 	u8	   sf_base_id[0x10];
 
-	u8	   reserved_at_700[0x80];
+	u8	   reserved_at_700[0x8];
+	u8	   num_total_dynamic_vf_msix[0x18];
+	u8	   reserved_at_720[0x14];
+	u8	   dynamic_msix_table_size[0xc];
+	u8	   reserved_at_740[0xc];
+	u8	   min_dynamic_vf_msix_table_size[0x4];
+	u8	   reserved_at_750[0x4];
+	u8	   max_dynamic_vf_msix_table_size[0xc];
+
+	u8	   reserved_at_760[0x20];
 	u8	   vhca_tunnel_commands[0x40];
 	u8	   reserved_at_7c0[0x40];
 };
-- 
2.29.2



* [PATCH mlx5-next v5 3/4] net/mlx5: Dynamically assign MSI-X vectors count
  2021-01-26  8:57 [PATCH mlx5-next v5 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-01-26  8:57 ` [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
  2021-01-26  8:57 ` [PATCH mlx5-next v5 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
@ 2021-01-26  8:57 ` Leon Romanovsky
  2021-02-02 17:25   ` Bjorn Helgaas
  2021-01-26  8:57 ` [PATCH mlx5-next v5 4/4] net/mlx5: Allow users to configure the number of MSI-X vectors Leon Romanovsky
  3 siblings, 1 reply; 16+ messages in thread
From: Leon Romanovsky @ 2021-01-26  8:57 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

The number of MSI-X vectors is a PCI property visible through lspci; the
field is read-only and is configured by the device. Statically assigning
the same number of MSI-X vectors to every VF does not allow a newly created
VF to be fully utilized, because at creation time the device cannot know
the future load and configuration of the VM where that VF will be used.

To overcome this inefficiency in the distribution of MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors.

Such a change immediately increases the number of MSI-X vectors for a
system with @ VFs from 12 vectors per VF to 32 vectors per VF.

Before this patch:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

After this patch:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=32 Masked-
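
To make the default distribution concrete: mlx5_get_default_msix_vec_count()
below computes max(min(total / num_vfs, max / 2), min). With purely
illustrative capability values (not the real ConnectX-5 numbers) of
total = 256, min = 2 and max = 64, enabling 2 VFs gives
max(min(256 / 2, 64 / 2), 2) = 32 vectors per VF, while enabling 16 VFs
gives max(min(256 / 16, 32), 2) = 16; the max / 2 clamp intentionally leaves
part of the pool free so a VF can later be resized via sysfs without
shrinking the others.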

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    |  4 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  5 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 72 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 13 +++-
 4 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index ca6f2fc39ea0..79cfcc844156 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -567,6 +567,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
 
+	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
+		MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix,
+			 MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix));
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 0a0302ce7144..5babb4434a87 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -172,6 +172,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
 int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
+
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
+			    int msix_vec_count);
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
+
 struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 6fd974920394..2a35888fcff0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -55,6 +55,78 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev *dev, int vecidx)
 	return &irq_table->irq[vecidx];
 }
 
+/**
+ * mlx5_get_default_msix_vec_count() - Get defaults of number of MSI-X vectors
+ * to be set
+ * @dev: PF to work on
+ * @num_vfs: Number of VFs was asked when SR-IOV was enabled
+ **/
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
+{
+	int num_vf_msix, min_msix, max_msix;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	/* Limit maximum number of MSI-X to leave some of them free in the
+	 * pool and ready to be assigned by the users without need to resize
+	 * other Vfs.
+	 */
+	return max(min(num_vf_msix / num_vfs, max_msix / 2), min_msix);
+}
+
+/**
+ * mlx5_set_msix_vec_count() - Set dynamically allocated MSI-X to the VF
+ * @dev: PF to work on
+ * @function_id: Internal PCI VF function id
+ * @msix_vec_count: Number of MSI-X to set
+ **/
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
+			    int msix_vec_count)
+{
+	int sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
+	int num_vf_msix, min_msix, max_msix;
+	void *hca_cap, *cap;
+	int ret;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	if (!MLX5_CAP_GEN(dev, vport_group_manager) || !mlx5_core_is_pf(dev))
+		return -EOPNOTSUPP;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	if (msix_vec_count < min_msix)
+		return -EINVAL;
+
+	if (msix_vec_count > max_msix)
+		return -EOVERFLOW;
+
+	hca_cap = kzalloc(sz, GFP_KERNEL);
+	if (!hca_cap)
+		return -ENOMEM;
+
+	cap = MLX5_ADDR_OF(set_hca_cap_in, hca_cap, capability);
+	MLX5_SET(cmd_hca_cap, cap, dynamic_msix_table_size, msix_vec_count);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, opcode, MLX5_CMD_OP_SET_HCA_CAP);
+	MLX5_SET(set_hca_cap_in, hca_cap, other_function, 1);
+	MLX5_SET(set_hca_cap_in, hca_cap, function_id, function_id);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1);
+	ret = mlx5_cmd_exec_in(dev, set_hca_cap, hca_cap);
+	kfree(hca_cap);
+	return ret;
+}
+
 int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 3094d20297a9..f0ec86a1c8a6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -71,8 +71,7 @@ static int sriov_restore_guids(struct mlx5_core_dev *dev, int vf)
 static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 {
 	struct mlx5_core_sriov *sriov = &dev->priv.sriov;
-	int err;
-	int vf;
+	int err, vf, num_msix_count;
 
 	if (!MLX5_ESWITCH_MANAGER(dev))
 		goto enable_vfs_hca;
@@ -85,12 +84,22 @@ static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 	}
 
 enable_vfs_hca:
+	num_msix_count = mlx5_get_default_msix_vec_count(dev, num_vfs);
 	for (vf = 0; vf < num_vfs; vf++) {
 		err = mlx5_core_enable_hca(dev, vf + 1);
 		if (err) {
 			mlx5_core_warn(dev, "failed to enable VF %d (%d)\n", vf, err);
 			continue;
 		}
+
+		err = mlx5_set_msix_vec_count(dev, vf + 1, num_msix_count);
+		if (err) {
+			mlx5_core_warn(dev,
+				       "failed to set MSI-X vector counts VF %d, err %d\n",
+				       vf, err);
+			continue;
+		}
+
 		sriov->vfs_ctx[vf].enabled = 1;
 		if (MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_IB) {
 			err = sriov_restore_guids(dev, vf);
-- 
2.29.2



* [PATCH mlx5-next v5 4/4] net/mlx5: Allow users to configure the number of MSI-X vectors
  2021-01-26  8:57 [PATCH mlx5-next v5 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (2 preceding siblings ...)
  2021-01-26  8:57 ` [PATCH mlx5-next v5 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-01-26  8:57 ` Leon Romanovsky
  3 siblings, 0 replies; 16+ messages in thread
From: Leon Romanovsky @ 2021-01-26  8:57 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

Implement the ability to configure the number of MSI-X vectors for SR-IOV
VFs: advertise the total vector pool to the PCI core with
pci_sriov_set_vf_total_msix(), enable the "vfs_overlay" sysfs files, and
wire the new sriov_set_msix_vec_count() driver callback to
mlx5_set_msix_vec_count().
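
With this in place the per-VF count can be changed from user space and
verified through lspci. An illustrative check (the value 21 is arbitrary,
the addresses match the cover-letter example, and no driver may be bound to
the VF):

echo 21 > /sys/bus/pci/devices/0000\:08\:00.2/vfs_overlay/sriov_vf_msix_count
lspci -vs 0000:08:00.2 | grep MSI-X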

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    | 12 +++++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  1 +
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 46 +++++++++++++++++++
 3 files changed, 59 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 79cfcc844156..228765c38cf8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1395,6 +1395,14 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_load_one;
 	}
 
+	err = pci_enable_vfs_overlay(pdev);
+	if (err) {
+		mlx5_core_err(dev,
+			      "pci_enable_vfs_overlay failed with error code %d\n",
+			      err);
+		goto err_vfs_overlay;
+	}
+
 	err = mlx5_crdump_enable(dev);
 	if (err)
 		dev_err(&pdev->dev, "mlx5_crdump_enable failed with error code %d\n", err);
@@ -1403,6 +1411,8 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	devlink_reload_enable(devlink);
 	return 0;
 
+err_vfs_overlay:
+	mlx5_unload_one(dev, true);
 err_load_one:
 	mlx5_pci_close(dev);
 pci_init_err:
@@ -1422,6 +1432,7 @@ static void remove_one(struct pci_dev *pdev)
 
 	devlink_reload_disable(devlink);
 	mlx5_crdump_disable(dev);
+	pci_disable_vfs_overlay(pdev);
 	mlx5_drain_health_wq(dev);
 	mlx5_unload_one(dev, true);
 	mlx5_pci_close(dev);
@@ -1650,6 +1661,7 @@ static struct pci_driver mlx5_core_driver = {
 	.shutdown	= shutdown,
 	.err_handler	= &mlx5_err_handler,
 	.sriov_configure   = mlx5_core_sriov_configure,
+	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
 };
 
 static void mlx5_core_verify_params(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 5babb4434a87..8a2523d2d43a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -138,6 +138,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev);
 int mlx5_sriov_attach(struct mlx5_core_dev *dev);
 void mlx5_sriov_detach(struct mlx5_core_dev *dev);
 int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs);
+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count);
 int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_create_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index f0ec86a1c8a6..252aa44ffbe3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -144,6 +144,7 @@ mlx5_device_disable_sriov(struct mlx5_core_dev *dev, int num_vfs, bool clear_vf)
 static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs)
 {
 	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
+	u32 num_vf_msix;
 	int err;
 
 	err = mlx5_device_enable_sriov(dev, num_vfs);
@@ -152,11 +153,20 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs)
 		return err;
 	}
 
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	pci_sriov_set_vf_total_msix(pdev, num_vf_msix);
 	err = pci_enable_sriov(pdev, num_vfs);
 	if (err) {
 		mlx5_core_warn(dev, "pci_enable_sriov failed : %d\n", err);
 		mlx5_device_disable_sriov(dev, num_vfs, true);
 	}
+	err = pci_enable_vfs_overlay(pdev);
+	if (err) {
+		mlx5_core_warn(dev, "pci_enable_vfs_overlay failed : %d\n",
+			       err);
+		pci_disable_sriov(pdev);
+		mlx5_device_disable_sriov(dev, num_vfs, true);
+	}
 	return err;
 }
 
@@ -165,6 +175,7 @@ static void mlx5_sriov_disable(struct pci_dev *pdev)
 	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
 	int num_vfs = pci_num_vf(dev->pdev);
 
+	pci_disable_vfs_overlay(pdev);
 	pci_disable_sriov(pdev);
 	mlx5_device_disable_sriov(dev, num_vfs, true);
 }
@@ -187,6 +198,41 @@ int mlx5_core_sriov_configure(struct pci_dev *pdev, int num_vfs)
 	return err ? err : num_vfs;
 }
 
+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
+{
+	struct pci_dev *pf = pci_physfn(vf);
+	struct mlx5_core_sriov *sriov;
+	struct mlx5_core_dev *dev;
+	int num_vf_msix, id;
+
+	dev = pci_get_drvdata(pf);
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return -EOPNOTSUPP;
+
+	if (!msix_vec_count)
+		msix_vec_count =
+			mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf));
+
+	sriov = &dev->priv.sriov;
+
+	/* Reversed translation of PCI VF function number to the internal
+	 * function_id, which exists in the name of virtfn symlink.
+	 */
+	for (id = 0; id < pci_num_vf(pf); id++) {
+		if (!sriov->vfs_ctx[id].enabled)
+			continue;
+
+		if (vf->devfn == pci_iov_virtfn_devfn(pf, id))
+			break;
+	}
+
+	if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled)
+		return -EINVAL;
+
+	return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count);
+}
+
 int mlx5_sriov_attach(struct mlx5_core_dev *dev)
 {
 	if (!mlx5_core_is_pf(dev) || !pci_num_vf(dev->pdev))
-- 
2.29.2



* Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-26  8:57 ` [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
@ 2021-02-02  7:02   ` Leon Romanovsky
  2021-02-02 18:06   ` Bjorn Helgaas
  1 sibling, 0 replies; 16+ messages in thread
From: Leon Romanovsky @ 2021-02-02  7:02 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Jason Gunthorpe, Alexander Duyck, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson,
	David S . Miller

On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Extend PCI sysfs interface with a new callback that allows configure
> the number of MSI-X vectors for specific SR-IO VF. This is needed
> to optimize the performance of newly bound devices by allocating
> the number of vectors based on the administrator knowledge of targeted VM.
>
> This function is applicable for SR-IOV VF because such devices allocate
> their MSI-X table before they will run on the VMs and HW can't guess the
> right number of vectors, so the HW allocates them statically and equally.
>
> 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> file will be seen for the VFs and it is writable as long as a driver is not
> bounded to the VF.
>
> The values accepted are:
>  * > 0 - this will be number reported by the VF's MSI-X capability
>  * < 0 - not valid
>  * = 0 - will reset to the device default value
>
> 2) In order to make management easy, provide new read-only sysfs file that
> returns a total number of possible to configure MSI-X vectors.
>
> cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
>   = 0 - feature is not supported
>   > 0 - total number of MSI-X vectors to consume by the VFs
>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci |  32 +++++
>  drivers/pci/iov.c                       | 180 ++++++++++++++++++++++++
>  drivers/pci/msi.c                       |  47 +++++++
>  drivers/pci/pci.h                       |   4 +
>  include/linux/pci.h                     |  10 ++
>  5 files changed, 273 insertions(+)

Bjorn,

Can I please get your Acked-by on this so it will go through
mlx5-next -> netdev submission flow?

Thanks


* Re: [PATCH mlx5-next v5 3/4] net/mlx5: Dynamically assign MSI-X vectors count
  2021-01-26  8:57 ` [PATCH mlx5-next v5 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-02-02 17:25   ` Bjorn Helgaas
  2021-02-02 19:11     ` Leon Romanovsky
  0 siblings, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2021-02-02 17:25 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller

On Tue, Jan 26, 2021 at 10:57:29AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> The number of MSI-X vectors is PCI property visible through lspci, that
> field is read-only and configured by the device. The static assignment
> of an amount of MSI-X vectors doesn't allow utilize the newly created
> VF because it is not known to the device the future load and configuration
> where that VF will be used.
> 
> To overcome the inefficiency in the spread of such MSI-X vectors, we
> allow the kernel to instruct the device with the needed number of such
> vectors.
> 
> Such change immediately increases the amount of MSI-X vectors for the
> system with @ VFs from 12 vectors per-VF, to be 32 vectors per-VF.

Not knowing anything about mlx5, it looks like maybe this gets some
parameters from firmware on the device, then changes the way MSI-X
vectors are distributed among VFs?

I don't understand the implications above about "static assignment"
and "inefficiency in the spread."  I guess maybe this takes advantage
of the fact that you know how many VFs are enabled, so if NumVFs is
less than TotalVFs, you can assign more vectors to each VF?

If that's the case, spell it out a little bit.  The current text makes
it sound like you discovered brand new MSI-X vectors somewhere,
regardless of how many VFs are enabled, which doesn't sound right.

> Before this patch:
> [root@server ~]# lspci -vs 0000:08:00.2
> 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
> ....
> 	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-
> 
> After this patch:
> [root@server ~]# lspci -vs 0000:08:00.2
> 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
> ....
> 	Capabilities: [9c] MSI-X: Enable- Count=32 Masked-
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/main.c    |  4 ++
>  .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  5 ++
>  .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 72 +++++++++++++++++++
>  .../net/ethernet/mellanox/mlx5/core/sriov.c   | 13 +++-
>  4 files changed, 92 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index ca6f2fc39ea0..79cfcc844156 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -567,6 +567,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
>  	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
>  		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
>  
> +	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
> +		MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix,
> +			 MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix));
> +
>  	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
>  }
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> index 0a0302ce7144..5babb4434a87 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> @@ -172,6 +172,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
>  		       struct notifier_block *nb);
>  int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx,
>  		       struct notifier_block *nb);
> +
> +int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
> +			    int msix_vec_count);
> +int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
> +
>  struct cpumask *
>  mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
>  struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> index 6fd974920394..2a35888fcff0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> @@ -55,6 +55,78 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev *dev, int vecidx)
>  	return &irq_table->irq[vecidx];
>  }
>  
> +/**
> + * mlx5_get_default_msix_vec_count() - Get defaults of number of MSI-X vectors
> + * to be set

s/defaults of number of/default number of/
s/to be set/to be assigned to each VF/ ?

> + * @dev: PF to work on
> + * @num_vfs: Number of VFs was asked when SR-IOV was enabled

s/Number of VFs was asked when SR-IOV was enabled/Number of enabled VFs/ ?

> + **/

Documentation/doc-guide/kernel-doc.rst says kernel-doc comments end
with just "*/" (not "**/").

> +int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
> +{
> +	int num_vf_msix, min_msix, max_msix;
> +
> +	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
> +	if (!num_vf_msix)
> +		return 0;
> +
> +	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
> +	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
> +
> +	/* Limit maximum number of MSI-X to leave some of them free in the
> +	 * pool and ready to be assigned by the users without need to resize
> +	 * other Vfs.

s/number of MSI-X/number of MSI-X vectors/
s/Vfs/VFs/

> +	 */
> +	return max(min(num_vf_msix / num_vfs, max_msix / 2), min_msix);
> +}
> +
> +/**
> + * mlx5_set_msix_vec_count() - Set dynamically allocated MSI-X to the VF
> + * @dev: PF to work on
> + * @function_id: Internal PCI VF function id
> + * @msix_vec_count: Number of MSI-X to set

s/id/ID/
s/Number of MSI-X/Number of MSI-X vectors/

> + **/
> +int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
> +			    int msix_vec_count)
> +{
> +	int sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
> +	int num_vf_msix, min_msix, max_msix;
> +	void *hca_cap, *cap;
> +	int ret;
> +
> +	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
> +	if (!num_vf_msix)
> +		return 0;
> +
> +	if (!MLX5_CAP_GEN(dev, vport_group_manager) || !mlx5_core_is_pf(dev))
> +		return -EOPNOTSUPP;
> +
> +	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
> +	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
> +
> +	if (msix_vec_count < min_msix)
> +		return -EINVAL;
> +
> +	if (msix_vec_count > max_msix)
> +		return -EOVERFLOW;
> +
> +	hca_cap = kzalloc(sz, GFP_KERNEL);
> +	if (!hca_cap)
> +		return -ENOMEM;
> +
> +	cap = MLX5_ADDR_OF(set_hca_cap_in, hca_cap, capability);
> +	MLX5_SET(cmd_hca_cap, cap, dynamic_msix_table_size, msix_vec_count);
> +
> +	MLX5_SET(set_hca_cap_in, hca_cap, opcode, MLX5_CMD_OP_SET_HCA_CAP);
> +	MLX5_SET(set_hca_cap_in, hca_cap, other_function, 1);
> +	MLX5_SET(set_hca_cap_in, hca_cap, function_id, function_id);
> +
> +	MLX5_SET(set_hca_cap_in, hca_cap, op_mod,
> +		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1);
> +	ret = mlx5_cmd_exec_in(dev, set_hca_cap, hca_cap);
> +	kfree(hca_cap);
> +	return ret;
> +}
> +
>  int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
>  		       struct notifier_block *nb)
>  {
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
> index 3094d20297a9..f0ec86a1c8a6 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
> @@ -71,8 +71,7 @@ static int sriov_restore_guids(struct mlx5_core_dev *dev, int vf)
>  static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
>  {
>  	struct mlx5_core_sriov *sriov = &dev->priv.sriov;
> -	int err;
> -	int vf;
> +	int err, vf, num_msix_count;
>  
>  	if (!MLX5_ESWITCH_MANAGER(dev))
>  		goto enable_vfs_hca;
> @@ -85,12 +84,22 @@ static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
>  	}
>  
>  enable_vfs_hca:
> +	num_msix_count = mlx5_get_default_msix_vec_count(dev, num_vfs);
>  	for (vf = 0; vf < num_vfs; vf++) {
>  		err = mlx5_core_enable_hca(dev, vf + 1);
>  		if (err) {
>  			mlx5_core_warn(dev, "failed to enable VF %d (%d)\n", vf, err);
>  			continue;
>  		}
> +
> +		err = mlx5_set_msix_vec_count(dev, vf + 1, num_msix_count);
> +		if (err) {
> +			mlx5_core_warn(dev,
> +				       "failed to set MSI-X vector counts VF %d, err %d\n",
> +				       vf, err);
> +			continue;
> +		}
> +
>  		sriov->vfs_ctx[vf].enabled = 1;
>  		if (MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_IB) {
>  			err = sriov_restore_guids(dev, vf);
> -- 
> 2.29.2
> 


* Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-01-26  8:57 ` [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
  2021-02-02  7:02   ` Leon Romanovsky
@ 2021-02-02 18:06   ` Bjorn Helgaas
  2021-02-02 19:44     ` Leon Romanovsky
  1 sibling, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2021-02-02 18:06 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller

On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Extend PCI sysfs interface with a new callback that allows configure
> the number of MSI-X vectors for specific SR-IO VF. This is needed
> to optimize the performance of newly bound devices by allocating
> the number of vectors based on the administrator knowledge of targeted VM.

s/configure/configuration of/
s/SR-IO/SR-IOV/
s/newly bound/VFs/ ?
s/VF/VFs/
s/knowledge of targeted VM/knowledge of the intended use of the VF/
(I'm not a VF expert, but I think they can be used even without VMs)

I'm reading between the lines here, but IIUC the point is that you
have a PF that supports a finite number of MSI-X vectors for use by
all the VFs, and this interface is to control the distribution of
those MSI-X vectors among the VFs.

> This function is applicable for SR-IOV VF because such devices allocate
> their MSI-X table before they will run on the VMs and HW can't guess the
> right number of vectors, so the HW allocates them statically and equally.

This is written in a way that suggests this is behavior required by
the PCIe spec.  If it is indeed related to something in the spec,
please cite it.

But I think this is actually something device-specific, not something
we can derive directly from the spec.  If that's the case, be clear
that we're addressing a device-specific need, and we're hoping that
this will be useful for other devices as well.

"such devices allocate their MSI-X table before they will run on the
VMs": Let's be specific here.  This MSI-X Table allocation apparently
doesn't happen when we set VF Enable in the PF, because these sysfs
files are attached to the VFs, which don't exist yet.  It's not the VF
driver binding, because that's a software construct.  What is the
hardware event that triggers the allocation?

Obviously the distribution among VFs can be changed after VF Enable is
set.  Maybe the distribution is dynamic, and the important point is
that it must be changed before the VF driver reads the Message Control
register for Table Size?

But that isn't the same as "devices allocating their MSI-X table
before being passed through to a VM," so it's confusing.  The
language about allocating the MSI-X table needs to be made precise
here and in the code comments below.

"before they will run on the VMs": Devices don't "run on VMs".  I
think the usual terminology is that a device may be "passed through to
a VM".

"HW allocates them statically and equally" sounds like a description
of some device-specific behavior (unless there's something in the spec
that requires this, in which case you should cite it).  It's OK if
this is device-specific; just don't pretend that it's generic if it's
not.

> 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> file will be seen for the VFs and it is writable as long as a driver is not
> bounded to the VF.

"bound to the VF"

> The values accepted are:
>  * > 0 - this will be number reported by the VF's MSI-X capability

Specifically, I guess by Table Size in the VF's MSI-X Message Control
register?

>  * < 0 - not valid
>  * = 0 - will reset to the device default value
> 
> 2) In order to make management easy, provide new read-only sysfs file that
> returns a total number of possible to configure MSI-X vectors.
> 
> cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
>   = 0 - feature is not supported
>   > 0 - total number of MSI-X vectors to consume by the VFs

"total number of MSI-X vectors available for distribution among the
VFs"?

> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci |  32 +++++
>  drivers/pci/iov.c                       | 180 ++++++++++++++++++++++++
>  drivers/pci/msi.c                       |  47 +++++++
>  drivers/pci/pci.h                       |   4 +
>  include/linux/pci.h                     |  10 ++
>  5 files changed, 273 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 25c9c39770c6..4d206ade5331 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -375,3 +375,35 @@ Description:
>  		The value comes from the PCI kernel device state and can be one
>  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>  		The file is read only.
> +
> +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> +Date:		January 2021
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with the SR-IOV VFs.
> +		It allows configuration of the number of MSI-X vectors for
> +		the VF. This is needed to optimize performance of newly bound
> +		devices by allocating the number of vectors based on the
> +		administrator knowledge of targeted VM.
> +
> +		The values accepted are:
> +		 * > 0 - this will be number reported by the VF's MSI-X
> +			 capability
> +		 * < 0 - not valid
> +		 * = 0 - will reset to the device default value
> +
> +		The file is writable if the PF is bound to a driver that
> +		set sriov_vf_total_msix > 0 and there is no driver bound
> +		to the VF.
> +
> +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> +Date:		January 2021
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with the SR-IOV PFs.
> +		It returns a total number of possible to configure MSI-X
> +		vectors on the enabled VFs.
> +
> +		The values returned are:
> +		 * > 0 - this will be total number possible to consume by VFs,
> +		 * = 0 - feature is not supported

What does "vfs_overlay" mean?  "vfs" sounds like the Virtual File
System.

Do these filenames really need to contain both "sriov" and "vf"?

Should these be next to the existing SR-IOV sysfs files, i.e., in or
alongside sriov_dev_attr_group?

Hmmm, I see pci_enable_vfs_overlay() is called by the driver.  I don't
really like that because then we're dependent on drivers to maintain
the PCI sysfs hierarchy.  E.g., a driver might forget to call
pci_disable_vfs_overlay(), and then a future driver load will fail.

Maybe this could be done with .is_visible() functions that call driver
callbacks.

> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4afd4ee4f7f0..3e95f835eba5 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>  	return (dev->devfn + dev->sriov->offset +
>  		dev->sriov->stride * vf_id) & 0xff;
>  }
> +EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
> 
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> @@ -157,6 +158,166 @@ int pci_iov_sysfs_link(struct pci_dev *dev,
>  	return rc;
>  }
> 
> +#ifdef CONFIG_PCI_MSI
> +static ssize_t sriov_vf_msix_count_show(struct device *dev,
> +					struct device_attribute *attr,
> +					char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	int count = pci_msix_vec_count(pdev);
> +
> +	if (count < 0)
> +		return count;
> +
> +	return sysfs_emit(buf, "%d\n", count);
> +}
> +
> +static ssize_t sriov_vf_msix_count_store(struct device *dev,
> +					 struct device_attribute *attr,
> +					 const char *buf, size_t count)
> +{
> +	struct pci_dev *vf_dev = to_pci_dev(dev);
> +	int val, ret;
> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	ret = pci_vf_set_msix_vec_count(vf_dev, val);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +static DEVICE_ATTR_RW(sriov_vf_msix_count);
> +
> +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> +					struct device_attribute *attr,
> +					char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	return sysfs_emit(buf, "%u\n", pdev->sriov->vf_total_msix);
> +}
> +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> +#endif
> +
> +static struct attribute *sriov_pf_dev_attrs[] = {
> +#ifdef CONFIG_PCI_MSI
> +	&dev_attr_sriov_vf_total_msix.attr,
> +#endif
> +	NULL,
> +};
> +
> +static struct attribute *sriov_vf_dev_attrs[] = {
> +#ifdef CONFIG_PCI_MSI
> +	&dev_attr_sriov_vf_msix_count.attr,
> +#endif
> +	NULL,
> +};
> +
> +static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj,
> +					  struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (!pdev->msix_cap || !dev_is_pf(dev))
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> +					  struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (!pdev->msix_cap || dev_is_pf(dev))
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +static const struct attribute_group sriov_pf_dev_attr_group = {
> +	.attrs = sriov_pf_dev_attrs,
> +	.is_visible = sriov_pf_attrs_are_visible,
> +	.name = "vfs_overlay",
> +};
> +
> +static const struct attribute_group sriov_vf_dev_attr_group = {
> +	.attrs = sriov_vf_dev_attrs,
> +	.is_visible = sriov_vf_attrs_are_visible,
> +	.name = "vfs_overlay",
> +};
> +
> +int pci_enable_vfs_overlay(struct pci_dev *dev)
> +{
> +	struct pci_dev *virtfn;
> +	int id, ret;
> +
> +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> +		return 0;
> +
> +	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> +	if (ret)
> +		return ret;
> +
> +	for (id = 0; id < dev->sriov->num_VFs; id++) {
> +		virtfn = pci_get_domain_bus_and_slot(
> +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> +			pci_iov_virtfn_devfn(dev, id));
> +
> +		if (!virtfn)
> +			continue;
> +
> +		ret = sysfs_create_group(&virtfn->dev.kobj,
> +					 &sriov_vf_dev_attr_group);
> +		if (ret)
> +			goto out;
> +	}
> +	return 0;
> +
> +out:
> +	while (id--) {
> +		virtfn = pci_get_domain_bus_and_slot(
> +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> +			pci_iov_virtfn_devfn(dev, id));
> +
> +		if (!virtfn)
> +			continue;
> +
> +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> +	}
> +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
> +
> +void pci_disable_vfs_overlay(struct pci_dev *dev)
> +{
> +	struct pci_dev *virtfn;
> +	int id;
> +
> +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> +		return;
> +
> +	id = dev->sriov->num_VFs;
> +	while (id--) {
> +		virtfn = pci_get_domain_bus_and_slot(
> +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> +			pci_iov_virtfn_devfn(dev, id));
> +
> +		if (!virtfn)
> +			continue;
> +
> +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> +	}
> +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> +}
> +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);

I'm not convinced all this sysfs wrangling is necessary.  If it is,
add a hint in a comment about why this is special and can't use
something like sriov_dev_attr_group.

>  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>  {
>  	int i;
> @@ -596,6 +757,7 @@ static void sriov_disable(struct pci_dev *dev)
>  		sysfs_remove_link(&dev->dev.kobj, "dep_link");
> 
>  	iov->num_VFs = 0;
> +	iov->vf_total_msix = 0;
>  	pci_iov_set_numvfs(dev, 0);
>  }
> 
> @@ -1054,6 +1216,24 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> 
> +/**
> + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> + * @dev: the PCI PF device
> + * @count: the total number of MSI-X vector to consume by the VFs
> + *
> + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> + * This interface is complimentary part of the pci_vf_set_msix_vec_count()

s/Sets the/Set the/
s/complimentary part of the/complementary to/

> + * that will be used to configure the required number on the VF.
> + */
> +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> +{
> +	if (!dev->is_physfn)
> +		return;
> +
> +	dev->sriov->vf_total_msix = count;

The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
mlx5, calls this, and all the PCI core does is hang onto the value and
expose it via sysfs.  I think I'd rather have a callback in struct
pci_driver and let the driver supply the value when needed.  I.e.,
sriov_vf_total_msix_show() would call the callback instead of looking
at pdev->sriov->vf_total_msix.
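
Roughly something like this, just to illustrate the idea (the
->sriov_get_vf_total_msix() callback name is made up here, and the
question of locking around pdev->driver is ignored for now):

  static ssize_t sriov_vf_total_msix_show(struct device *dev,
                                          struct device_attribute *attr,
                                          char *buf)
  {
    struct pci_dev *pdev = to_pci_dev(dev);
    u32 count = 0;

    /* ask the bound PF driver instead of caching the value in the core */
    if (pdev->driver && pdev->driver->sriov_get_vf_total_msix)
      count = pdev->driver->sriov_get_vf_total_msix(pdev);

    return sysfs_emit(buf, "%u\n", count);
  }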

> +}
> +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> +
>  /**
>   * pci_sriov_configure_simple - helper to configure SR-IOV
>   * @dev: the PCI device
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 3162f88fe940..1022fe9e6efd 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -991,6 +991,53 @@ int pci_msix_vec_count(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_msix_vec_count);
> 
> +/**
> + * pci_vf_set_msix_vec_count - change the reported number of MSI-X vectors
> + * This function is applicable for SR-IOV VF because such devices allocate
> + * their MSI-X table before they will run on the VMs and HW can't guess the
> + * right number of vectors, so the HW allocates them statically and equally.
> + * @dev: VF device that is going to be changed
> + * @count: amount of MSI-X vectors

s/amount/number/

> + **/
> +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
> +{
> +	struct pci_dev *pdev = pci_physfn(dev);
> +	int ret;
> +
> +	if (count < 0)
> +		/*
> +		 * We don't support negative numbers for now,
> +		 * but maybe in the future it will make sense.
> +		 */

Drop the comment; I don't think it adds useful information.

> +		return -EINVAL;
> +
> +	device_lock(&pdev->dev);
> +	if (!pdev->driver) {
> +		ret = -EOPNOTSUPP;
> +		goto err_pdev;
> +	}
> +
> +	device_lock(&dev->dev);
> +	if (dev->driver) {
> +		/*
> +		 * Driver already probed this VF and configured itself
> +		 * based on previously configured (or default) MSI-X vector
> +		 * count. It is too late to change this field for this
> +		 * specific VF.
> +		 */
> +		ret = -EBUSY;
> +		goto err_dev;
> +	}
> +
> +	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);

This looks like a NULL pointer dereference.

> +err_dev:
> +	device_unlock(&dev->dev);
> +err_pdev:
> +	device_unlock(&pdev->dev);
> +	return ret;
> +}
> +
>  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>  			     int nvec, struct irq_affinity *affd, int flags)
>  {
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 5c59365092fa..2bd6560d91e2 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
> 
>  #ifdef CONFIG_PCI_MSI
>  void pci_no_msi(void);
> +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count);
>  #else
>  static inline void pci_no_msi(void) { }
>  #endif
> @@ -326,6 +327,9 @@ struct pci_sriov {
>  	u16		subsystem_device; /* VF subsystem device */
>  	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
>  	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> +	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
> +					 * can consume
> +					 */

  * can consume */

Hopefully you can eliminate vf_total_msix altogether.

>  };
> 
>  /**
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index b32126d26997..24d118ad6e7b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -856,6 +856,8 @@ struct module;
>   *		e.g. drivers/net/e100.c.
>   * @sriov_configure: Optional driver callback to allow configuration of
>   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> + *              exposed by the sysfs "vf_msix_vec" entry.

"vf_msix_vec" is apparently stale?  There's no other reference in this
patch.

I think the important part is that this changes the number of vectors
advertised in the VF's MSI-X Message Control register, which will be
read when the VF driver enables MSI-X.

If that's true, why would we expose this via a sysfs file?  We can
easily read it via lspci.

>   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
>   * @groups:	Sysfs attribute groups.
>   * @driver:	Driver model structure.
> @@ -871,6 +873,7 @@ struct pci_driver {
>  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>  	void (*shutdown)(struct pci_dev *dev);
>  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>  	const struct pci_error_handlers *err_handler;
>  	const struct attribute_group **groups;
>  	struct device_driver	driver;
> @@ -2059,6 +2062,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> 
> +int pci_enable_vfs_overlay(struct pci_dev *dev);
> +void pci_disable_vfs_overlay(struct pci_dev *dev);
> +
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
> 
> @@ -2072,6 +2078,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
>  int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
>  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count);
> 
>  /* Arch may override these (weak) */
>  int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
> @@ -2100,6 +2107,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>  }
>  static inline void pci_iov_remove_virtfn(struct pci_dev *dev,
>  					 int id) { }
> +static inline int pci_enable_vfs_overlay(struct pci_dev *dev) { return 0; }
> +static inline void pci_disable_vfs_overlay(struct pci_dev *dev) {}

s/{}/{ }/
Please make your code match the rest of the file, e.g., the very next line!

>  static inline void pci_disable_sriov(struct pci_dev *dev) { }
>  static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
>  static inline int pci_vfs_assigned(struct pci_dev *dev)
> @@ -2112,6 +2121,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
>  { return 0; }
>  static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
> +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}

Also here.  Unless removing the space would make this fit in 80
columns.

>  #endif
> 
>  #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
> --
> 2.29.2
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mlx5-next v5 3/4] net/mlx5: Dynamically assign MSI-X vectors count
  2021-02-02 17:25   ` Bjorn Helgaas
@ 2021-02-02 19:11     ` Leon Romanovsky
  0 siblings, 0 replies; 16+ messages in thread
From: Leon Romanovsky @ 2021-02-02 19:11 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Tue, Feb 02, 2021 at 11:25:08AM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 26, 2021 at 10:57:29AM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > The number of MSI-X vectors is PCI property visible through lspci, that
> > field is read-only and configured by the device. The static assignment
> > of an amount of MSI-X vectors doesn't allow utilize the newly created
> > VF because it is not known to the device the future load and configuration
> > where that VF will be used.
> >
> > To overcome the inefficiency in the spread of such MSI-X vectors, we
> > allow the kernel to instruct the device with the needed number of such
> > vectors.
> >
> > Such change immediately increases the amount of MSI-X vectors for the
> > system with @ VFs from 12 vectors per-VF, to be 32 vectors per-VF.
>
> Not knowing anything about mlx5, it looks like maybe this gets some
> parameters from firmware on the device, then changes the way MSI-X
> vectors are distributed among VFs?

The mlx5 devices can operate in one of two modes: static MSI-X vector
table size and dynamic.

For the same number of VFs, the device gets 12 vectors per VF in static
mode. In dynamic mode, the total number is higher and we are able to
distribute that larger pool among the VFs more effectively.
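
As a purely illustrative calculation (the numbers are made up, not taken
from any real device): if the dynamic pool exposes 2048 vectors for 64
enabled VFs, an even split would be 2048 / 64 = 32 vectors per VF, but
with this interface the administrator can instead give 128 vectors to a
few I/O-heavy VFs and as few as 4 to the rest, as long as the sum stays
within the 2048 total.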

>
> I don't understand the implications above about "static assignment"
> and "inefficiency in the spread."  I guess maybe this takes advantage
> of the fact that you know how many VFs are enabled, so if NumVFs is
> less that TotalVFs, you can assign more vectors to each VF?

Internally, the FW uses a different pool and configuration scheme for
such distribution. In static mode, the count is pre-configured through
our FW configuration tool (nvconfig); in dynamic mode, the driver is
fully responsible. And yes, NumVFs helps to utilize it better.

>
> If that's the case, spell it out a little bit.  The current text makes
> it sound like you discovered brand new MSI-X vectors somewhere,
> regardless of how many VFs are enabled, which doesn't sound right.

I will do.

>
> > Before this patch:
> > [root@server ~]# lspci -vs 0000:08:00.2
> > 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
> > ....
> > 	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-
> >
> > After this patch:
> > [root@server ~]# lspci -vs 0000:08:00.2
> > 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
> > ....
> > 	Capabilities: [9c] MSI-X: Enable- Count=32 Masked-
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  .../net/ethernet/mellanox/mlx5/core/main.c    |  4 ++
> >  .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  5 ++
> >  .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 72 +++++++++++++++++++
> >  .../net/ethernet/mellanox/mlx5/core/sriov.c   | 13 +++-
> >  4 files changed, 92 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > index ca6f2fc39ea0..79cfcc844156 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > @@ -567,6 +567,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
> >  	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
> >  		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
> >
> > +	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
> > +		MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix,
> > +			 MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix));
> > +
> >  	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
> >  }
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > index 0a0302ce7144..5babb4434a87 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
> > @@ -172,6 +172,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
> >  		       struct notifier_block *nb);
> >  int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx,
> >  		       struct notifier_block *nb);
> > +
> > +int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
> > +			    int msix_vec_count);
> > +int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
> > +
> >  struct cpumask *
> >  mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
> >  struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> > index 6fd974920394..2a35888fcff0 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> > @@ -55,6 +55,78 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev *dev, int vecidx)
> >  	return &irq_table->irq[vecidx];
> >  }
> >
> > +/**
> > + * mlx5_get_default_msix_vec_count() - Get defaults of number of MSI-X vectors
> > + * to be set
>
> s/defaults of number of/default number of/
> s/to be set/to be assigned to each VF/ ?
>
> > + * @dev: PF to work on
> > + * @num_vfs: Number of VFs was asked when SR-IOV was enabled
>
> s/Number of VFs was asked when SR-IOV was enabled/Number of enabled VFs/ ?
>
> > + **/
>
> Documentation/doc-guide/kernel-doc.rst says kernel-doc comments end
> with just "*/" (not "**/").

netdev uses this style all over the place. Also, it is an internal API
call, so kdoc is not strictly required here; I followed the existing format.

I'll fix all comments and resubmit.

Thanks

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-02 18:06   ` Bjorn Helgaas
@ 2021-02-02 19:44     ` Leon Romanovsky
  2021-02-04  0:10       ` Bjorn Helgaas
  0 siblings, 1 reply; 16+ messages in thread
From: Leon Romanovsky @ 2021-02-02 19:44 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Extend PCI sysfs interface with a new callback that allows configure
> > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > to optimize the performance of newly bound devices by allocating
> > the number of vectors based on the administrator knowledge of targeted VM.
>
> s/configure/configuration of/
> s/SR-IO/SR-IOV/
> s/newly bound/VFs/ ?
> s/VF/VFs/
> s/knowledge of targeted VM/knowledge of the intended use of the VF/
> (I'm not a VF expert, but I think they can be used even without VMs)
>
> I'm reading between the lines here, but IIUC the point is that you
> have a PF that supports a finite number of MSI-X vectors for use by
> all the VFs, and this interface is to control the distribution of
> those MSI-X vectors among the VFs.

MSI-X vectors are a HW resource; all devices in the world have a
limitation here.

>
> > This function is applicable for SR-IOV VF because such devices allocate
> > their MSI-X table before they will run on the VMs and HW can't guess the
> > right number of vectors, so the HW allocates them statically and equally.
>
> This is written in a way that suggests this is behavior required by
> the PCIe spec.  If it is indeed related to something in the spec,
> please cite it.

The spec doesn't say it directly, but you will need to really hurt your
users' brains if you decide to do it differently. You have one enable bit
to create all VFs at the same time, without any option to configure them
in advance.

Of course, you can create some partition map, upload it to FW and create from
there.

>
> But I think this is actually something device-specific, not something
> we can derive directly from the spec.  If that's the case, be clear
> that we're addressing a device-specific need, and we're hoping that
> this will be useful for other devices as well.

I will add.

>
> "such devices allocate their MSI-X table before they will run on the
> VMs": Let's be specific here.  This MSI-X Table allocation apparently
> doesn't happen when we set VF Enable in the PF, because these sysfs
> files are attached to the VFs, which don't exist yet.  It's not the VF
> driver binding, because that's a software construct.  What is the
> hardware event that triggers the allocation?

Write of MSI-X vector count to the FW through PF.

>
> Obviously the distribution among VFs can be changed after VF Enable is
> set.  Maybe the distribution is dynamic, and the important point is
> that it must be changed before the VF driver reads the Message Control
> register for Table Size?

Yes

>
> But that isn't the same as "devices allocating their MSI-X table
> before being passed through to a VM," so it's confusing.  The
> language about allocating the MSI-X table needs to be made precise
> here and in the code comments below.
>
> "before they will run on the VMs": Devices don't "run on VMs".  I
> think the usual terminology is that a device may be "passed through to
> a VM".
>
> "HW allocates them statically and equally" sounds like a description
> of some device-specific behavior (unless there's something in the spec
> that requires this, in which case you should cite it).  It's OK if
> this is device-specific; just don't pretend that it's generic if it's
> not.

I will change.

>
> > 1) The newly added /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> > file will be seen for the VFs and it is writable as long as a driver is not
> > bounded to the VF.
>
> "bound to the VF"
>
> > The values accepted are:
> >  * > 0 - this will be number reported by the VF's MSI-X capability
>
> Specifically, I guess by Table Size in the VF's MSI-X Message Control
> register?

Right

>
> >  * < 0 - not valid
> >  * = 0 - will reset to the device default value
> >
> > 2) In order to make management easy, provide new read-only sysfs file that
> > returns a total number of possible to configure MSI-X vectors.
> >
> > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> >   = 0 - feature is not supported
> >   > 0 - total number of MSI-X vectors to consume by the VFs
>
> "total number of MSI-X vectors available for distribution among the
> VFs"?

Users need to be aware of how many vectors exist in the system.

>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-pci |  32 +++++
> >  drivers/pci/iov.c                       | 180 ++++++++++++++++++++++++
> >  drivers/pci/msi.c                       |  47 +++++++
> >  drivers/pci/pci.h                       |   4 +
> >  include/linux/pci.h                     |  10 ++
> >  5 files changed, 273 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > index 25c9c39770c6..4d206ade5331 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > @@ -375,3 +375,35 @@ Description:
> >  		The value comes from the PCI kernel device state and can be one
> >  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> >  		The file is read only.
> > +
> > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> > +Date:		January 2021
> > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > +Description:
> > +		This file is associated with the SR-IOV VFs.
> > +		It allows configuration of the number of MSI-X vectors for
> > +		the VF. This is needed to optimize performance of newly bound
> > +		devices by allocating the number of vectors based on the
> > +		administrator knowledge of targeted VM.
> > +
> > +		The values accepted are:
> > +		 * > 0 - this will be number reported by the VF's MSI-X
> > +			 capability
> > +		 * < 0 - not valid
> > +		 * = 0 - will reset to the device default value
> > +
> > +		The file is writable if the PF is bound to a driver that
> > +		set sriov_vf_total_msix > 0 and there is no driver bound
> > +		to the VF.
> > +
> > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > +Date:		January 2021
> > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > +Description:
> > +		This file is associated with the SR-IOV PFs.
> > +		It returns a total number of possible to configure MSI-X
> > +		vectors on the enabled VFs.
> > +
> > +		The values returned are:
> > +		 * > 0 - this will be total number possible to consume by VFs,
> > +		 * = 0 - feature is not supported
>
> What does "vfs_overlay" mean?  "vfs" sounds like the Virtual File
> System.

VFs == many VF?

>
> Do these filenames really need to contain both "sriov" and "vf"?

This is what I was asked at some point. In previous versions the name
was without "sriov".

>
> Should these be next to the existing SR-IOV sysfs files, i.e., in or
> alongside sriov_dev_attr_group?

This was suggested in previous versions. I didn't hear anyone
supporting it.

>
> Hmmm, I see pci_enable_vfs_overlay() is called by the driver.  I don't
> really like that because then we're dependent on drivers to maintain
> the PCI sysfs hierarchy.  E.g., a driver might forget to call
> pci_disable_vfs_overlay(), and then a future driver load will fail.

It is no different from any other kernel bug. I have a gazillion ways to
break the system with a crappy driver.

>
> Maybe this could be done with .is_visible() functions that call driver
> callbacks.

It was in previous versions too, but this solution allows PF to control
the VFs overlay dynamically.

>
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 4afd4ee4f7f0..3e95f835eba5 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> >  	return (dev->devfn + dev->sriov->offset +
> >  		dev->sriov->stride * vf_id) & 0xff;
> >  }
> > +EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
> >
> >  /*
> >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> > @@ -157,6 +158,166 @@ int pci_iov_sysfs_link(struct pci_dev *dev,
> >  	return rc;
> >  }
> >
> > +#ifdef CONFIG_PCI_MSI
> > +static ssize_t sriov_vf_msix_count_show(struct device *dev,
> > +					struct device_attribute *attr,
> > +					char *buf)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	int count = pci_msix_vec_count(pdev);
> > +
> > +	if (count < 0)
> > +		return count;
> > +
> > +	return sysfs_emit(buf, "%d\n", count);
> > +}
> > +
> > +static ssize_t sriov_vf_msix_count_store(struct device *dev,
> > +					 struct device_attribute *attr,
> > +					 const char *buf, size_t count)
> > +{
> > +	struct pci_dev *vf_dev = to_pci_dev(dev);
> > +	int val, ret;
> > +
> > +	ret = kstrtoint(buf, 0, &val);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pci_vf_set_msix_vec_count(vf_dev, val);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return count;
> > +}
> > +static DEVICE_ATTR_RW(sriov_vf_msix_count);
> > +
> > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > +					struct device_attribute *attr,
> > +					char *buf)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	return sysfs_emit(buf, "%u\n", pdev->sriov->vf_total_msix);
> > +}
> > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > +#endif
> > +
> > +static struct attribute *sriov_pf_dev_attrs[] = {
> > +#ifdef CONFIG_PCI_MSI
> > +	&dev_attr_sriov_vf_total_msix.attr,
> > +#endif
> > +	NULL,
> > +};
> > +
> > +static struct attribute *sriov_vf_dev_attrs[] = {
> > +#ifdef CONFIG_PCI_MSI
> > +	&dev_attr_sriov_vf_msix_count.attr,
> > +#endif
> > +	NULL,
> > +};
> > +
> > +static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj,
> > +					  struct attribute *a, int n)
> > +{
> > +	struct device *dev = kobj_to_dev(kobj);
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	if (!pdev->msix_cap || !dev_is_pf(dev))
> > +		return 0;
> > +
> > +	return a->mode;
> > +}
> > +
> > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> > +					  struct attribute *a, int n)
> > +{
> > +	struct device *dev = kobj_to_dev(kobj);
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	if (!pdev->msix_cap || dev_is_pf(dev))
> > +		return 0;
> > +
> > +	return a->mode;
> > +}
> > +
> > +static const struct attribute_group sriov_pf_dev_attr_group = {
> > +	.attrs = sriov_pf_dev_attrs,
> > +	.is_visible = sriov_pf_attrs_are_visible,
> > +	.name = "vfs_overlay",
> > +};
> > +
> > +static const struct attribute_group sriov_vf_dev_attr_group = {
> > +	.attrs = sriov_vf_dev_attrs,
> > +	.is_visible = sriov_vf_attrs_are_visible,
> > +	.name = "vfs_overlay",
> > +};
> > +
> > +int pci_enable_vfs_overlay(struct pci_dev *dev)
> > +{
> > +	struct pci_dev *virtfn;
> > +	int id, ret;
> > +
> > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > +		return 0;
> > +
> > +	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > +	if (ret)
> > +		return ret;
> > +
> > +	for (id = 0; id < dev->sriov->num_VFs; id++) {
> > +		virtfn = pci_get_domain_bus_and_slot(
> > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > +			pci_iov_virtfn_devfn(dev, id));
> > +
> > +		if (!virtfn)
> > +			continue;
> > +
> > +		ret = sysfs_create_group(&virtfn->dev.kobj,
> > +					 &sriov_vf_dev_attr_group);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +	return 0;
> > +
> > +out:
> > +	while (id--) {
> > +		virtfn = pci_get_domain_bus_and_slot(
> > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > +			pci_iov_virtfn_devfn(dev, id));
> > +
> > +		if (!virtfn)
> > +			continue;
> > +
> > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > +	}
> > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
> > +
> > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > +{
> > +	struct pci_dev *virtfn;
> > +	int id;
> > +
> > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > +		return;
> > +
> > +	id = dev->sriov->num_VFs;
> > +	while (id--) {
> > +		virtfn = pci_get_domain_bus_and_slot(
> > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > +			pci_iov_virtfn_devfn(dev, id));
> > +
> > +		if (!virtfn)
> > +			continue;
> > +
> > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > +	}
> > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > +}
> > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
>
> I'm not convinced all this sysfs wrangling is necessary.  If it is,
> add a hint in a comment about why this is special and can't use
> something like sriov_dev_attr_group.

This makes the overlay PF-driven. Alexander insisted on this flow.

>
> >  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> >  {
> >  	int i;
> > @@ -596,6 +757,7 @@ static void sriov_disable(struct pci_dev *dev)
> >  		sysfs_remove_link(&dev->dev.kobj, "dep_link");
> >
> >  	iov->num_VFs = 0;
> > +	iov->vf_total_msix = 0;
> >  	pci_iov_set_numvfs(dev, 0);
> >  }
> >
> > @@ -1054,6 +1216,24 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> >
> > +/**
> > + * pci_sriov_set_vf_total_msix - set total number of MSI-X vectors for the VFs
> > + * @dev: the PCI PF device
> > + * @count: the total number of MSI-X vector to consume by the VFs
> > + *
> > + * Sets the number of MSI-X vectors that is possible to consume by the VFs.
> > + * This interface is complimentary part of the pci_vf_set_msix_vec_count()
>
> s/Sets the/Set the/
> s/complimentary part of the/complementary to/
>
> > + * that will be used to configure the required number on the VF.
> > + */
> > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > +{
> > +	if (!dev->is_physfn)
> > +		return;
> > +
> > +	dev->sriov->vf_total_msix = count;
>
> The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> mlx5, calls this, and all the PCI core does is hang onto the value and
> expose it via sysfs.  I think I'd rather have a callback in struct
> pci_driver and let the driver supply the value when needed.  I.e.,
> sriov_vf_total_msix_show() would call the callback instead of looking
> at pdev->sriov->vf_total_msix.

It will cause unnecessary locking to ensure that the driver doesn't
vanish during a sysfs read. I can change it, but I don't think that is
the right decision.

>
> > +}
> > +EXPORT_SYMBOL_GPL(pci_sriov_set_vf_total_msix);
> > +
> >  /**
> >   * pci_sriov_configure_simple - helper to configure SR-IOV
> >   * @dev: the PCI device
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 3162f88fe940..1022fe9e6efd 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -991,6 +991,53 @@ int pci_msix_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msix_vec_count);
> >
> > +/**
> > + * pci_vf_set_msix_vec_count - change the reported number of MSI-X vectors
> > + * This function is applicable for SR-IOV VF because such devices allocate
> > + * their MSI-X table before they will run on the VMs and HW can't guess the
> > + * right number of vectors, so the HW allocates them statically and equally.
> > + * @dev: VF device that is going to be changed
> > + * @count: amount of MSI-X vectors
>
> s/amount/number/
>
> > + **/
> > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
> > +{
> > +	struct pci_dev *pdev = pci_physfn(dev);
> > +	int ret;
> > +
> > +	if (count < 0)
> > +		/*
> > +		 * We don't support negative numbers for now,
> > +		 * but maybe in the future it will make sense.
> > +		 */
>
> Drop the comment; I don't think it adds useful information.
>
> > +		return -EINVAL;
> > +
> > +	device_lock(&pdev->dev);
> > +	if (!pdev->driver) {
> > +		ret = -EOPNOTSUPP;
> > +		goto err_pdev;
> > +	}
> > +
> > +	device_lock(&dev->dev);
> > +	if (dev->driver) {
> > +		/*
> > +		 * Driver already probed this VF and configured itself
> > +		 * based on previously configured (or default) MSI-X vector
> > +		 * count. It is too late to change this field for this
> > +		 * specific VF.
> > +		 */
> > +		ret = -EBUSY;
> > +		goto err_dev;
> > +	}
> > +
> > +	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);
>
> This looks like a NULL pointer dereference.

In the current code it is impossible: pci_vf_set_msix_vec_count() is
called only for devices that expose sriov_vf_msix_count, which is only
possible when ->sriov_set_msix_vec_count() is implemented.

But I will add the check to be bulletproof for future extensions.
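
Something along these lines, reusing the existing error path in
pci_vf_set_msix_vec_count() (just a sketch, not the final patch):

    device_lock(&pdev->dev);
    if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
      ret = -EOPNOTSUPP;
      goto err_pdev;
    }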

>
> > +err_dev:
> > +	device_unlock(&dev->dev);
> > +err_pdev:
> > +	device_unlock(&pdev->dev);
> > +	return ret;
> > +}
> > +
> >  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> >  			     int nvec, struct irq_affinity *affd, int flags)
> >  {
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 5c59365092fa..2bd6560d91e2 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -183,6 +183,7 @@ extern unsigned int pci_pm_d3hot_delay;
> >
> >  #ifdef CONFIG_PCI_MSI
> >  void pci_no_msi(void);
> > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count);
> >  #else
> >  static inline void pci_no_msi(void) { }
> >  #endif
> > @@ -326,6 +327,9 @@ struct pci_sriov {
> >  	u16		subsystem_device; /* VF subsystem device */
> >  	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
> >  	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> > +	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
> > +					 * can consume
> > +					 */
>
>   * can consume */
>
> Hopefully you can eliminate vf_total_msix altogether.

I think that will be worse from the flow point of view (extra locks), and
memory-wise too, if that is your concern. This variable consumes 4 bytes;
the pointer to the function will take 8 bytes.

>
> >  };
> >
> >  /**
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index b32126d26997..24d118ad6e7b 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -856,6 +856,8 @@ struct module;
> >   *		e.g. drivers/net/e100.c.
> >   * @sriov_configure: Optional driver callback to allow configuration of
> >   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > + *              exposed by the sysfs "vf_msix_vec" entry.
>
> "vf_msix_vec" is apparently stale?  There's no other reference in this
> patch.
>
> I think the important part is that this changes the number of vectors
> advertised in the VF's MSI-X Message Control register, which will be
> read when the VF driver enables MSI-X.
>
> If that's true, why would we expose this via a sysfs file?  We can
> easily read it via lspci.

I did it for feature completeness; we don't need to read this via sysfs.

>
> >   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
> >   * @groups:	Sysfs attribute groups.
> >   * @driver:	Driver model structure.
> > @@ -871,6 +873,7 @@ struct pci_driver {
> >  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
> >  	void (*shutdown)(struct pci_dev *dev);
> >  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> > +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
> >  	const struct pci_error_handlers *err_handler;
> >  	const struct attribute_group **groups;
> >  	struct device_driver	driver;
> > @@ -2059,6 +2062,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
> >  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
> >  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> >
> > +int pci_enable_vfs_overlay(struct pci_dev *dev);
> > +void pci_disable_vfs_overlay(struct pci_dev *dev);
> > +
> >  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> >  void pci_disable_sriov(struct pci_dev *dev);
> >
> > @@ -2072,6 +2078,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
> >  int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
> >  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
> >  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count);
> >
> >  /* Arch may override these (weak) */
> >  int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
> > @@ -2100,6 +2107,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> >  }
> >  static inline void pci_iov_remove_virtfn(struct pci_dev *dev,
> >  					 int id) { }
> > +static inline int pci_enable_vfs_overlay(struct pci_dev *dev) { return 0; }
> > +static inline void pci_disable_vfs_overlay(struct pci_dev *dev) {}
>
> s/{}/{ }/
> Please make your code match the rest of the file, e.g., the very next line!
>
> >  static inline void pci_disable_sriov(struct pci_dev *dev) { }
> >  static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
> >  static inline int pci_vfs_assigned(struct pci_dev *dev)
> > @@ -2112,6 +2121,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
> >  static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
> >  { return 0; }
> >  static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
> > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
>
> Also here.  Unless removing the space would make this fit in 80
> columns.

Yes, it is 80 columns without space.

Thanks

>
> >  #endif
> >
> >  #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
> > --
> > 2.29.2
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-02 19:44     ` Leon Romanovsky
@ 2021-02-04  0:10       ` Bjorn Helgaas
  2021-02-04 15:50         ` Leon Romanovsky
  0 siblings, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2021-02-04  0:10 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > Extend PCI sysfs interface with a new callback that allows
> > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > This is needed to optimize the performance of newly bound
> > > devices by allocating the number of vectors based on the
> > > administrator knowledge of targeted VM.
> >
> > I'm reading between the lines here, but IIUC the point is that you
> > have a PF that supports a finite number of MSI-X vectors for use
> > by all the VFs, and this interface is to control the distribution
> > of those MSI-X vectors among the VFs.
> 
> MSI-X vectors are a HW resource; all devices in the world have a
> limitation here.
>
> > > This function is applicable for SR-IOV VF because such devices
> > > allocate their MSI-X table before they will run on the VMs and
> > > HW can't guess the right number of vectors, so the HW allocates
> > > them statically and equally.
> >
> > This is written in a way that suggests this is behavior required
> > by the PCIe spec.  If it is indeed related to something in the
> > spec, please cite it.
> 
> The spec doesn't say it directly, but you will need to really hurt
> your users' brains if you decide to do it differently. You have one
> enable bit to create all VFs at the same time, without any option to
> configure them in advance.
> 
> Of course, you can create some partition map, upload it to FW and
> create from there.

Of course all devices have limitations.  But let's add some details
about the way *this* device works.  That will help avoid giving the
impression that this is the *only* way spec-conforming devices can
work.

> > "such devices allocate their MSI-X table before they will run on
> > the VMs": Let's be specific here.  This MSI-X Table allocation
> > apparently doesn't happen when we set VF Enable in the PF, because
> > these sysfs files are attached to the VFs, which don't exist yet.
> > It's not the VF driver binding, because that's a software
> > construct.  What is the hardware event that triggers the
> > allocation?
> 
> Write of MSI-X vector count to the FW through PF.

This is an example of something that is obviously specific to this
mlx5 device.  The Table Size field in Message Control is RO per spec,
and obviously firmware on the device is completely outside the scope
of the PCIe spec.

This commit log should describe how *this* device manages this
allocation and how the PF Table Size and the VF Table Sizes are
related.  Per PCIe, there is no necessary connection between them.

> > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > >   = 0 - feature is not supported
> > >   > 0 - total number of MSI-X vectors to consume by the VFs
> >
> > "total number of MSI-X vectors available for distribution among the
> > VFs"?
> 
> Users need to be aware of how many vectors exist in the system.

Understood -- if there's an interface to influence the distribution of
vectors among VFs, one needs to know how many vectors there are to
work with.

My point was that "number of vectors to consume by VFs" is awkward
wording, so I suggested an alternative.

> > > +What:            /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> > > +Date:            January 2021
> > > +Contact: Leon Romanovsky <leonro@nvidia.com>
> > > +Description:
> > > +         This file is associated with the SR-IOV VFs.
> > > +         It allows configuration of the number of MSI-X vectors for
> > > +         the VF. This is needed to optimize performance of newly bound
> > > +         devices by allocating the number of vectors based on the
> > > +         administrator knowledge of targeted VM.
> > > +
> > > +         The values accepted are:
> > > +          * > 0 - this will be number reported by the VF's MSI-X
> > > +                  capability
> > > +          * < 0 - not valid
> > > +          * = 0 - will reset to the device default value
> > > +
> > > +         The file is writable if the PF is bound to a driver that
> > > +         set sriov_vf_total_msix > 0 and there is no driver bound
> > > +         to the VF.

Drivers don't actually set "sriov_vf_total_msix".  This should
probably say something like "the PF is bound to a driver that
implements ->sriov_set_msix_vec_count()."

> > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > +Date:		January 2021
> > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > +Description:
> > > +		This file is associated with the SR-IOV PFs.
> > > +		It returns a total number of possible to configure MSI-X
> > > +		vectors on the enabled VFs.
> > > +
> > > +		The values returned are:
> > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > +		 * = 0 - feature is not supported

Can you swap the order of these two files in the documentation?
sriov_vf_total_msix is associated with the PF and logically precedes
sriov_vf_msix_count, which is associated with VFs.

Not sure the "= 0" description is necessary here.  If the value
returned is the number of MSI-X vectors available for assignment to
VFs, "0" is a perfectly legitimate value.  It just means there are
none.  It doesn't need to be described separately.

> > Do these filenames really need to contain both "sriov" and "vf"?
> 
> This is what I was asked at some point. In previous versions the name
> was without "sriov".

We currently have:

  $ grep DEVICE_ATTR drivers/pci/iov.c
  static DEVICE_ATTR_RO(sriov_totalvfs);
  static DEVICE_ATTR_RW(sriov_numvfs);
  static DEVICE_ATTR_RO(sriov_offset);
  static DEVICE_ATTR_RO(sriov_stride);
  static DEVICE_ATTR_RO(sriov_vf_device);
  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);

If we put them in a new "vfs_overlay" directory, it seems like
overkill to repeat the "vf" part, but I'm hoping the new files can end
up next to these existing files.  In that case, I think it makes sense
to include "sriov".  And it probably does make sense to include "vf"
as well.

> > Should these be next to the existing SR-IOV sysfs files, i.e., in or
> > alongside sriov_dev_attr_group?
> 
> This was suggested in previous versions. I didn't hear anyone
> supporting it.

Pointer to this discussion?  I'm not sure what value is added by a new
directory.

> > Hmmm, I see pci_enable_vfs_overlay() is called by the driver.  I don't
> > really like that because then we're dependent on drivers to maintain
> > the PCI sysfs hierarchy.  E.g., a driver might forget to call
> > pci_disable_vfs_overlay(), and then a future driver load will fail.
> > 
> > Maybe this could be done with .is_visible() functions that call driver
> > callbacks.
> 
> It was in previous versions too, but this solution allows PF to control
> the VFs overlay dynamically.

See below; I think it might be possible to do this dynamically even
without pci_enable_vfs_overlay().

> > > +int pci_enable_vfs_overlay(struct pci_dev *dev)
> > > +{
> > > +	struct pci_dev *virtfn;
> > > +	int id, ret;
> > > +
> > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > +		return 0;
> > > +
> > > +	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	for (id = 0; id < dev->sriov->num_VFs; id++) {
> > > +		virtfn = pci_get_domain_bus_and_slot(
> > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > +			pci_iov_virtfn_devfn(dev, id));
> > > +
> > > +		if (!virtfn)
> > > +			continue;
> > > +
> > > +		ret = sysfs_create_group(&virtfn->dev.kobj,
> > > +					 &sriov_vf_dev_attr_group);
> > > +		if (ret)
> > > +			goto out;
> > > +	}
> > > +	return 0;
> > > +
> > > +out:
> > > +	while (id--) {
> > > +		virtfn = pci_get_domain_bus_and_slot(
> > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > +			pci_iov_virtfn_devfn(dev, id));
> > > +
> > > +		if (!virtfn)
> > > +			continue;
> > > +
> > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > +	}
> > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
> > > +
> > > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > > +{
> > > +	struct pci_dev *virtfn;
> > > +	int id;
> > > +
> > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > +		return;
> > > +
> > > +	id = dev->sriov->num_VFs;
> > > +	while (id--) {
> > > +		virtfn = pci_get_domain_bus_and_slot(
> > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > +			pci_iov_virtfn_devfn(dev, id));
> > > +
> > > +		if (!virtfn)
> > > +			continue;
> > > +
> > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > +	}
> > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
> >
> > I'm not convinced all this sysfs wrangling is necessary.  If it is,
> > add a hint in a comment about why this is special and can't use
> > something like sriov_dev_attr_group.
> 
> This makes the overlay PF-driven. Alexander insisted on this flow.

If you're referring to [1], I think "insisted on this flow" might be
an overstatement of what Alex was looking for.  IIUC Alex wants the
sysfs files to be visible only when they're useful, i.e., when a
driver implements ->sriov_set_msix_vec_count().

That seems reasonable and also seems like something a smarter
.is_visible() function could accomplish without having drivers call
pci_enable_vfs_overlay(), e.g., maybe some variation of this:

  static umode_t sriov_vf_attrs_are_visible(...)
  {
    if (!pdev->msix_cap || dev_is_pf(dev))
      return 0;

    pf = pci_physfn(pdev);
    if (pf->driver && pf->driver->sriov_set_msix_vec_count)
      return a->mode;

    return 0;
  }

(Probably needs locking while we look at pf->driver, just like in
pci_vf_set_msix_vec_count().)
 
pci_enable_vfs_overlay() significantly complicates the code and it's
the sort of sysfs code we're trying to avoid, so if we really do need
it, please add a brief comment about *why* we have to do it that way.

> > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > > +{
> > > +	if (!dev->is_physfn)
> > > +		return;
> > > +
> > > +	dev->sriov->vf_total_msix = count;
> >
> > The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> > mlx5, calls this, and all the PCI core does is hang onto the value and
> > expose it via sysfs.  I think I'd rather have a callback in struct
> > pci_driver and let the driver supply the value when needed.  I.e.,
> > sriov_vf_total_msix_show() would call the callback instead of looking
> > at pdev->sriov->vf_total_msix.
> 
> It will cause unnecessary locking to ensure that the driver doesn't
> vanish during a sysfs read. I can change it, but I don't think that is
> the right decision.

Doesn't sysfs already ensure the driver can't vanish while we're
executing a DEVICE_ATTR accessor?

> > > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
> > > +{
> > > +	struct pci_dev *pdev = pci_physfn(dev);
> > > +	int ret;
> > > +
> > > +	if (count < 0)
> > > +		/*
> > > +		 * We don't support negative numbers for now,
> > > +		 * but maybe in the future it will make sense.
> > > +		 */
> >
> > Drop the comment; I don't think it adds useful information.
> >
> > > +		return -EINVAL;
> > > +
> > > +	device_lock(&pdev->dev);
> > > +	if (!pdev->driver) {
> > > +		ret = -EOPNOTSUPP;
> > > +		goto err_pdev;
> > > +	}
> > > +
> > > +	device_lock(&dev->dev);
> > > +	if (dev->driver) {
> > > +		/*
> > > +		 * Driver already probed this VF and configured itself
> > > +		 * based on previously configured (or default) MSI-X vector
> > > +		 * count. It is too late to change this field for this
> > > +		 * specific VF.
> > > +		 */
> > > +		ret = -EBUSY;
> > > +		goto err_dev;
> > > +	}
> > > +
> > > +	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);
> >
> > This looks like a NULL pointer dereference.
> 
> In the current code it is impossible: pci_vf_set_msix_vec_count() is
> called only for devices that expose sriov_vf_msix_count, which is only
> possible when ->sriov_set_msix_vec_count() is implemented.

OK.  I think you're right, but it takes quite a lot of analysis to
prove that right now.  If the .is_visible() function for
sriov_vf_msix_count checked whether the driver implements
->sriov_set_msix_vec_count(), it would be quite a bit easier,
and it might even help get rid of pci_enable_vfs_overlay().

Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's
no actual *reason* for it to be there other than the fact that it has
"msix" in the name.  It uses no MSI data structures.  Maybe it could
be folded into sriov_vf_msix_count_store(), which would make the
analysis even easier.
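
Something like this, just to show the shape of it (untested, and
ignoring the .is_visible() question above):

  static ssize_t sriov_vf_msix_count_store(struct device *dev,
                                           struct device_attribute *attr,
                                           const char *buf, size_t count)
  {
    struct pci_dev *vf_dev = to_pci_dev(dev);
    struct pci_dev *pdev = pci_physfn(vf_dev);
    int val, ret;

    ret = kstrtoint(buf, 0, &val);
    if (ret)
      return ret;

    if (val < 0)
      return -EINVAL;

    device_lock(&pdev->dev);
    if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
      ret = -EOPNOTSUPP;
      goto unlock_pf;
    }

    device_lock(&vf_dev->dev);
    if (vf_dev->driver) {
      /* too late, the VF driver already configured itself */
      ret = -EBUSY;
      goto unlock_vf;
    }

    ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);

  unlock_vf:
    device_unlock(&vf_dev->dev);
  unlock_pf:
    device_unlock(&pdev->dev);
    return ret ? ret : count;
  }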

> > > @@ -326,6 +327,9 @@ struct pci_sriov {
> > >  	u16		subsystem_device; /* VF subsystem device */
> > >  	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
> > >  	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> > > +	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
> > > +					 * can consume
> > > +					 */
> >
> >   * can consume */
> >
> > Hopefully you can eliminate vf_total_msix altogether.
> 
> I think that will be worse from the flow point of view (extra locks),
> and memory-wise too, if that is your concern. This variable consumes 4
> bytes; the pointer to the function will take 8 bytes.

I'm not concerned about the space.  The function pointer is in struct
pci_driver, whereas the variable is in struct pci_sriov, so the
variable would likely consume more space because there are probably
more VFs than pci_drivers.

My bigger concern is that vf_total_msix is really *driver* state, not
PCI core state, and I'd prefer if the PCI core were not responsible
for it.

> > > +++ b/include/linux/pci.h
> > > @@ -856,6 +856,8 @@ struct module;
> > >   *		e.g. drivers/net/e100.c.
> > >   * @sriov_configure: Optional driver callback to allow configuration of
> > >   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> > > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > > + *              exposed by the sysfs "vf_msix_vec" entry.
> >
> > "vf_msix_vec" is apparently stale?  There's no other reference in this
> > patch.
> >
> > I think the important part is that this changes the number of vectors
> > advertised in the VF's MSI-X Message Control register, which will be
> > read when the VF driver enables MSI-X.
> >
> > If that's true, why would we expose this via a sysfs file?  We can
> > easily read it via lspci.
> 
> I did it for feature completeness; we don't need to read this via sysfs.

If you don't need this sysfs file, just remove it.  If you do need it,
please update the "vf_msix_vec" so it's correct.  Also, clarify the
description so we can tell that we're changing the Table Size VFs will
see in their Message Control registers.

> > > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
> >
> > Also here.  Unless removing the space would make this fit in 80
> > columns.
> 
> Yes, it is 80 columns without space.

No, I don't think it is.  In an 80 column window, the result looks
like:

  static inline void pci_sriov_set_vf_total_msix(...) {
  }

Please change it so it looks like this so it matches the rest of the
file:

  static inline void pci_sriov_set_vf_total_msix(...)
  { }

Bjorn

[1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-04  0:10       ` Bjorn Helgaas
@ 2021-02-04 15:50         ` Leon Romanovsky
  2021-02-04 21:12           ` Bjorn Helgaas
  0 siblings, 1 reply; 16+ messages in thread
From: Leon Romanovsky @ 2021-02-04 15:50 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > Extend PCI sysfs interface with a new callback that allows
> > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > This is needed to optimize the performance of newly bound
> > > > devices by allocating the number of vectors based on the
> > > > administrator knowledge of targeted VM.
> > >
> > > I'm reading between the lines here, but IIUC the point is that you
> > > have a PF that supports a finite number of MSI-X vectors for use
> > > by all the VFs, and this interface is to control the distribution
> > > of those MSI-X vectors among the VFs.
> >
> > MSI-X vectors are a HW resource; all devices in the world have a
> > limitation here.
> >
> > > > This function is applicable for SR-IOV VF because such devices
> > > > allocate their MSI-X table before they will run on the VMs and
> > > > HW can't guess the right number of vectors, so the HW allocates
> > > > them statically and equally.
> > >
> > > This is written in a way that suggests this is behavior required
> > > by the PCIe spec.  If it is indeed related to something in the
> > > spec, please cite it.
> >
> > The spec doesn't say it directly, but you will need to really hurt
> > your users' brains if you decide to do it differently. You have one
> > enable bit to create all VFs at the same time, without any option to
> > configure them in advance.
> >
> > Of course, you can create some partition map, upload it to FW and
> > create from there.
>
> Of course all devices have limitations.  But let's add some details
> about the way *this* device works.  That will help avoid giving the
> impression that this is the *only* way spec-conforming devices can
> work.

Sure

>
> > > "such devices allocate their MSI-X table before they will run on
> > > the VMs": Let's be specific here.  This MSI-X Table allocation
> > > apparently doesn't happen when we set VF Enable in the PF, because
> > > these sysfs files are attached to the VFs, which don't exist yet.
> > > It's not the VF driver binding, because that's a software
> > > construct.  What is the hardware event that triggers the
> > > allocation?
> >
> > Write of MSI-X vector count to the FW through PF.
>
> This is an example of something that is obviously specific to this
> mlx5 device.  The Table Size field in Message Control is RO per spec,
> and obviously firmware on the device is completely outside the scope
> of the PCIe spec.
>
> This commit log should describe how *this* device manages this
> allocation and how the PF Table Size and the VF Table Sizes are
> related.  Per PCIe, there is no necessary connection between them.

There is no connection in mlx5 devices either. PF is used as a vehicle
to access VF that doesn't have driver yet. From "table size" perspective
they completely independent, because PF already probed by driver and
it is already too late to change it.

So PF table size is static and can be changed through FW utility only.
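
For what it's worth, the "vehicle" is just the new op this series adds to
struct pci_driver; a declaration-level sketch (the mlx5 hookup names below
are illustrative, not taken from the mlx5 patch):

  /* New op (from this series): the PCI core calls it on the *PF's* driver,
   * passing the VF's pci_dev, and the PF driver pushes the new count to FW
   * on the VF's behalf.
   */
  int (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count);

  /* Illustrative PF driver hookup: */
  static struct pci_driver mlx5_core_driver = {
	.name			  = "mlx5_core",
	.sriov_configure	  = mlx5_core_sriov_configure,
	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
  };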

>
> > > > cat /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > >   = 0 - feature is not supported
> > > >   > 0 - total number of MSI-X vectors to consume by the VFs
> > >
> > > "total number of MSI-X vectors available for distribution among the
> > > VFs"?
> >
> > Users need to be aware of how much vectors exist in the system.
>
> Understood -- if there's an interface to influence the distribution of
> vectors among VFs, one needs to know how many vectors there are to
> work with.
>
> My point was that "number of vectors to consume by VFs" is awkward
> wording, so I suggested an alternative.

Got it, thanks

>
> > > > +What:            /sys/bus/pci/devices/.../vfs_overlay/sriov_vf_msix_count
> > > > +Date:            January 2021
> > > > +Contact: Leon Romanovsky <leonro@nvidia.com>
> > > > +Description:
> > > > +         This file is associated with the SR-IOV VFs.
> > > > +         It allows configuration of the number of MSI-X vectors for
> > > > +         the VF. This is needed to optimize performance of newly bound
> > > > +         devices by allocating the number of vectors based on the
> > > > +         administrator knowledge of targeted VM.
> > > > +
> > > > +         The values accepted are:
> > > > +          * > 0 - this will be number reported by the VF's MSI-X
> > > > +                  capability
> > > > +          * < 0 - not valid
> > > > +          * = 0 - will reset to the device default value
> > > > +
> > > > +         The file is writable if the PF is bound to a driver that
> > > > +         set sriov_vf_total_msix > 0 and there is no driver bound
> > > > +         to the VF.
>
> Drivers don't actually set "sriov_vf_total_msix".  This should
> probably say something like "the PF is bound to a driver that
> implements ->sriov_set_msix_vec_count()."

Sure, will change.

>
> > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > +Date:		January 2021
> > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > +Description:
> > > > +		This file is associated with the SR-IOV PFs.
> > > > +		It returns a total number of possible to configure MSI-X
> > > > +		vectors on the enabled VFs.
> > > > +
> > > > +		The values returned are:
> > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > +		 * = 0 - feature is not supported
>
> Can you swap the order of these two files in the documentation?
> sriov_vf_total_msix is associated with the PF and logically precedes
> sriov_vf_msix_count, which is associated with VFs.

I'll do.

>
> Not sure the "= 0" description is necessary here.  If the value
> returned is the number of MSI-X vectors available for assignment to
> VFs, "0" is a perfectly legitimate value.  It just means there are
> none.  It doesn't need to be described separately.

I wanted to help users and remove ambiguity. For example, mlx5 drivers
will always implement proper .set_...() callbacks but for some devices
without needed FW support, the value will be 0. Instead of misleading
users with wrong promise that feature supported but doesn't have
available vectors, I decided to be more clear. For the users, 0 means, don't
try, it is not working.

>
> > > Do these filenames really need to contain both "sriov" and "vf"?
> >
> > This is what I was asked at some point. In previous versions the name
> > was without "sriov".
>
> We currently have:
>
>   $ grep DEVICE_ATTR drivers/pci/iov.c
>   static DEVICE_ATTR_RO(sriov_totalvfs);
>   static DEVICE_ATTR_RW(sriov_numvfs);
>   static DEVICE_ATTR_RO(sriov_offset);
>   static DEVICE_ATTR_RO(sriov_stride);
>   static DEVICE_ATTR_RO(sriov_vf_device);
>   static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
>
> If we put them in a new "vfs_overlay" directory, it seems like
> overkill to repeat the "vf" part, but I'm hoping the new files can end
> up next to these existing files.  In that case, I think it makes sense
> to include "sriov".  And it probably does make sense to include "vf"
> as well.

I put everything in folder to group any possible future extensions.
Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
separate folder.

This is the link with the request to change name:
https://lore.kernel.org/linux-pci/CAKgT0UcJrSNMPAOoniRSnUn+wyRUkL62AfgR3-8QbAkak=pQ=w@mail.gmail.com/

Also, the _vf_ in sriov_vf_total_msix symbolizes that we are talking
explicitly about the VFs and not the whole device/PF.

>
> > > Should these be next to the existing SR-IOV sysfs files, i.e., in or
> > > alongside sriov_dev_attr_group?
> >
> > This was suggestion in previous versions. I didn't hear anyone
> > supporting it.
>
> Pointer to this discussion?  I'm not sure what value is added by a new
> directory.

In the link below, Alex talks about creating PF-driven sysfs entries that are
clearly visible to users as something other than PCI config space writes. The
overlay folder achieves that.

https://lore.kernel.org/linux-pci/CAKgT0UeYb5xz8iehE1Y0s-cyFbsy46bjF83BkA7qWZMkAOLR-g@mail.gmail.com/

and this is the resolution that the proposed structure is OK:
https://lore.kernel.org/linux-pci/20210124190032.GD5038@unreal/

>
> > > Hmmm, I see pci_enable_vfs_overlay() is called by the driver.  I don't
> > > really like that because then we're dependent on drivers to maintain
> > > the PCI sysfs hierarchy.  E.g., a driver might forget to call
> > > pci_disable_vfs_overlay(), and then a future driver load will fail.
> > >
> > > Maybe this could be done with .is_visible() functions that call driver
> > > callbacks.
> >
> > It was in previous versions too, but this solution allows PF to control
> > the VFs overlay dynamically.
>
> See below; I think it might be possible to do this dynamically even
> without pci_enable_vfs_overlay().

The request was to have this overlay be PF-driven. It means that if the
PF driver is removed, the folder should disappear. I tried to do it
with .is_visible() and found myself adding hooks in the generic PCI remove
flow. It was far from clean.

>
> > > > +int pci_enable_vfs_overlay(struct pci_dev *dev)
> > > > +{
> > > > +	struct pci_dev *virtfn;
> > > > +	int id, ret;
> > > > +
> > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > +		return 0;
> > > > +
> > > > +	ret = sysfs_create_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	for (id = 0; id < dev->sriov->num_VFs; id++) {
> > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > +
> > > > +		if (!virtfn)
> > > > +			continue;
> > > > +
> > > > +		ret = sysfs_create_group(&virtfn->dev.kobj,
> > > > +					 &sriov_vf_dev_attr_group);
> > > > +		if (ret)
> > > > +			goto out;
> > > > +	}
> > > > +	return 0;
> > > > +
> > > > +out:
> > > > +	while (id--) {
> > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > +
> > > > +		if (!virtfn)
> > > > +			continue;
> > > > +
> > > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > > +	}
> > > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(pci_enable_vfs_overlay);
> > > > +
> > > > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > > > +{
> > > > +	struct pci_dev *virtfn;
> > > > +	int id;
> > > > +
> > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > +		return;
> > > > +
> > > > +	id = dev->sriov->num_VFs;
> > > > +	while (id--) {
> > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > +
> > > > +		if (!virtfn)
> > > > +			continue;
> > > > +
> > > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > > +	}
> > > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
> > >
> > > I'm not convinced all this sysfs wrangling is necessary.  If it is,
> > > add a hint in a comment about why this is special and can't use
> > > something like sriov_dev_attr_group.
> >
> > This makes the overlay to be PF-driven. Alexander insisted on this flow.
>
> If you're referring to [1], I think "insisted on this flow" might be
> an overstatement of what Alex was looking for.  IIUC Alex wants the
> sysfs files to be visible only when they're useful, i.e., when a
> driver implements ->sriov_set_msix_vec_count().

It is only one side of the request, the sysfs files shouldn't be visible
if PF driver was removed and visible again when its probed again.

>
> That seems reasonable and also seems like something a smarter
> .is_visible() function could accomplish without having drivers call
> pci_enable_vfs_overlay(), e.g., maybe some variation of this:
>
>   static umode_t sriov_vf_attrs_are_visible(...)
>   {
>     if (!pdev->msix_cap || dev_is_pf(dev))
>       return 0;
>
>     pf = pci_physfn(pdev);
>     if (pf->driver && pf->driver->sriov_set_msix_vec_count)
>       return a->mode;
>
>     return 0;
>   }

It doesn't work with the following flow:
1. load driver
2. disable autoprobe
3. echo to sriov_numvfs
.... <--- you have this sriov_vf_attrs_are_visible() created
4. unload driver
.... <--- sysfs still exists despite not having PF driver.
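
With the explicit pci_enable_vfs_overlay()/pci_disable_vfs_overlay() calls
from this patch, the files simply follow the PF driver instead. A rough
sketch of the PF driver side (call sites and function names here are
illustrative, not the actual mlx5 patch):

  static int mlx5_sriov_configure(struct pci_dev *pdev, int num_vfs)
  {
	int err;

	if (!num_vfs) {				/* "echo 0 > sriov_numvfs" */
		pci_disable_vfs_overlay(pdev);
		pci_disable_sriov(pdev);
		return 0;
	}

	err = pci_enable_sriov(pdev, num_vfs);	/* step 3 above */
	if (err)
		return err;

	/* creates <PF>/sriov_vf_total_msix and <VF>/sriov_vf_msix_count */
	err = pci_enable_vfs_overlay(pdev);
	if (err) {
		pci_disable_sriov(pdev);
		return err;
	}
	return num_vfs;
  }

  static void mlx5_pci_remove(struct pci_dev *pdev)
  {
	/* step 4 above: unloading the PF driver removes the files too */
	pci_disable_vfs_overlay(pdev);
	...
  }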

>
> (Probably needs locking while we look at pf->driver, just like in
> pci_vf_set_msix_vec_count().)
>
> pci_enable_vfs_overlay() significantly complicates the code and it's
> the sort of sysfs code we're trying to avoid, so if we really do need
> it, please add a brief comment about *why* we have to do it that way.

I will add.

>
> > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > > > +{
> > > > +	if (!dev->is_physfn)
> > > > +		return;
> > > > +
> > > > +	dev->sriov->vf_total_msix = count;
> > >
> > > The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> > > mlx5, calls this, and all the PCI core does is hang onto the value and
> > > expose it via sysfs.  I think I'd rather have a callback in struct
> > > pci_driver and let the driver supply the value when needed.  I.e.,
> > > sriov_vf_total_msix_show() would call the callback instead of looking
> > > at pdev->sriov->vf_total_msix.
> >
> > It will cause to unnecessary locking to ensure that driver doesn't
> > vanish during sysfs read. I can change, but don't think that it is right
> > decision.
>
> Doesn't sysfs already ensure the driver can't vanish while we're
> executing a DEVICE_ATTR accessor?

It is not, you can see it by adding device_lock_held() check in any
.show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't
do much. This is why pci_vf_set_msix_vec_count() has double lock.

>
> > > > +int pci_vf_set_msix_vec_count(struct pci_dev *dev, int count)
> > > > +{
> > > > +	struct pci_dev *pdev = pci_physfn(dev);
> > > > +	int ret;
> > > > +
> > > > +	if (count < 0)
> > > > +		/*
> > > > +		 * We don't support negative numbers for now,
> > > > +		 * but maybe in the future it will make sense.
> > > > +		 */
> > >
> > > Drop the comment; I don't think it adds useful information.
> > >
> > > > +		return -EINVAL;
> > > > +
> > > > +	device_lock(&pdev->dev);
> > > > +	if (!pdev->driver) {
> > > > +		ret = -EOPNOTSUPP;
> > > > +		goto err_pdev;
> > > > +	}
> > > > +
> > > > +	device_lock(&dev->dev);
> > > > +	if (dev->driver) {
> > > > +		/*
> > > > +		 * Driver already probed this VF and configured itself
> > > > +		 * based on previously configured (or default) MSI-X vector
> > > > +		 * count. It is too late to change this field for this
> > > > +		 * specific VF.
> > > > +		 */
> > > > +		ret = -EBUSY;
> > > > +		goto err_dev;
> > > > +	}
> > > > +
> > > > +	ret = pdev->driver->sriov_set_msix_vec_count(dev, count);
> > >
> > > This looks like a NULL pointer dereference.
> >
> > In current code, it is impossible, the call to pci_vf_set_msix_vec_count()
> > will be only for devices that supports sriov_vf_msix_count which is
> > possible with ->sriov_set_msix_vec_count() only.
>
> OK.  I think you're right, but it takes quite a lot of analysis to
> prove that right now.  If the .is_visible() function for
> sriov_vf_msix_count checked whether the driver implements
> ->sriov_set_msix_vec_count(), it would be quite a bit easier,
> and it might even help get rid of pci_enable_vfs_overlay().
>
> Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's
> no actual *reason* for it to be there other than the fact that it has
> "msix" in the name.  It uses no MSI data structures.  Maybe it could
> be folded into sriov_vf_msix_count_store(), which would make the
> analysis even easier.

I put _set_ command near _get_ command, but I can move it to iov.c

>
> > > > @@ -326,6 +327,9 @@ struct pci_sriov {
> > > >  	u16		subsystem_device; /* VF subsystem device */
> > > >  	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
> > > >  	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
> > > > +	u32		vf_total_msix;  /* Total number of MSI-X vectors the VFs
> > > > +					 * can consume
> > > > +					 */
> > >
> > >   * can consume */
> > >
> > > Hopefully you can eliminate vf_total_msix altogether.
> >
> > I think that will be worse from the flow point of view (extra locks)
> > and the memory if you are worried about it. This variable consumes 4
> > bytes, the pointer to the function will take 8 bytes.
>
> I'm not concerned about the space.  The function pointer is in struct
> pci_driver, whereas the variable is in struct pci_sriov, so the
> variable would likely consume more space because there are probably
> more VFs than pci_drivers.
>
> My bigger concern is that vf_total_msix is really *driver* state, not
> PCI core state, and I'd prefer if the PCI core were not responsible
> for it.

As I said above, I can do it, I just don't think that it is right.
Should I do it or not?

>
> > > > +++ b/include/linux/pci.h
> > > > @@ -856,6 +856,8 @@ struct module;
> > > >   *		e.g. drivers/net/e100.c.
> > > >   * @sriov_configure: Optional driver callback to allow configuration of
> > > >   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> > > > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > > > + *              exposed by the sysfs "vf_msix_vec" entry.
> > >
> > > "vf_msix_vec" is apparently stale?  There's no other reference in this
> > > patch.
> > >
> > > I think the important part is that this changes the number of vectors
> > > advertised in the VF's MSI-X Message Control register, which will be
> > > read when the VF driver enables MSI-X.
> > >
> > > If that's true, why would we expose this via a sysfs file?  We can
> > > easily read it via lspci.
> >
> > I did it for feature complete, we don't need read of this sysfs.
>
> If you don't need this sysfs file, just remove it.  If you do need it,
> please update the "vf_msix_vec" so it's correct.  Also, clarify the
> description so we can tell that we're changing the Table Size VFs will
> see in their Message Control registers.

I'll do

>
> > > > +static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
> > >
> > > Also here.  Unless removing the space would make this fit in 80
> > > columns.
> >
> > Yes, it is 80 columns without space.
>
> No, I don't think it is.  In an 80 column window, the result looks
> like:
>
>   static inline void pci_sriov_set_vf_total_msix(...) {
>   }
>
> Please change it so it looks like this so it matches the rest of the
> file:
>
>   static inline void pci_sriov_set_vf_total_msix(...)
>   { }

I counted back then, but sure will check again.

Thanks for your review.

>
> Bjorn
>
> [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-04 15:50         ` Leon Romanovsky
@ 2021-02-04 21:12           ` Bjorn Helgaas
  2021-02-05 17:35             ` Leon Romanovsky
  0 siblings, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2021-02-04 21:12 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:
> On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > Extend PCI sysfs interface with a new callback that allows
> > > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > > This is needed to optimize the performance of newly bound
> > > > > devices by allocating the number of vectors based on the
> > > > > administrator knowledge of targeted VM.
> > > >
> > > > I'm reading between the lines here, but IIUC the point is that you
> > > > have a PF that supports a finite number of MSI-X vectors for use
> > > > by all the VFs, and this interface is to control the distribution
> > > > of those MSI-X vectors among the VFs.

> > This commit log should describe how *this* device manages this
> > allocation and how the PF Table Size and the VF Table Sizes are
> > related.  Per PCIe, there is no necessary connection between them.
> 
> There is no connection in mlx5 devices either. PF is used as a vehicle
> to access VF that doesn't have driver yet. From "table size" perspective
> they completely independent, because PF already probed by driver and
> it is already too late to change it.
> 
> So PF table size is static and can be changed through FW utility only.

This is where description of the device would be useful.  

The fact that you need "sriov_vf_total_msix" to advertise how many
vectors are available and "sriov_vf_msix_count" to influence how they
are distributed across the VFs suggests that these Table Sizes are not
completely independent.

Can a VF have a bigger Table Size than the PF does?  Can all the VF
Table Sizes added together be bigger than the PF Table Size?  If VF A
has a larger Table Size, does that mean VF B must have a smaller Table
Size?

Obviously I do not understand the details about how this device works.
It would be helpful to have those details here.

Here's the sequence as I understand it:

  1) PF driver binds to PF
  2) PF driver enables VFs
  3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
  4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
  5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
  6) VF driver binds to VF 
  7) VF reads MSI-X Message Control (Table Size)

Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
different Table Sizes?  That would be a little weird.

I'm also a little concerned about doing 2 before 3 & 4.  That works
for mlx5 because implements the Table Size adjustment in a way that
works *after* the VFs have been enabled.

But it seems conceivable that a device could implement vector
distribution in a way that would require the VF Table Sizes to be
fixed *before* enabling VFs.  That would be nice in the sense that the
VFs would be created "fully formed" and the VF Table Size would be
completely read-only as documented.

The other knob idea you mentioned at [2]:

  echo "0000:01:00.2 123" > sriov_vf_msix_count

would have the advantage of working for both cases.  That's definitely
more complicated, but at the same time, I would hate to carve a sysfs
interface into stone if it might not work for other devices.

> > > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > > +Date:		January 2021
> > > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > > +Description:
> > > > > +		This file is associated with the SR-IOV PFs.
> > > > > +		It returns a total number of possible to configure MSI-X
> > > > > +		vectors on the enabled VFs.
> > > > > +
> > > > > +		The values returned are:
> > > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > > +		 * = 0 - feature is not supported
> 
> > Not sure the "= 0" description is necessary here.  If the value
> > returned is the number of MSI-X vectors available for assignment to
> > VFs, "0" is a perfectly legitimate value.  It just means there are
> > none.  It doesn't need to be described separately.
> 
> I wanted to help users and remove ambiguity. For example, mlx5 drivers
> will always implement proper .set_...() callbacks but for some devices
> without needed FW support, the value will be 0. Instead of misleading
> users with wrong promise that feature supported but doesn't have
> available vectors, I decided to be more clear. For the users, 0 means, don't
> try, it is not working.

Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
devices"?  I totally missed that; I thought you meant "not supported
by the PF driver."  Why not have the PF driver detect that the
firmware doesn't support the feature and just not expose the sysfs
files at all in that case?

> > If we put them in a new "vfs_overlay" directory, it seems like
> > overkill to repeat the "vf" part, but I'm hoping the new files can end
> > up next to these existing files.  In that case, I think it makes sense
> > to include "sriov".  And it probably does make sense to include "vf"
> > as well.
> 
> I put everything in folder to group any possible future extensions.
> Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
> separate folder.

I'm not convinced (yet) that the possibility of future extensions is
enough justification for adding the "vfs_overlay" directory.  It
really complicates the code flow -- if we skipped the new directory,
I'm pretty sure we could make .is_visible() work, which would be a
major simplification.

And there's quite a bit of value in the new files being right next to
the existing sriov_* files.

> > > > > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > > > > +{
> > > > > +	struct pci_dev *virtfn;
> > > > > +	int id;
> > > > > +
> > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > +		return;
> > > > > +
> > > > > +	id = dev->sriov->num_VFs;
> > > > > +	while (id--) {
> > > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > > +
> > > > > +		if (!virtfn)
> > > > > +			continue;
> > > > > +
> > > > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > > > +	}
> > > > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
> > > >
> > > > I'm not convinced all this sysfs wrangling is necessary.  If it is,
> > > > add a hint in a comment about why this is special and can't use
> > > > something like sriov_dev_attr_group.
> > >
> > > This makes the overlay to be PF-driven. Alexander insisted on this flow.
> >
> > If you're referring to [1], I think "insisted on this flow" might be
> > an overstatement of what Alex was looking for.  IIUC Alex wants the
> > sysfs files to be visible only when they're useful, i.e., when a
> > driver implements ->sriov_set_msix_vec_count().
> 
> It is only one side of the request, the sysfs files shouldn't be visible
> if PF driver was removed and visible again when its probed again.

I can't parse this, but it's probably related to the question below.

> > That seems reasonable and also seems like something a smarter
> > .is_visible() function could accomplish without having drivers call
> > pci_enable_vfs_overlay(), e.g., maybe some variation of this:
> >
> >   static umode_t sriov_vf_attrs_are_visible(...)
> >   {
> >     if (!pdev->msix_cap || dev_is_pf(dev))
> >       return 0;
> >
> >     pf = pci_physfn(pdev);
> >     if (pf->driver && pf->driver->sriov_set_msix_vec_count)
> >       return a->mode;
> >
> >     return 0;
> >   }
> 
> It doesn't work with the following flow:
> 1. load driver
> 2. disable autoprobe
> 3. echo to sriov_numvfs
> .... <--- you have this sriov_vf_attrs_are_visible() created
> 4. unload driver
> .... <--- sysfs still exists despite not having PF driver.

I missed your point here, sorry.  After unloading the PF driver,
"pf->driver" in the sketch above will be NULL, so the VF sysfs file
would not be visible.  Right?  Maybe it has to do with autoprobe?  I
didn't catch what the significance of disabling autoprobe was.

> > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > > > > +{
> > > > > +	if (!dev->is_physfn)
> > > > > +		return;
> > > > > +
> > > > > +	dev->sriov->vf_total_msix = count;
> > > >
> > > > The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> > > > mlx5, calls this, and all the PCI core does is hang onto the value and
> > > > expose it via sysfs.  I think I'd rather have a callback in struct
> > > > pci_driver and let the driver supply the value when needed.  I.e.,
> > > > sriov_vf_total_msix_show() would call the callback instead of looking
> > > > at pdev->sriov->vf_total_msix.
> > >
> > > It will cause to unnecessary locking to ensure that driver doesn't
> > > vanish during sysfs read. I can change, but don't think that it is right
> > > decision.
> >
> > Doesn't sysfs already ensure the driver can't vanish while we're
> > executing a DEVICE_ATTR accessor?
> 
> It is not, you can see it by adding device_lock_held() check in any
> .show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't
> do much. This is why pci_vf_set_msix_vec_count() has double lock.

Aahh, right, I learned something today, thanks!  There are only a few
PCI sysfs attributes that reference the driver, and they do their own
locking.

I do think vf_total_msix is a bit of driver state related to
implementing this functionality and doesn't need to be in the PCI
core.  I think device locking is acceptable; it's very similar to what
is done in sriov_numvfs_store().  Doing the locking and calling a
driver callback makes it obvious that vf_total_msix is part of this PF
driver-specific functionality, not a generic part of the PCI core.

So let's give it a try.  If it turns out to be terrible, we can
revisit it.
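
Something like this is what I'm picturing for the PF side, as a sketch only
(the ->sriov_get_vf_total_msix() op doesn't exist in this version; the name
is made up):

  static ssize_t sriov_vf_total_msix_show(struct device *dev,
					  struct device_attribute *attr,
					  char *buf)
  {
	struct pci_dev *pdev = to_pci_dev(dev);
	u32 vf_total_msix = 0;

	device_lock(dev);
	if (pdev->driver && pdev->driver->sriov_get_vf_total_msix)
		vf_total_msix = pdev->driver->sriov_get_vf_total_msix(pdev);
	device_unlock(dev);

	return sysfs_emit(buf, "%u\n", vf_total_msix);
  }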

> > Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's
> > no actual *reason* for it to be there other than the fact that it has
> > "msix" in the name.  It uses no MSI data structures.  Maybe it could
> > be folded into sriov_vf_msix_count_store(), which would make the
> > analysis even easier.
> 
> I put _set_ command near _get_ command, but I can move it to iov.c

You mean you put pci_vf_set_msix_vec_count() near
pci_msix_vec_count()?  That's *true*, but they are not analogues, and
one is not setting the value returned by the other.

pci_vf_set_msix_vec_count() is a completely magical thing that uses a
device-specific mechanism on a PF that happens to change what
pci_msix_vec_count() on a VF will return later.  I think this is more
related to SR-IOV than it is to MSI.

Bjorn

> > [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com

[2] https://lore.kernel.org/linux-pci/20210118132800.GA4835@unreal/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-04 21:12           ` Bjorn Helgaas
@ 2021-02-05 17:35             ` Leon Romanovsky
  2021-02-05 22:57               ` Bjorn Helgaas
  0 siblings, 1 reply; 16+ messages in thread
From: Leon Romanovsky @ 2021-02-05 17:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote:
> On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:
> > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > >
> > > > > > Extend PCI sysfs interface with a new callback that allows
> > > > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > > > This is needed to optimize the performance of newly bound
> > > > > > devices by allocating the number of vectors based on the
> > > > > > administrator knowledge of targeted VM.
> > > > >
> > > > > I'm reading between the lines here, but IIUC the point is that you
> > > > > have a PF that supports a finite number of MSI-X vectors for use
> > > > > by all the VFs, and this interface is to control the distribution
> > > > > of those MSI-X vectors among the VFs.
>
> > > This commit log should describe how *this* device manages this
> > > allocation and how the PF Table Size and the VF Table Sizes are
> > > related.  Per PCIe, there is no necessary connection between them.
> >
> > There is no connection in mlx5 devices either. PF is used as a vehicle
> > to access VF that doesn't have driver yet. From "table size" perspective
> > they completely independent, because PF already probed by driver and
> > it is already too late to change it.
> >
> > So PF table size is static and can be changed through FW utility only.
>
> This is where description of the device would be useful.
>
> The fact that you need "sriov_vf_total_msix" to advertise how many
> vectors are available and "sriov_vf_msix_count" to influence how they
> are distributed across the VFs suggests that these Table Sizes are not
> completely independent.
>
> Can a VF have a bigger Table Size than the PF does?  Can all the VF
> Table Sizes added together be bigger than the PF Table Size?  If VF A
> has a larger Table Size, does that mean VF B must have a smaller Table
> Size?

VFs are completely independent devices and their table size can be
bigger than PF. FW has two pools, one for PF and another for all VFs.
In real world scenarios, every VF will have more MSI-X vectors than PF,
which will be distributed by orchestration software.

In theory, users can assign almost all vectors to one VF and leave the
others depleted. In practice, it is not possible: we ensure that all VFs
start with some sensible number of vectors, and FW rejects writes that
exceed the maximum vector size.

>
> Obviously I do not understand the details about how this device works.
> It would be helpful to have those details here.
>
> Here's the sequence as I understand it:
>
>   1) PF driver binds to PF
>   2) PF driver enables VFs
>   3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
>   4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
>   5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
>   6) VF driver binds to VF
>   7) VF reads MSI-X Message Control (Table Size)
>
> Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
> different Table Sizes?  That would be a little weird.

Yes, this is the flow. I think differently from you and think this
is actual good thing that user writes new msix count and it is shown
immediately.

>
> I'm also a little concerned about doing 2 before 3 & 4.  That works
> for mlx5 because implements the Table Size adjustment in a way that
> works *after* the VFs have been enabled.

It is not related to mlx5, but to the PCI spec that requires us to
create all VFs at the same time. Before enabling VFs, they don't
exist.

>
> But it seems conceivable that a device could implement vector
> distribution in a way that would require the VF Table Sizes to be
> fixed *before* enabling VFs.  That would be nice in the sense that the
> VFs would be created "fully formed" and the VF Table Size would be
> completely read-only as documented.

It is not how SR-IOV is used in real world. Cloud providers create many
VFs but don't assign those VFs yet. They do it after customer request
VM with specific properties (number of CPUs) which means number of MSI-X
vectors in our case.

All this is done when other VFs already in use and we can't destroy and
recreate them at that point of time.

>
> The other knob idea you mentioned at [2]:
>
>   echo "0000:01:00.2 123" > sriov_vf_msix_count

This knob doesn't always work if you have many devices, and it is a
nightmare to guess the "new" VF BDF before it is claimed.

As an example, on my machine with two devices, the VFs are created differently:
[root@c ~]$ lspci |grep nox
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
[root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs
[root@c ~]# lspci |grep nox
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:01.2 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
08:01.3 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
[root@c ~]# echo 0 > /sys/bus/pci/devices/0000\:08\:00.1/sriov_numvfs
[root@c ~]# echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
[root@c ~]# lspci |grep nox
08:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
08:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]

>
> would have the advantage of working for both cases.  That's definitely
> more complicated, but at the same time, I would hate to carve a sysfs
> interface into stone if it might not work for other devices.

As you can see above, it does not really solve the pre-creation configuration
flow. For such flows, devlink is probably the best choice.

>
> > > > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > > > +Date:		January 2021
> > > > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > > > +Description:
> > > > > > +		This file is associated with the SR-IOV PFs.
> > > > > > +		It returns a total number of possible to configure MSI-X
> > > > > > +		vectors on the enabled VFs.
> > > > > > +
> > > > > > +		The values returned are:
> > > > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > > > +		 * = 0 - feature is not supported
> >
> > > Not sure the "= 0" description is necessary here.  If the value
> > > returned is the number of MSI-X vectors available for assignment to
> > > VFs, "0" is a perfectly legitimate value.  It just means there are
> > > none.  It doesn't need to be described separately.
> >
> > I wanted to help users and remove ambiguity. For example, mlx5 drivers
> > will always implement proper .set_...() callbacks but for some devices
> > without needed FW support, the value will be 0. Instead of misleading
> > users with wrong promise that feature supported but doesn't have
> > available vectors, I decided to be more clear. For the users, 0 means, don't
> > try, it is not working.
>
> Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
> devices"?  I totally missed that; I thought you meant "not supported
> by the PF driver."  Why not have the PF driver detect that the
> firmware doesn't support the feature and just not expose the sysfs
> files at all in that case?

I can do it and will do it, but this behavior will be possible because
mlx5 is flexible enough. I imagine that other device won't be so flexible,
for example they will decide to "close" this feature because of other
SW limitation.

>
> > > If we put them in a new "vfs_overlay" directory, it seems like
> > > overkill to repeat the "vf" part, but I'm hoping the new files can end
> > > up next to these existing files.  In that case, I think it makes sense
> > > to include "sriov".  And it probably does make sense to include "vf"
> > > as well.
> >
> > I put everything in folder to group any possible future extensions.
> > Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
> > separate folder.
>
> I'm not convinced (yet) that the possibility of future extensions is
> enough justification for adding the "vfs_overlay" directory.  It
> really complicates the code flow -- if we skipped the new directory,
> I'm pretty sure we could make .is_visible() work, which would be a
> major simplification.

Unfortunately not, is_visible() is not dynamic and called when device
creates sysfs and it doesn't rescan it or refresh them. It means that
all files from vfs_overlay folder will exist even driver is removed
from PF.

Also sysfs is created before driver is probed.

>
> And there's quite a bit of value in the new files being right next to
> the existing sriov_* files.

I have no problem to drop folder, just need clear request, should I.

>
> > > > > > +void pci_disable_vfs_overlay(struct pci_dev *dev)
> > > > > > +{
> > > > > > +	struct pci_dev *virtfn;
> > > > > > +	int id;
> > > > > > +
> > > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > > +		return;
> > > > > > +
> > > > > > +	id = dev->sriov->num_VFs;
> > > > > > +	while (id--) {
> > > > > > +		virtfn = pci_get_domain_bus_and_slot(
> > > > > > +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> > > > > > +			pci_iov_virtfn_devfn(dev, id));
> > > > > > +
> > > > > > +		if (!virtfn)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		sysfs_remove_group(&virtfn->dev.kobj, &sriov_vf_dev_attr_group);
> > > > > > +	}
> > > > > > +	sysfs_remove_group(&dev->dev.kobj, &sriov_pf_dev_attr_group);
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(pci_disable_vfs_overlay);
> > > > >
> > > > > I'm not convinced all this sysfs wrangling is necessary.  If it is,
> > > > > add a hint in a comment about why this is special and can't use
> > > > > something like sriov_dev_attr_group.
> > > >
> > > > This makes the overlay to be PF-driven. Alexander insisted on this flow.
> > >
> > > If you're referring to [1], I think "insisted on this flow" might be
> > > an overstatement of what Alex was looking for.  IIUC Alex wants the
> > > sysfs files to be visible only when they're useful, i.e., when a
> > > driver implements ->sriov_set_msix_vec_count().
> >
> > It is only one side of the request, the sysfs files shouldn't be visible
> > if PF driver was removed and visible again when its probed again.
>
> I can't parse this, but it's probably related to the question below.
>
> > > That seems reasonable and also seems like something a smarter
> > > .is_visible() function could accomplish without having drivers call
> > > pci_enable_vfs_overlay(), e.g., maybe some variation of this:
> > >
> > >   static umode_t sriov_vf_attrs_are_visible(...)
> > >   {
> > >     if (!pdev->msix_cap || dev_is_pf(dev))
> > >       return 0;
> > >
> > >     pf = pci_physfn(pdev);
> > >     if (pf->driver && pf->driver->sriov_set_msix_vec_count)
> > >       return a->mode;
> > >
> > >     return 0;
> > >   }
> >
> > It doesn't work with the following flow:
> > 1. load driver
> > 2. disable autoprobe
> > 3. echo to sriov_numvfs
> > .... <--- you have this sriov_vf_attrs_are_visible() created
> > 4. unload driver
> > .... <--- sysfs still exists despite not having PF driver.
>
> I missed your point here, sorry.  After unloading the PF driver,
> "pf->driver" in the sketch above will be NULL, so the VF sysfs file
> would not be visible.  Right?  Maybe it has to do with autoprobe?  I
> didn't catch what the significance of disabling autoprobe was.

No, is_visible() is called on file creation only, see fs/sysfs/group.c:
create_files(). After you remove driver, there is no sysfs file refresh.

>
> > > > > > +void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count)
> > > > > > +{
> > > > > > +	if (!dev->is_physfn)
> > > > > > +		return;
> > > > > > +
> > > > > > +	dev->sriov->vf_total_msix = count;
> > > > >
> > > > > The PCI core doesn't use vf_total_msix at all.  The driver, e.g.,
> > > > > mlx5, calls this, and all the PCI core does is hang onto the value and
> > > > > expose it via sysfs.  I think I'd rather have a callback in struct
> > > > > pci_driver and let the driver supply the value when needed.  I.e.,
> > > > > sriov_vf_total_msix_show() would call the callback instead of looking
> > > > > at pdev->sriov->vf_total_msix.
> > > >
> > > > It will cause to unnecessary locking to ensure that driver doesn't
> > > > vanish during sysfs read. I can change, but don't think that it is right
> > > > decision.
> > >
> > > Doesn't sysfs already ensure the driver can't vanish while we're
> > > executing a DEVICE_ATTR accessor?
> >
> > It is not, you can see it by adding device_lock_held() check in any
> > .show attribute. See drivers/base/core.c: dev_attr_show(), it doesn't
> > do much. This is why pci_vf_set_msix_vec_count() has double lock.
>
> Aahh, right, I learned something today, thanks!  There are only a few
> PCI sysfs attributes that reference the driver, and they do their own
> locking.
>
> I do think vf_total_msix is a bit of driver state related to
> implementing this functionality and doesn't need to be in the PCI
> core.  I think device locking is acceptable; it's very similar to what
> is done in sriov_numvfs_store().  Doing the locking and calling a
> driver callback makes it obvious that vf_total_msix is part of this PF
> driver-specific functionality, not a generic part of the PCI core.
>
> So let's give it a try.  If it turns out to be terrible, we can
> revisit it.

No problem.

>
> > > Also, pci_vf_set_msix_vec_count() is in pci/msi.c, but AFAICT there's
> > > no actual *reason* for it to be there other than the fact that it has
> > > "msix" in the name.  It uses no MSI data structures.  Maybe it could
> > > be folded into sriov_vf_msix_count_store(), which would make the
> > > analysis even easier.
> >
> > I put _set_ command near _get_ command, but I can move it to iov.c
>
> You mean you put pci_vf_set_msix_vec_count() near
> pci_msix_vec_count()?  That's *true*, but they are not analogues, and
> one is not setting the value returned by the other.
>
> pci_vf_set_msix_vec_count() is a completely magical thing that uses a
> device-specific mechanism on a PF that happens to change what
> pci_msix_vec_count() on a VF will return later.  I think this is more
> related to SR-IOV than it is to MSI.

I will do.

>
> Bjorn
>
> > > [1] https://lore.kernel.org/r/CAKgT0UcJQ3uy6J_CCLizDLfzGL2saa_PjOYH4nK+RQjfmpNA=w@mail.gmail.com
>
> [2] https://lore.kernel.org/linux-pci/20210118132800.GA4835@unreal/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-05 17:35             ` Leon Romanovsky
@ 2021-02-05 22:57               ` Bjorn Helgaas
  2021-02-06 12:42                 ` Leon Romanovsky
  0 siblings, 1 reply; 16+ messages in thread
From: Bjorn Helgaas @ 2021-02-05 22:57 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Fri, Feb 05, 2021 at 07:35:47PM +0200, Leon Romanovsky wrote:
> On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote:
> > On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:
> > > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> > > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > >
> > > > > > > Extend PCI sysfs interface with a new callback that allows
> > > > > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > > > > This is needed to optimize the performance of newly bound
> > > > > > > devices by allocating the number of vectors based on the
> > > > > > > administrator knowledge of targeted VM.
> > > > > >
> > > > > > I'm reading between the lines here, but IIUC the point is that you
> > > > > > have a PF that supports a finite number of MSI-X vectors for use
> > > > > > by all the VFs, and this interface is to control the distribution
> > > > > > of those MSI-X vectors among the VFs.
> >
> > > > This commit log should describe how *this* device manages this
> > > > allocation and how the PF Table Size and the VF Table Sizes are
> > > > related.  Per PCIe, there is no necessary connection between them.
> > >
> > > There is no connection in mlx5 devices either. PF is used as a vehicle
> > > to access VF that doesn't have driver yet. From "table size" perspective
> > > they completely independent, because PF already probed by driver and
> > > it is already too late to change it.
> > >
> > > So PF table size is static and can be changed through FW utility only.
> >
> > This is where description of the device would be useful.
> >
> > The fact that you need "sriov_vf_total_msix" to advertise how many
> > vectors are available and "sriov_vf_msix_count" to influence how they
> > are distributed across the VFs suggests that these Table Sizes are not
> > completely independent.
> >
> > Can a VF have a bigger Table Size than the PF does?  Can all the VF
> > Table Sizes added together be bigger than the PF Table Size?  If VF A
> > has a larger Table Size, does that mean VF B must have a smaller Table
> > Size?
> 
> VFs are completely independent devices and their table size can be
> bigger than PF. FW has two pools, one for PF and another for all VFs.
> In real world scenarios, every VF will have more MSI-X vectors than PF,
> which will be distributed by orchestration software.

Well, if the sum of all the VF Table Sizes cannot exceed the size of
the FW pool for VFs, I would say the VFs are not completely
independent.  Increasing the Table Size of one VF reduces it for other
VFs.

This is an essential detail because it's the whole reason behind this
interface, so sketching this out in the commit log will make this much
easier to understand.

> > Here's the sequence as I understand it:
> >
> >   1) PF driver binds to PF
> >   2) PF driver enables VFs
> >   3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
> >   4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
> >   5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
> >   6) VF driver binds to VF
> >   7) VF reads MSI-X Message Control (Table Size)
> >
> > Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
> > different Table Sizes?  That would be a little weird.
> 
> Yes, this is the flow. I think differently from you and think this
> is actual good thing that user writes new msix count and it is shown
> immediately.

Only weird because per spec Table Size is read-only and in this
scenario it changes, so it may be surprising, but probably not a huge
deal.

> > I'm also a little concerned about doing 2 before 3 & 4.  That works
> > for mlx5 because implements the Table Size adjustment in a way that
> > works *after* the VFs have been enabled.
> 
> It is not related to mlx5, but to the PCI spec that requires us to
> create all VFs at the same time. Before enabling VFs, they don't
> exist.

Yes.  I can imagine a PF driver that collects characteristics for the
desired VFs before enabling them, sort of like we already collect the
*number* of VFs.  But I think your argument for dynamic changes after
creation below is pretty compelling.

> > But it seems conceivable that a device could implement vector
> > distribution in a way that would require the VF Table Sizes to be
> > fixed *before* enabling VFs.  That would be nice in the sense that the
> > VFs would be created "fully formed" and the VF Table Size would be
> > completely read-only as documented.
> 
> It is not how SR-IOV is used in real world. Cloud providers create many
> VFs but don't assign those VFs yet. They do it after customer request
> VM with specific properties (number of CPUs) which means number of MSI-X
> vectors in our case.
> 
> All this is done when other VFs already in use and we can't destroy and
> recreate them at that point of time.

Makes sense, thanks for this insight.  The need to change the MSI-X
Table Size dynamically while other VFs are in use would also be useful
in the commit log because it really helps motivate this design.

> > > > > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > > > > +Date:		January 2021
> > > > > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > > > > +Description:
> > > > > > > +		This file is associated with the SR-IOV PFs.
> > > > > > > +		It returns a total number of possible to configure MSI-X
> > > > > > > +		vectors on the enabled VFs.
> > > > > > > +
> > > > > > > +		The values returned are:
> > > > > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > > > > +		 * = 0 - feature is not supported
> > >
> > > > Not sure the "= 0" description is necessary here.  If the value
> > > > returned is the number of MSI-X vectors available for assignment to
> > > > VFs, "0" is a perfectly legitimate value.  It just means there are
> > > > none.  It doesn't need to be described separately.
> > >
> > > I wanted to help users and remove ambiguity. For example, mlx5 drivers
> > > will always implement proper .set_...() callbacks but for some devices
> > > without needed FW support, the value will be 0. Instead of misleading
> > > users with wrong promise that feature supported but doesn't have
> > > available vectors, I decided to be more clear. For the users, 0 means, don't
> > > try, it is not working.
> >
> > Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
> > devices"?  I totally missed that; I thought you meant "not supported
> > by the PF driver."  Why not have the PF driver detect that the
> > firmware doesn't support the feature and just not expose the sysfs
> > files at all in that case?
> 
> I can do it and will do it, but this behavior will be possible because
> mlx5 is flexible enough. I imagine that other device won't be so flexible,
> for example they will decide to "close" this feature because of other
> SW limitation.

What about something like this?

  This file contains the total number of MSI-X vectors available for
  assignment to all VFs associated with this PF.  It may be zero if
  the device doesn't support this functionality.

Side question: How does user-space use this?  This file contains a
constant, and there's no interface to learn how many vectors are still
available in the pool, right?

I guess the user just has to manage the values written to
.../VF<n>/sriov_vf_msix_count so the sum of all of them never exceeds
sriov_vf_total_msix?  If that user app crashes, I guess it can
reconstruct this state by reading all the
.../VF<n>/sriov_vf_msix_count files?
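
If so, the accounting lives entirely in the management app. A rough
user-space sketch of that reconstruction (the folder-less path below is an
assumption based on this discussion, and error handling is minimal):

  #include <glob.h>
  #include <stdio.h>

  /* Sum the MSI-X vectors already assigned to a PF's VFs, e.g. for
   * pf = "0000:08:00.0"; "total - used" is then what is still available.
   */
  unsigned int vf_msix_in_use(const char *pf)
  {
	char pattern[256];
	unsigned int sum = 0, val;
	glob_t g;
	size_t i;
	FILE *f;

	snprintf(pattern, sizeof(pattern),
		 "/sys/bus/pci/devices/%s/virtfn*/sriov_vf_msix_count", pf);
	if (glob(pattern, 0, NULL, &g))
		return 0;

	for (i = 0; i < g.gl_pathc; i++) {
		f = fopen(g.gl_pathv[i], "r");
		if (!f)
			continue;
		if (fscanf(f, "%u", &val) == 1)
			sum += val;
		fclose(f);
	}
	globfree(&g);
	return sum;
  }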

> > > > If we put them in a new "vfs_overlay" directory, it seems like
> > > > overkill to repeat the "vf" part, but I'm hoping the new files can end
> > > > up next to these existing files.  In that case, I think it makes sense
> > > > to include "sriov".  And it probably does make sense to include "vf"
> > > > as well.
> > >
> > > I put everything in folder to group any possible future extensions.
> > > Those extensions are applicable to SR-IOV VFs only, IMHO they deserve
> > > separate folder.
> >
> > I'm not convinced (yet) that the possibility of future extensions is
> > enough justification for adding the "vfs_overlay" directory.  It
> > really complicates the code flow -- if we skipped the new directory,
> > I'm pretty sure we could make .is_visible() work, which would be a
> > major simplification.
> 
> Unfortunately not, is_visible() is not dynamic and called when device
> creates sysfs and it doesn't rescan it or refresh them. It means that
> all files from vfs_overlay folder will exist even driver is removed
> from PF.
> 
> Also sysfs is created before driver is probed.

Ah, both excellent points, that makes this much clearer to me, thank
you!

> > And there's quite a bit of value in the new files being right next to
> > the existing sriov_* files.
> 
> I have no problem to drop folder, just need clear request, should I.

I think we should drop the folder.  I missed the previous discussion
though, so if somebody has a good objection to dropping it, let me
know.

Dropping the folder has the potential advantage that we *could* decide
to expose these files always, and just have them be non-functional if
the device doesn't support assignment or the PF driver isn't bound.
Then I think we could use static sysfs attributes without
pci_enable_vfs_overlay().
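
Roughly what I have in mind, as a sketch only; it just folds the checks this
patch already does in pci_vf_set_msix_vec_count() into a static attribute:

  static ssize_t sriov_vf_msix_count_store(struct device *dev,
					   struct device_attribute *attr,
					   const char *buf, size_t count)
  {
	struct pci_dev *vf_dev = to_pci_dev(dev);
	struct pci_dev *pdev = pci_physfn(vf_dev);
	int val, ret;

	ret = kstrtoint(buf, 0, &val);
	if (ret)
		return ret;
	if (val < 0)
		return -EINVAL;

	device_lock(&pdev->dev);
	if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
		ret = -EOPNOTSUPP;	/* no PF driver or no device support */
		goto err_pdev;
	}

	device_lock(&vf_dev->dev);
	if (vf_dev->driver) {
		ret = -EBUSY;		/* VF already probed, too late */
		goto err_dev;
	}

	ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);

err_dev:
	device_unlock(&vf_dev->dev);
err_pdev:
	device_unlock(&pdev->dev);
	return ret ? ret : count;
  }
  static DEVICE_ATTR_WO(sriov_vf_msix_count);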

Bjorn

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-05 22:57               ` Bjorn Helgaas
@ 2021-02-06 12:42                 ` Leon Romanovsky
  0 siblings, 0 replies; 16+ messages in thread
From: Leon Romanovsky @ 2021-02-06 12:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Fri, Feb 05, 2021 at 04:57:03PM -0600, Bjorn Helgaas wrote:
> On Fri, Feb 05, 2021 at 07:35:47PM +0200, Leon Romanovsky wrote:
> > On Thu, Feb 04, 2021 at 03:12:12PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Feb 04, 2021 at 05:50:48PM +0200, Leon Romanovsky wrote:
> > > > On Wed, Feb 03, 2021 at 06:10:39PM -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Feb 02, 2021 at 09:44:29PM +0200, Leon Romanovsky wrote:
> > > > > > On Tue, Feb 02, 2021 at 12:06:09PM -0600, Bjorn Helgaas wrote:
> > > > > > > On Tue, Jan 26, 2021 at 10:57:27AM +0200, Leon Romanovsky wrote:
> > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > >
> > > > > > > > Extend PCI sysfs interface with a new callback that allows
> > > > > > > > configure the number of MSI-X vectors for specific SR-IO VF.
> > > > > > > > This is needed to optimize the performance of newly bound
> > > > > > > > devices by allocating the number of vectors based on the
> > > > > > > > administrator knowledge of targeted VM.
> > > > > > >
> > > > > > > I'm reading between the lines here, but IIUC the point is that you
> > > > > > > have a PF that supports a finite number of MSI-X vectors for use
> > > > > > > by all the VFs, and this interface is to control the distribution
> > > > > > > of those MSI-X vectors among the VFs.
> > >
> > > > > This commit log should describe how *this* device manages this
> > > > > allocation and how the PF Table Size and the VF Table Sizes are
> > > > > related.  Per PCIe, there is no necessary connection between them.
> > > >
> > > > There is no connection in mlx5 devices either. The PF is used as a
> > > > vehicle to access a VF that doesn't have a driver yet. From the
> > > > "table size" perspective they are completely independent, because the
> > > > PF has already been probed by its driver and it is too late to change
> > > > it.
> > > >
> > > > So the PF table size is static and can be changed only through a FW
> > > > utility.
> > >
> > > This is where description of the device would be useful.
> > >
> > > The fact that you need "sriov_vf_total_msix" to advertise how many
> > > vectors are available and "sriov_vf_msix_count" to influence how they
> > > are distributed across the VFs suggests that these Table Sizes are not
> > > completely independent.
> > >
> > > Can a VF have a bigger Table Size than the PF does?  Can all the VF
> > > Table Sizes added together be bigger than the PF Table Size?  If VF A
> > > has a larger Table Size, does that mean VF B must have a smaller Table
> > > Size?
> >
> > VFs are completely independent devices and their table size can be
> > bigger than the PF's. FW has two pools, one for the PF and another for
> > all VFs. In real-world scenarios, every VF will have more MSI-X vectors
> > than the PF, and they will be distributed by orchestration software.
>
> Well, if the sum of all the VF Table Sizes cannot exceed the size of
> the FW pool for VFs, I would say the VFs are not completely
> independent.  Increasing the Table Size of one VF reduces it for other
> VFs.
>
> This is an essential detail because it's the whole reason behind this
> interface, so sketching this out in the commit log will make this much
> easier to understand.

I think that it is already written, but I will recheck.

>
> > > Here's the sequence as I understand it:
> > >
> > >   1) PF driver binds to PF
> > >   2) PF driver enables VFs
> > >   3) PF driver creates /sys/.../<PF>/sriov_vf_total_msix
> > >   4) PF driver creates /sys/.../<VFn>/sriov_vf_msix_count for each VF
> > >   5) Management app reads sriov_vf_total_msix, writes sriov_vf_msix_count
> > >   6) VF driver binds to VF
> > >   7) VF reads MSI-X Message Control (Table Size)
> > >
> > > Is it true that "lspci VF" at 4.1 and "lspci VF" at 5.1 may read
> > > different Table Sizes?  That would be a little weird.
> >
> > Yes, this is the flow. I see it differently from you: I think it is
> > actually a good thing that the user writes a new msix count and it is
> > shown immediately.
>
> Only weird because per spec Table Size is read-only and in this
> scenario it changes, so it may be surprising, but probably not a huge
> deal.
>
> > > I'm also a little concerned about doing 2 before 3 & 4.  That works
> > > for mlx5 because it implements the Table Size adjustment in a way that
> > > works *after* the VFs have been enabled.
> >
> > It is not related to mlx5, but to the PCI spec that requires us to
> > create all VFs at the same time. Before enabling VFs, they don't
> > exist.
>
> Yes.  I can imagine a PF driver that collects characteristics for the
> desired VFs before enabling them, sort of like we already collect the
> *number* of VFs.  But I think your argument for dynamic changes after
> creation below is pretty compelling.

IMHO, the best tool for such pre-configured changes would be devlink.

>
> > > But it seems conceivable that a device could implement vector
> > > distribution in a way that would require the VF Table Sizes to be
> > > fixed *before* enabling VFs.  That would be nice in the sense that the
> > > VFs would be created "fully formed" and the VF Table Size would be
> > > completely read-only as documented.
> >
> > That is not how SR-IOV is used in the real world. Cloud providers create
> > many VFs but don't assign those VFs yet. They do it after a customer
> > requests a VM with specific properties (number of CPUs), which in our
> > case translates to the number of MSI-X vectors.
> >
> > All this is done while other VFs are already in use, and we can't destroy
> > and recreate them at that point in time.
>
> Makes sense, thanks for this insight.  The need to change the MSI-X
> Table Size dynamically while other VFs are in use would also be useful
> in the commit log because it really helps motivate this design.

It is already written, but I will add more info.

>
> > > > > > > > +What:		/sys/bus/pci/devices/.../vfs_overlay/sriov_vf_total_msix
> > > > > > > > +Date:		January 2021
> > > > > > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > > > > > +Description:
> > > > > > > > +		This file is associated with the SR-IOV PFs.
> > > > > > > > +		It returns a total number of possible to configure MSI-X
> > > > > > > > +		vectors on the enabled VFs.
> > > > > > > > +
> > > > > > > > +		The values returned are:
> > > > > > > > +		 * > 0 - this will be total number possible to consume by VFs,
> > > > > > > > +		 * = 0 - feature is not supported
> > > >
> > > > > Not sure the "= 0" description is necessary here.  If the value
> > > > > returned is the number of MSI-X vectors available for assignment to
> > > > > VFs, "0" is a perfectly legitimate value.  It just means there are
> > > > > none.  It doesn't need to be described separately.
> > > >
> > > > I wanted to help users and remove ambiguity. For example, mlx5 drivers
> > > > will always implement the proper .set_...() callbacks, but for some
> > > > devices without the needed FW support the value will be 0. Instead of
> > > > misleading users with a wrong promise that the feature is supported but
> > > > has no available vectors, I decided to be more clear. For the users, 0
> > > > means: don't try, it is not working.
> > >
> > > Oh, you mean "feature is not supported by the FIRMWARE on some mlx5
> > > devices"?  I totally missed that; I thought you meant "not supported
> > > by the PF driver."  Why not have the PF driver detect that the
> > > firmware doesn't support the feature and just not expose the sysfs
> > > files at all in that case?
> >
> > I can do it and will do it, but this behavior is only possible because
> > mlx5 is flexible enough. I imagine that other devices won't be so
> > flexible; for example, they may decide to "close" this feature because
> > of other SW limitations.
>
> What about something like this?
>
>   This file contains the total number of MSI-X vectors available for
>   assignment to all VFs associated with this PF.  It may be zero if
>   the device doesn't support this functionality.

Sounds good.
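
For illustration, the PF-side show() for that file could then be as small
as something like this (a sketch only: the ->sriov_get_vf_total_msix()
driver callback name is an assumption here, and real code would also need
to hold the device lock while dereferencing pdev->driver):

static ssize_t sriov_vf_total_msix_show(struct device *dev,
                                        struct device_attribute *attr,
                                        char *buf)
{
        struct pci_dev *pdev = to_pci_dev(dev);
        u32 total = 0;

        /* 0 simply means "nothing to hand out" / no support */
        if (pdev->driver && pdev->driver->sriov_get_vf_total_msix)
                total = pdev->driver->sriov_get_vf_total_msix(pdev);

        return sysfs_emit(buf, "%u\n", total);
}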

>
> Side question: How does user-space use this?  This file contains a
> constant, and there's no interface to learn how many vectors are still
> available in the pool, right?

Right, it is easy for the kernel implementation and easy for the users.
They don't need the kernel to report the exact utilized number.

Every VF has vectors assigned from the beginning and there is no parallel
race between kernel and user space to claim resources in our flow, so by
the time the orchestration gets a chance to run, the system will be in a
steady state.

Access to the hypervisor with the ability to write to sysfs files requires
root permissions, and the management software already tracks the number of
users, their CPUs and the number of vectors.

>
> I guess the user just has to manage the values written to
> .../VF<n>/sriov_vf_msix_count so the sum of all of them never exceeds
> sriov_vf_total_msix?  If that user app crashes, I guess it can
> reconstruct this state by reading all the
> .../VF<n>/sriov_vf_msix_count files?

Yes, if the orchestration software crashes on the hypervisor, it will
simply iterate over all the VFs again from the beginning.
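
To make that concrete, the user-space side can stay as small as this
(userspace C sketch; the PF address 0000:08:00.0 and the VF count of 5
are just example assumptions, and it presumes the files sit directly
under the PF/VF device directories, i.e. without the vfs_overlay folder):

#include <stdio.h>

#define PF "/sys/bus/pci/devices/0000:08:00.0"

int main(void)
{
        unsigned int total = 0, used = 0, count, vf;
        char path[128];
        FILE *f;

        /* Size of the pool advertised by the PF */
        f = fopen(PF "/sriov_vf_total_msix", "r");
        if (!f)
                return 1;
        if (fscanf(f, "%u", &total) != 1)
                total = 0;
        fclose(f);

        /* Rebuild the current distribution by walking the VFs */
        for (vf = 0; vf < 5; vf++) {
                snprintf(path, sizeof(path),
                         PF "/virtfn%u/sriov_vf_msix_count", vf);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fscanf(f, "%u", &count) == 1)
                        used += count;
                fclose(f);
        }

        printf("pool: %u, assigned: %u, free: %u\n",
               total, used, total - used);
        return 0;
}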

>
> > > > > If we put them in a new "vfs_overlay" directory, it seems like
> > > > > overkill to repeat the "vf" part, but I'm hoping the new files can end
> > > > > up next to these existing files.  In that case, I think it makes sense
> > > > > to include "sriov".  And it probably does make sense to include "vf"
> > > > > as well.
> > > >
> > > > I put everything in a folder to group any possible future extensions.
> > > > Those extensions are applicable to SR-IOV VFs only; IMHO they deserve
> > > > a separate folder.
> > >
> > > I'm not convinced (yet) that the possibility of future extensions is
> > > enough justification for adding the "vfs_overlay" directory.  It
> > > really complicates the code flow -- if we skipped the new directory,
> > > I'm pretty sure we could make .is_visible() work, which would be a
> > > major simplification.
> >
> > Unfortunately not, is_visible() is not dynamic; it is called when the
> > device creates its sysfs entries and they are not rescanned or refreshed
> > afterwards. It means that all files from the vfs_overlay folder will
> > exist even after the driver is removed from the PF.
> >
> > Also, sysfs is created before the driver is probed.
>
> Ah, both excellent points, that makes this much clearer to me, thank
> you!
>
> > > And there's quite a bit of value in the new files being right next to
> > > the existing sriov_* files.
> >
> > I have no problem dropping the folder, I just need a clear request on
> > whether I should.
>
> I think we should drop the folder.  I missed the previous discussion
> though, so if somebody has a good objection to dropping it, let me
> know.

Just to be clear, dropping the "vfs_overlay" folder won't remove any of
the complexity around pci_enable_vfs_overlay()/pci_disable_vfs_overlay();
it will simply mean changing the name attribute in the attribute_group.
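
On the sysfs side the folder only exists because of the .name field of
that group, i.e. roughly (attribute and variable names here are
illustrative, not the exact ones from the patch):

static struct attribute *sriov_vf_dev_attrs[] = {
        &dev_attr_sriov_vf_msix_count.attr,
        NULL,
};

static const struct attribute_group sriov_vf_dev_attr_group = {
        .attrs = sriov_vf_dev_attrs,
        .name  = "vfs_overlay",  /* dropping the folder == dropping this line */
};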

>
> Dropping the folder has the potential advantage that we *could* decide
> to expose these files always, and just have them be non-functional if
> the device doesn't support assignment or the PF driver isn't bound.
> Then I think we could use static sysfs attributes without
> pci_enable_vfs_overlay().

Such static initialization existed in the v0..v3 versions, but the request
was for the new files to be PF-driven, and that requires enable/disable
callbacks.

They need to allow the flow where the PF driver is bound after the VF
devices were already created, for example this flow (a rough sketch of the
resulting probe/remove hooks follows the list):
1. modprobe mlx5_core
2. echo 5 > .../sriov_numvf
3. rmmod mlx5_core
...
4. modprobe mlx5_core <- need to reattach to already existing VFs.
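
In driver terms, the probe/remove paths end up looking roughly like this
(sketch only: pf_probe()/pf_remove() are made-up names, error handling is
dropped and the exact pci_enable_vfs_overlay()/pci_disable_vfs_overlay()
prototypes are whatever this series finally settles on):

static int pf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        /* ... usual PF initialization ... */

        /*
         * VFs may already exist from a previous bind (steps 1-3 above),
         * so (re)create the per-VF sysfs files for them now.
         */
        pci_enable_vfs_overlay(pdev);
        return 0;
}

static void pf_remove(struct pci_dev *pdev)
{
        pci_disable_vfs_overlay(pdev);
        /* ... usual PF teardown ... */
}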

Thanks

>
> Bjorn

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-02-06 12:43 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-26  8:57 [PATCH mlx5-next v5 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
2021-01-26  8:57 ` [PATCH mlx5-next v5 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
2021-02-02  7:02   ` Leon Romanovsky
2021-02-02 18:06   ` Bjorn Helgaas
2021-02-02 19:44     ` Leon Romanovsky
2021-02-04  0:10       ` Bjorn Helgaas
2021-02-04 15:50         ` Leon Romanovsky
2021-02-04 21:12           ` Bjorn Helgaas
2021-02-05 17:35             ` Leon Romanovsky
2021-02-05 22:57               ` Bjorn Helgaas
2021-02-06 12:42                 ` Leon Romanovsky
2021-01-26  8:57 ` [PATCH mlx5-next v5 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
2021-01-26  8:57 ` [PATCH mlx5-next v5 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
2021-02-02 17:25   ` Bjorn Helgaas
2021-02-02 19:11     ` Leon Romanovsky
2021-01-26  8:57 ` [PATCH mlx5-next v5 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).