* [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count
@ 2021-02-09 13:34 Leon Romanovsky
  2021-02-09 13:34 ` [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-09 13:34 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

Changelog
v6:
 * Patch 1:
   * English fixes
   * Moved pci_vf_set_msix_vec_count() from msi.c to iov.c
   * Embedded pci_vf_set_msix_vec_count() into sriov_vf_msix_count_store
   * Deleted sriov_vf_msix_count_show
   * Deleted vfs_overlay folder
   * Renamed functions *_vfs_overlay_* to be *_vf_overlay_*
   * Deleted is_supported and attribute_group because they confused people
     more than they helped.
   * Changed vf_total_msix to be callback
 * Patch 3:
   * Fixed English as suggested by Bjorn
   * Added more explanations to the commit message
 * Patch 4:
   * Protected enable/disable with capability check
v5: https://lore.kernel.org/linux-pci/20210126085730.1165673-1-leon@kernel.org
 * Patch 1:
  * Added forgotten "inline" keyword when declaring empty functions.
v4: https://lore.kernel.org/linux-pci/20210124131119.558563-1-leon@kernel.org
 * Used sysfs_emit() instead of sprintf() in new sysfs entries.
 * Changed EXPORT_SYMBOL to be EXPORT_SYMBOL_GPL for pci_iov_virtfn_devfn().
 * Rewrote sysfs registration code to be driven by a PF that wants to enable
   the VF overlay, instead of creating it for all SR-IOV devices.
 * Grouped all such functionality under new "vfs_overlay" folder.
 * Combined two PCI patches into one.
v3: https://lore.kernel.org/linux-pci/20210117081548.1278992-1-leon@kernel.org
 * Renamed pci_set_msix_vec_count to be pci_vf_set_msix_vec_count.
 * Added VF msix_cap check to hide sysfs entry if device doesn't support msix.
 * Changed "-" to be ":" in the mlx5 patch to silence CI warnings about missing
   kdoc description.
 * Split the error print in the mlx5 driver differently to avoid a
   checkpatch warning.
v2: https://lore.kernel.org/linux-pci/20210114103140.866141-1-leon@kernel.org
 * Patch 1:
  * Renamed vf_msix_vec sysfs knob to be sriov_vf_msix_count
  * Added PF and VF device locks during set MSI-X call to protect from parallel
    driver bind/unbind operations.
  * Removed extra checks when reading sriov_vf_msix, because users will
    be able to distinguish between supported/not supported by looking at
    the sriov_vf_total_msix count.
  * Changed all occurrences of "numb" to be "count"
  * Changed the returned error from EOPNOTSUPP to EBUSY if the user tries to
    set the MSI-X count after a driver is already bound to the VF.
  * Added extra comment in pci_set_msix_vec_count() to emphasize that driver
    should not be bound.
 * Patch 2:
  * Changed vf_total_msix from int to be u32 and updated function signatures
    accordingly.
  * Improved patch title
v1: https://lore.kernel.org/linux-pci/20210110150727.1965295-1-leon@kernel.org
 * Improved wording and commit messages of first PCI patch
 * Added extra PCI patch to provide total number of MSI-X vectors
 * Prohibited read of vf_msix_vec sysfs file if driver doesn't support write
 * Removed extra function definition in pci.h
v0: https://lore.kernel.org/linux-pci/20210103082440.34994-1-leon@kernel.org

--------------------------------------------------------------------
Hi,

The number of MSI-X vectors is a PCI property visible through lspci.
That field is read-only and is configured by the device.

Static assignment of MSI-X vectors doesn't make the best use of newly
created VFs, because the device cannot know in advance the future load
and configuration of the VM where each VF will be used.

The VFs are created on the hypervisor and forwarded to VMs that have
different properties (for example, the number of CPUs).

To overcome this inefficient spread of MSI-X vectors, we allow the
kernel to instruct the device with the needed number of such vectors,
before the VF is initialized and bound to a driver.

Before this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

Configuration script:
1. Start fresh
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
modprobe -q -r mlx5_ib mlx5_core
2. Ensure that driver doesn't run and it is safe to change MSI-X
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_drivers_autoprobe
3. Load driver for the PF
modprobe mlx5_core
4. Configure one of the VFs with new number
echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
echo 21 > /sys/bus/pci/devices/0000\:08\:00.2/sriov_vf_msix_count

After this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=21 Masked-

Thanks

Leon Romanovsky (4):
  PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  net/mlx5: Add dynamic MSI-X capabilities bits
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Allow to the users to configure number of MSI-X vectors

 Documentation/ABI/testing/sysfs-bus-pci       |  28 ++++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  17 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  27 ++++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  72 +++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  58 ++++++-
 drivers/pci/iov.c                             | 153 ++++++++++++++++++
 include/linux/mlx5/mlx5_ifc.h                 |  11 +-
 include/linux/pci.h                           |  12 ++
 8 files changed, 375 insertions(+), 3 deletions(-)

--
2.29.2



* [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-09 13:34 [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-02-09 13:34 ` Leon Romanovsky
  2021-02-15 21:01   ` Bjorn Helgaas
  2021-02-09 13:34 ` [PATCH mlx5-next v6 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-09 13:34 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

Extend PCI sysfs interface with a new callback that allows configuration
of the number of MSI-X vectors for specific SR-IOV VF. This is needed
to optimize the performance of VFs devices by allocating the number of
vectors based on the administrator knowledge of the intended use of the VF.

This function is applicable for SR-IOV VF because such devices allocate
their MSI-X table before they will run on the VMs and HW can't guess the
right number of vectors, so some devices allocate them statically and equally.

1) The newly added /sys/bus/pci/devices/.../sriov_vf_msix_count
file will be seen for the VFs and it is writable as long as a driver is not
bound to the VF.

The values accepted are:
 * > 0 - this will be number reported by the Table Size in the VF's MSI-X Message
         Control register
 * < 0 - not valid
 * = 0 - will reset to the device default value

2) In order to make management easy, provide new read-only sysfs file that
returns a total number of possible to configure MSI-X vectors.

cat /sys/bus/pci/devices/.../sriov_vf_total_msix
  = 0 - feature is not supported
  > 0 - total number of MSI-X vectors available for distribution among the VFs
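
For illustration only, below is a minimal sketch of how a PF driver might
wire up these hooks. All foo_* names and the foo_dev state are hypothetical
placeholders, and the device command is a stub; the real implementation for
mlx5 is in patches 3 and 4 of this series.

#include <linux/pci.h>

/* Hypothetical per-PF driver state; not a real driver. */
struct foo_dev {
	u32 total_dyn_msix;	/* size of the device's dynamic MSI-X pool */
};

/* Stand-in for a device-specific firmware/configuration command. */
static int foo_cmd_set_vf_msix(struct foo_dev *fdev, int vf_devfn, int count)
{
	return 0;
}

static u32 foo_sriov_get_vf_total_msix(struct pci_dev *pf)
{
	struct foo_dev *fdev = pci_get_drvdata(pf);

	return fdev->total_dyn_msix;	/* 0 means "not supported" */
}

static int foo_sriov_set_msix_vec_count(struct pci_dev *vf, int count)
{
	struct pci_dev *pf = pci_physfn(vf);
	struct foo_dev *fdev = pci_get_drvdata(pf);

	/*
	 * Translate the VF to whatever ID the device uses and update the
	 * Table Size that the VF will report in its MSI-X capability.
	 */
	return foo_cmd_set_vf_msix(fdev, vf->devfn, count);
}

static int foo_sriov_configure(struct pci_dev *pf, int num_vfs)
{
	int ret;

	if (!num_vfs) {
		pci_disable_vf_overlay(pf);	/* remove the new sysfs files */
		pci_disable_sriov(pf);
		return 0;
	}

	ret = pci_enable_sriov(pf, num_vfs);
	if (ret)
		return ret;

	/* VFs exist now; expose sriov_vf_total_msix/sriov_vf_msix_count */
	ret = pci_enable_vf_overlay(pf);
	if (ret) {
		pci_disable_sriov(pf);
		return ret;
	}
	return num_vfs;
}

static struct pci_driver foo_driver = {
	.name			  = "foo",
	/* .id_table, .probe, .remove, etc. omitted */
	.sriov_configure	  = foo_sriov_configure,
	.sriov_get_vf_total_msix  = foo_sriov_get_vf_total_msix,
	.sriov_set_msix_vec_count = foo_sriov_set_msix_vec_count,
};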

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/ABI/testing/sysfs-bus-pci |  28 +++++
 drivers/pci/iov.c                       | 153 ++++++++++++++++++++++++
 include/linux/pci.h                     |  12 ++
 3 files changed, 193 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 25c9c39770c6..7dadc3610959 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -375,3 +375,31 @@ Description:
 		The value comes from the PCI kernel device state and can be one
 		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
 		The file is read only.
+
+What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
+Date:		January 2021
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with the SR-IOV PFs.
+		It contains the total number of MSI-X vectors available for
+		assignment to all VFs associated with this PF. It may be zero
+		if the device doesn't support this functionality.
+
+What:		/sys/bus/pci/devices/.../sriov_vf_msix_count
+Date:		January 2021
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with the SR-IOV VFs.
+		It allows configuration of the number of MSI-X vectors for
+		the VF. This is needed to optimize performance of newly bound
+		devices by allocating the number of vectors based on the
+		administrator knowledge of targeted VM.
+
+		The values accepted are:
+		 * > 0 - this will be number reported by the VF's MSI-X
+			 capability
+		 * < 0 - not valid
+		 * = 0 - will reset to the device default value
+
+		The file is writable if the PF is bound to a driver that
+		implements ->sriov_set_msix_vec_count().
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4afd4ee4f7f0..c0554aa6b90a 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * vf_id) & 0xff;
 }
+EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);

 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
@@ -157,6 +158,158 @@ int pci_iov_sysfs_link(struct pci_dev *dev,
 	return rc;
 }

+#ifdef CONFIG_PCI_MSI
+static ssize_t sriov_vf_msix_count_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct pci_dev *vf_dev = to_pci_dev(dev);
+	struct pci_dev *pdev = pci_physfn(vf_dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val < 0)
+		return -EINVAL;
+
+	device_lock(&pdev->dev);
+	if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
+		ret = -EOPNOTSUPP;
+		goto err_pdev;
+	}
+
+	device_lock(&vf_dev->dev);
+	if (vf_dev->driver) {
+		/*
+		 * Driver already probed this VF and configured itself
+		 * based on previously configured (or default) MSI-X vector
+		 * count. It is too late to change this field for this
+		 * specific VF.
+		 */
+		ret = -EBUSY;
+		goto err_dev;
+	}
+
+	ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);
+
+err_dev:
+	device_unlock(&vf_dev->dev);
+err_pdev:
+	device_unlock(&pdev->dev);
+	return ret ? : count;
+}
+static DEVICE_ATTR_WO(sriov_vf_msix_count);
+
+static ssize_t sriov_vf_total_msix_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	u32 vf_total_msix;
+
+	device_lock(dev);
+	if (!pdev->driver || !pdev->driver->sriov_get_vf_total_msix) {
+		device_unlock(dev);
+		return -EOPNOTSUPP;
+	}
+	vf_total_msix = pdev->driver->sriov_get_vf_total_msix(pdev);
+	device_unlock(dev);
+
+	return sysfs_emit(buf, "%u\n", vf_total_msix);
+}
+static DEVICE_ATTR_RO(sriov_vf_total_msix);
+#endif
+
+static const struct attribute *sriov_pf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_total_msix.attr,
+#endif
+	NULL,
+};
+
+static const struct attribute *sriov_vf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_msix_count.attr,
+#endif
+	NULL,
+};
+
+/*
+ * The PF can change the specific properties of associated VFs. Such
+ * functionality is usually known after PF probed and PCI sysfs files
+ * were already created.
+ *
+ * The function below is driven by such PF. It adds sysfs files to already
+ * existing PF/VF sysfs device hierarchies.
+ */
+int pci_enable_vf_overlay(struct pci_dev *dev)
+{
+	struct pci_dev *virtfn;
+	int id, ret;
+
+	if (!dev->is_physfn || !dev->sriov->num_VFs)
+		return 0;
+
+	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
+	if (ret)
+		return ret;
+
+	for (id = 0; id < dev->sriov->num_VFs; id++) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		ret = sysfs_create_files(&virtfn->dev.kobj,
+					 sriov_vf_dev_attrs);
+		if (ret)
+			goto out;
+	}
+	return 0;
+
+out:
+	while (id--) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		sysfs_remove_files(&virtfn->dev.kobj, sriov_vf_dev_attrs);
+	}
+	sysfs_remove_files(&dev->dev.kobj, sriov_pf_dev_attrs);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_enable_vf_overlay);
+
+void pci_disable_vf_overlay(struct pci_dev *dev)
+{
+	struct pci_dev *virtfn;
+	int id;
+
+	if (!dev->is_physfn || !dev->sriov->num_VFs)
+		return;
+
+	id = dev->sriov->num_VFs;
+	while (id--) {
+		virtfn = pci_get_domain_bus_and_slot(
+			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
+			pci_iov_virtfn_devfn(dev, id));
+
+		if (!virtfn)
+			continue;
+
+		sysfs_remove_files(&virtfn->dev.kobj, sriov_vf_dev_attrs);
+	}
+	sysfs_remove_files(&dev->dev.kobj, sriov_pf_dev_attrs);
+}
+EXPORT_SYMBOL_GPL(pci_disable_vf_overlay);
+
 int pci_iov_add_virtfn(struct pci_dev *dev, int id)
 {
 	int i;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b32126d26997..732611937574 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -856,6 +856,11 @@ struct module;
  *		e.g. drivers/net/e100.c.
  * @sriov_configure: Optional driver callback to allow configuration of
  *		number of VFs to enable via sysfs "sriov_numvfs" file.
+ * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
+ *              to configure via sysfs "sriov_vf_msix_count" entry. This will
+ *              change MSI-X Table Size in their Message Control registers.
+ * @sriov_get_vf_total_msix: Total number of MSI-X veectors to distribute
+ *              to the VFs
  * @err_handler: See Documentation/PCI/pci-error-recovery.rst
  * @groups:	Sysfs attribute groups.
  * @driver:	Driver model structure.
@@ -871,6 +876,8 @@ struct pci_driver {
 	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
 	void (*shutdown)(struct pci_dev *dev);
 	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
+	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
+	u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
 	const struct pci_error_handlers *err_handler;
 	const struct attribute_group **groups;
 	struct device_driver	driver;
@@ -2059,6 +2066,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
 int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
 int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);

+int pci_enable_vf_overlay(struct pci_dev *dev);
+void pci_disable_vf_overlay(struct pci_dev *dev);
+
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);

@@ -2100,6 +2110,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id)
 }
 static inline void pci_iov_remove_virtfn(struct pci_dev *dev,
 					 int id) { }
+static inline int pci_enable_vf_overlay(struct pci_dev *dev) { return 0; }
+static inline void pci_disable_vf_overlay(struct pci_dev *dev) { }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
 static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
 static inline int pci_vfs_assigned(struct pci_dev *dev)
--
2.29.2



* [PATCH mlx5-next v6 2/4] net/mlx5: Add dynamic MSI-X capabilities bits
  2021-02-09 13:34 [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-02-09 13:34 ` [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
@ 2021-02-09 13:34 ` Leon Romanovsky
  2021-02-09 13:34 ` [PATCH mlx5-next v6 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-09 13:34 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

These new fields declare the number of MSI-X vectors that can be
allocated to a VF through PF configuration.

The value must be in the range defined by min_dynamic_vf_msix_table_size
and max_dynamic_vf_msix_table_size.

The VF driver should continue to query its MSI-X table size through the
PCI configuration header.
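
For context, a short sketch of how these capability fields are read with
the standard mlx5 capability macros (the actual consumers are added in the
next patch):

	int num_vf_msix, min_msix, max_msix;

	/* Total vectors in the dynamic pool; 0 means the feature is absent */
	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);

	/* Per-VF limits that a requested value must respect */
	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);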

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 77051bd5c1cf..ffe2c7231ae4 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1677,7 +1677,16 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8	   reserved_at_6e0[0x10];
 	u8	   sf_base_id[0x10];
 
-	u8	   reserved_at_700[0x80];
+	u8	   reserved_at_700[0x8];
+	u8	   num_total_dynamic_vf_msix[0x18];
+	u8	   reserved_at_720[0x14];
+	u8	   dynamic_msix_table_size[0xc];
+	u8	   reserved_at_740[0xc];
+	u8	   min_dynamic_vf_msix_table_size[0x4];
+	u8	   reserved_at_750[0x4];
+	u8	   max_dynamic_vf_msix_table_size[0xc];
+
+	u8	   reserved_at_760[0x20];
 	u8	   vhca_tunnel_commands[0x40];
 	u8	   reserved_at_7c0[0x40];
 };
-- 
2.29.2



* [PATCH mlx5-next v6 3/4] net/mlx5: Dynamically assign MSI-X vectors count
  2021-02-09 13:34 [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-02-09 13:34 ` [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
  2021-02-09 13:34 ` [PATCH mlx5-next v6 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
@ 2021-02-09 13:34 ` Leon Romanovsky
  2021-02-09 13:34 ` [PATCH mlx5-next v6 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky
  2021-02-09 21:06 ` [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Alexander Duyck
  4 siblings, 0 replies; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-09 13:34 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

The number of MSI-X vectors is a PCI property visible through lspci.
That field is read-only and is configured by the device. mlx5 devices
work in one of two modes: static or dynamic MSI-X assignment.

Static assignment means that all newly created VFs get a fixed number
of MSI-X vectors. Such partitioning doesn't make the best use of a
newly created VF, because the device cannot know in advance the future
load and configuration of the VM where that VF will be used.

The second mode is dynamic assignment, provided to overcome this
inefficient spread of MSI-X vectors. It allows the kernel to convey a
new number of vectors based on the administrator's knowledge.

For example, if a VF is bound to a powerful VM with many CPUs, the
administrator will configure a larger number of MSI-X vectors than the
static mode would provide, which uses the mlx5 device more efficiently.

The assignment is performed before a driver is probed on that VF.

Due to mlx5 implementation details, the device provides more vectors in
dynamic mode than in static mode. This patch enables the feature; on a
system with 2 VFs, that change alone immediately increases the number
of MSI-X vectors per VF from 12 to 32.
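
As a rough illustration of where the per-VF default comes from, with
hypothetical capability values chosen only to make the arithmetic concrete
(they are not read from real hardware):

  num_total_dynamic_vf_msix      = 2048
  max_dynamic_vf_msix_table_size = 64
  min_dynamic_vf_msix_table_size = 2

  default for 2 VFs = max(min(num_vf_msix / num_vfs, max_msix / 2), min_msix)
                    = max(min(2048 / 2, 64 / 2), 2)
                    = 32 vectors per VF

which matches the Count=32 shown in the "After this patch" output below.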

Before this patch:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

After this patch:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=32 Masked-

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    |  4 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  5 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 72 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 13 +++-
 4 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index ca6f2fc39ea0..79cfcc844156 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -567,6 +567,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
 
+	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
+		MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix,
+			 MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix));
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 0a0302ce7144..5babb4434a87 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -172,6 +172,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
 int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
+
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
+			    int msix_vec_count);
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
+
 struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 6fd974920394..715c801ae865 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -55,6 +55,78 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev *dev, int vecidx)
 	return &irq_table->irq[vecidx];
 }
 
+/**
+ * mlx5_get_default_msix_vec_count - Get default number of MSI-X vectors
+ * to be assigned to each VF.
+ * @dev: PF to work on
+ * @num_vfs: Number of enabled VFs.
+ */
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
+{
+	int num_vf_msix, min_msix, max_msix;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	/* Limit the maximum number of MSI-X vectors to leave some of them
+	 * free in the pool, ready to be assigned by the users without the
+	 * need to resize other VFs.
+	 */
+	return max(min(num_vf_msix / num_vfs, max_msix / 2), min_msix);
+}
+
+/**
+ * mlx5_set_msix_vec_count - Set dynamically allocated MSI-X count for a VF
+ * @dev: PF to work on
+ * @function_id: Internal PCI VF function ID
+ * @msix_vec_count: Number of MSI-X vectors to set
+ */
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
+			    int msix_vec_count)
+{
+	int sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
+	int num_vf_msix, min_msix, max_msix;
+	void *hca_cap, *cap;
+	int ret;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	if (!MLX5_CAP_GEN(dev, vport_group_manager) || !mlx5_core_is_pf(dev))
+		return -EOPNOTSUPP;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	if (msix_vec_count < min_msix)
+		return -EINVAL;
+
+	if (msix_vec_count > max_msix)
+		return -EOVERFLOW;
+
+	hca_cap = kzalloc(sz, GFP_KERNEL);
+	if (!hca_cap)
+		return -ENOMEM;
+
+	cap = MLX5_ADDR_OF(set_hca_cap_in, hca_cap, capability);
+	MLX5_SET(cmd_hca_cap, cap, dynamic_msix_table_size, msix_vec_count);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, opcode, MLX5_CMD_OP_SET_HCA_CAP);
+	MLX5_SET(set_hca_cap_in, hca_cap, other_function, 1);
+	MLX5_SET(set_hca_cap_in, hca_cap, function_id, function_id);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1);
+	ret = mlx5_cmd_exec_in(dev, set_hca_cap, hca_cap);
+	kfree(hca_cap);
+	return ret;
+}
+
 int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 3094d20297a9..f0ec86a1c8a6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -71,8 +71,7 @@ static int sriov_restore_guids(struct mlx5_core_dev *dev, int vf)
 static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 {
 	struct mlx5_core_sriov *sriov = &dev->priv.sriov;
-	int err;
-	int vf;
+	int err, vf, num_msix_count;
 
 	if (!MLX5_ESWITCH_MANAGER(dev))
 		goto enable_vfs_hca;
@@ -85,12 +84,22 @@ static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 	}
 
 enable_vfs_hca:
+	num_msix_count = mlx5_get_default_msix_vec_count(dev, num_vfs);
 	for (vf = 0; vf < num_vfs; vf++) {
 		err = mlx5_core_enable_hca(dev, vf + 1);
 		if (err) {
 			mlx5_core_warn(dev, "failed to enable VF %d (%d)\n", vf, err);
 			continue;
 		}
+
+		err = mlx5_set_msix_vec_count(dev, vf + 1, num_msix_count);
+		if (err) {
+			mlx5_core_warn(dev,
+				       "failed to set MSI-X vector counts VF %d, err %d\n",
+				       vf, err);
+			continue;
+		}
+
 		sriov->vfs_ctx[vf].enabled = 1;
 		if (MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_IB) {
 			err = sriov_restore_guids(dev, vf);
-- 
2.29.2



* [PATCH mlx5-next v6 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors
  2021-02-09 13:34 [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (2 preceding siblings ...)
  2021-02-09 13:34 ` [PATCH mlx5-next v6 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-02-09 13:34 ` Leon Romanovsky
  2021-02-09 21:06 ` [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Alexander Duyck
  4 siblings, 0 replies; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-09 13:34 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

From: Leon Romanovsky <leonro@nvidia.com>

Implement the ability to configure the number of MSI-X vectors for SR-IOV
VFs by wiring the new PCI sysfs callbacks into the mlx5 PF driver.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    | 13 ++++++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   | 22 +++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 45 +++++++++++++++++++
 3 files changed, 80 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 79cfcc844156..db59c51e148e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1395,6 +1395,14 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_load_one;
 	}
 
+	err = mlx5_enable_vf_overlay(dev);
+	if (err) {
+		mlx5_core_err(dev,
+			      "mlx5_enable_vf_overlay failed with error code %d\n",
+			      err);
+		goto err_vf_overlay;
+	}
+
 	err = mlx5_crdump_enable(dev);
 	if (err)
 		dev_err(&pdev->dev, "mlx5_crdump_enable failed with error code %d\n", err);
@@ -1403,6 +1411,8 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	devlink_reload_enable(devlink);
 	return 0;
 
+err_vf_overlay:
+	mlx5_unload_one(dev, true);
 err_load_one:
 	mlx5_pci_close(dev);
 pci_init_err:
@@ -1423,6 +1433,7 @@ static void remove_one(struct pci_dev *pdev)
 	devlink_reload_disable(devlink);
 	mlx5_crdump_disable(dev);
 	mlx5_drain_health_wq(dev);
+	mlx5_disable_vf_overlay(dev);
 	mlx5_unload_one(dev, true);
 	mlx5_pci_close(dev);
 	mlx5_mdev_uninit(dev);
@@ -1650,6 +1661,8 @@ static struct pci_driver mlx5_core_driver = {
 	.shutdown	= shutdown,
 	.err_handler	= &mlx5_err_handler,
 	.sriov_configure   = mlx5_core_sriov_configure,
+	.sriov_get_vf_total_msix = mlx5_sriov_get_vf_total_msix,
+	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
 };
 
 static void mlx5_core_verify_params(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 5babb4434a87..e5c2fa558f77 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -138,6 +138,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev);
 int mlx5_sriov_attach(struct mlx5_core_dev *dev);
 void mlx5_sriov_detach(struct mlx5_core_dev *dev);
 int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs);
+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count);
 int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_create_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy,
@@ -264,4 +265,25 @@ void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state);
 
 void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup);
 int mlx5_load_one(struct mlx5_core_dev *dev, bool boot);
+
+static inline int mlx5_enable_vf_overlay(struct mlx5_core_dev *dev)
+{
+	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
+		return pci_enable_vf_overlay(dev->pdev);
+
+	return 0;
+}
+
+static inline void mlx5_disable_vf_overlay(struct mlx5_core_dev *dev)
+{
+	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
+		pci_disable_vf_overlay(dev->pdev);
+}
+
+static inline u32 mlx5_sriov_get_vf_total_msix(struct pci_dev *pdev)
+{
+	struct mlx5_core_dev *dev = pci_get_drvdata(pdev);
+
+	return MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+}
 #endif /* __MLX5_CORE_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index f0ec86a1c8a6..446bfdfce4a2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -156,6 +156,15 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs)
 	if (err) {
 		mlx5_core_warn(dev, "pci_enable_sriov failed : %d\n", err);
 		mlx5_device_disable_sriov(dev, num_vfs, true);
+		return err;
+	}
+
+	err = mlx5_enable_vf_overlay(dev);
+	if (err) {
+		mlx5_core_warn(dev, "mlx5_enable_vf_overlay failed : %d\n",
+			       err);
+		pci_disable_sriov(pdev);
+		mlx5_device_disable_sriov(dev, num_vfs, true);
 	}
 	return err;
 }
@@ -165,6 +174,7 @@ static void mlx5_sriov_disable(struct pci_dev *pdev)
 	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
 	int num_vfs = pci_num_vf(dev->pdev);
 
+	mlx5_disable_vf_overlay(dev);
 	pci_disable_sriov(pdev);
 	mlx5_device_disable_sriov(dev, num_vfs, true);
 }
@@ -187,6 +197,41 @@ int mlx5_core_sriov_configure(struct pci_dev *pdev, int num_vfs)
 	return err ? err : num_vfs;
 }
 
+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
+{
+	struct pci_dev *pf = pci_physfn(vf);
+	struct mlx5_core_sriov *sriov;
+	struct mlx5_core_dev *dev;
+	int num_vf_msix, id;
+
+	dev = pci_get_drvdata(pf);
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return -EOPNOTSUPP;
+
+	if (!msix_vec_count)
+		msix_vec_count =
+			mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf));
+
+	sriov = &dev->priv.sriov;
+
+	/* Reverse translation of the PCI VF function number to the internal
+	 * function_id, which is the index used in the virtfn symlink name.
+	 */
+	for (id = 0; id < pci_num_vf(pf); id++) {
+		if (!sriov->vfs_ctx[id].enabled)
+			continue;
+
+		if (vf->devfn == pci_iov_virtfn_devfn(pf, id))
+			break;
+	}
+
+	if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled)
+		return -EINVAL;
+
+	return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count);
+}
+
 int mlx5_sriov_attach(struct mlx5_core_dev *dev)
 {
 	if (!mlx5_core_is_pf(dev) || !pci_num_vf(dev->pdev))
-- 
2.29.2



* Re: [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count
  2021-02-09 13:34 [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (3 preceding siblings ...)
  2021-02-09 13:34 ` [PATCH mlx5-next v6 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky
@ 2021-02-09 21:06 ` Alexander Duyck
  2021-02-14  5:24   ` Leon Romanovsky
  4 siblings, 1 reply; 32+ messages in thread
From: Alexander Duyck @ 2021-02-09 21:06 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Tue, Feb 9, 2021 at 5:34 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> From: Leon Romanovsky <leonro@nvidia.com>

<snip>

> --------------------------------------------------------------------
> Hi,
>
> The number of MSI-X vectors is a PCI property visible through lspci.
> That field is read-only and is configured by the device.
>
> Static assignment of MSI-X vectors doesn't make the best use of newly
> created VFs, because the device cannot know in advance the future load
> and configuration of the VM where each VF will be used.
>
> The VFs are created on the hypervisor and forwarded to VMs that have
> different properties (for example, the number of CPUs).
>
> To overcome this inefficient spread of MSI-X vectors, we allow the
> kernel to instruct the device with the needed number of such vectors,
> before the VF is initialized and bound to a driver.
>
> Before this series:
> [root@server ~]# lspci -vs 0000:08:00.2
> 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
> ....
>         Capabilities: [9c] MSI-X: Enable- Count=12 Masked-
>
> Configuration script:
> 1. Start fresh
> echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
> modprobe -q -r mlx5_ib mlx5_core
> 2. Ensure that driver doesn't run and it is safe to change MSI-X
> echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_drivers_autoprobe
> 3. Load driver for the PF
> modprobe mlx5_core
> 4. Configure one of the VFs with new number
> echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
> echo 21 > /sys/bus/pci/devices/0000\:08\:00.2/sriov_vf_msix_count
>
> After this series:
> [root@server ~]# lspci -vs 0000:08:00.2
> 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
> ....
>         Capabilities: [9c] MSI-X: Enable- Count=21 Masked-
>
> Thanks
>
> Leon Romanovsky (4):
>   PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
>   net/mlx5: Add dynamic MSI-X capabilities bits
>   net/mlx5: Dynamically assign MSI-X vectors count
>   net/mlx5: Allow to the users to configure number of MSI-X vectors
>
>  Documentation/ABI/testing/sysfs-bus-pci       |  28 ++++
>  .../net/ethernet/mellanox/mlx5/core/main.c    |  17 ++
>  .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  27 ++++
>  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  72 +++++++++
>  .../net/ethernet/mellanox/mlx5/core/sriov.c   |  58 ++++++-
>  drivers/pci/iov.c                             | 153 ++++++++++++++++++
>  include/linux/mlx5/mlx5_ifc.h                 |  11 +-
>  include/linux/pci.h                           |  12 ++
>  8 files changed, 375 insertions(+), 3 deletions(-)
>

This seems much improved from the last time I reviewed the patch set.
I am good with the drop of the folder in favor of using "sriov" in the
naming of the fields.

For the series:
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>


* Re: [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count
  2021-02-09 21:06 ` [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Alexander Duyck
@ 2021-02-14  5:24   ` Leon Romanovsky
  0 siblings, 0 replies; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-14  5:24 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller

On Tue, Feb 09, 2021 at 01:06:25PM -0800, Alexander Duyck wrote:
> On Tue, Feb 9, 2021 at 5:34 AM Leon Romanovsky <leon@kernel.org> wrote:
> >

<...>

> > Leon Romanovsky (4):
> >   PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
> >   net/mlx5: Add dynamic MSI-X capabilities bits
> >   net/mlx5: Dynamically assign MSI-X vectors count
> >   net/mlx5: Allow to the users to configure number of MSI-X vectors
> >
> >  Documentation/ABI/testing/sysfs-bus-pci       |  28 ++++
> >  .../net/ethernet/mellanox/mlx5/core/main.c    |  17 ++
> >  .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  27 ++++
> >  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  72 +++++++++
> >  .../net/ethernet/mellanox/mlx5/core/sriov.c   |  58 ++++++-
> >  drivers/pci/iov.c                             | 153 ++++++++++++++++++
> >  include/linux/mlx5/mlx5_ifc.h                 |  11 +-
> >  include/linux/pci.h                           |  12 ++
> >  8 files changed, 375 insertions(+), 3 deletions(-)
> >
>
> This seems much improved from the last time I reviewed the patch set.
> I am good with the drop of the folder in favor of using "sriov" in the
> naming of the fields.
>
> For the series:
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>

Bjorn,

Can I get your Ack too, so we won't miss this merge window?

Thanks


* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-09 13:34 ` [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
@ 2021-02-15 21:01   ` Bjorn Helgaas
  2021-02-16  7:33     ` Leon Romanovsky
  0 siblings, 1 reply; 32+ messages in thread
From: Bjorn Helgaas @ 2021-02-15 21:01 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller

On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Extend PCI sysfs interface with a new callback that allows configuration
> of the number of MSI-X vectors for specific SR-IOV VF. This is needed
> to optimize the performance of VFs devices by allocating the number of
> vectors based on the administrator knowledge of the intended use of the VF.
> 
> This function is applicable for SR-IOV VF because such devices allocate
> their MSI-X table before they will run on the VMs and HW can't guess the
> right number of vectors, so some devices allocate them statically and equally.

This commit log should be clear that this functionality is motivated
by *mlx5* behavior.  The description above makes it sound like this is
generic PCI spec behavior, and it is not.

It may be a reasonable design that conforms to the spec, and we hope
the model will be usable by other designs, but it is not required by
the spec and AFAIK there is nothing in the spec you can point to as
background for this.

So don't *remove* the text you have above, but please *add* some
preceding background information about how mlx5 works.

> 1) The newly added /sys/bus/pci/devices/.../sriov_vf_msix_count
> file will be seen for the VFs and it is writable as long as a driver is not
> bound to the VF.

  This adds /sys/bus/pci/devices/.../sriov_vf_msix_count for VF
  devices and is writable ...

> The values accepted are:
>  * > 0 - this will be number reported by the Table Size in the VF's MSI-X Message
>          Control register
>  * < 0 - not valid
>  * = 0 - will reset to the device default value

  = 0 - will reset to a device-specific default value

> 2) In order to make management easy, provide new read-only sysfs file that
> returns a total number of possible to configure MSI-X vectors.

  For PF devices, this adds a read-only
  /sys/bus/pci/devices/.../sriov_vf_total_msix file that contains the
  total number of MSI-X vectors available for distribution among VFs.

Just as in sysfs-bus-pci, this file should be listed first, because
you must read it before you can use vf_msix_count.

> cat /sys/bus/pci/devices/.../sriov_vf_total_msix
>   = 0 - feature is not supported
>   > 0 - total number of MSI-X vectors available for distribution among the VFs
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci |  28 +++++
>  drivers/pci/iov.c                       | 153 ++++++++++++++++++++++++
>  include/linux/pci.h                     |  12 ++
>  3 files changed, 193 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 25c9c39770c6..7dadc3610959 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -375,3 +375,31 @@ Description:
>  		The value comes from the PCI kernel device state and can be one
>  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>  		The file is read only.
> +
> +What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
> +Date:		January 2021
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with the SR-IOV PFs.
> +		It contains the total number of MSI-X vectors available for
> +		assignment to all VFs associated with this PF. It may be zero
> +		if the device doesn't support this functionality.

s/associated with the/associated with/

> +What:		/sys/bus/pci/devices/.../sriov_vf_msix_count
> +Date:		January 2021
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with the SR-IOV VFs.
> +		It allows configuration of the number of MSI-X vectors for
> +		the VF. This is needed to optimize performance of newly bound
> +		devices by allocating the number of vectors based on the
> +		administrator knowledge of targeted VM.

s/associated with the/associated with/
s/knowledge of targeted VM/knowledge of how the VF will be used/

> +		The values accepted are:
> +		 * > 0 - this will be number reported by the VF's MSI-X
> +			 capability

  this number will be reported as the Table Size in the VF's MSI-X
  capability

> +		 * < 0 - not valid
> +		 * = 0 - will reset to the device default value
> +
> +		The file is writable if the PF is bound to a driver that
> +		implements ->sriov_set_msix_vec_count().
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4afd4ee4f7f0..c0554aa6b90a 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>  	return (dev->devfn + dev->sriov->offset +
>  		dev->sriov->stride * vf_id) & 0xff;
>  }
> +EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);
> 
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> @@ -157,6 +158,158 @@ int pci_iov_sysfs_link(struct pci_dev *dev,
>  	return rc;
>  }
> 
> +#ifdef CONFIG_PCI_MSI
> +static ssize_t sriov_vf_msix_count_store(struct device *dev,
> +					 struct device_attribute *attr,
> +					 const char *buf, size_t count)
> +{
> +	struct pci_dev *vf_dev = to_pci_dev(dev);
> +	struct pci_dev *pdev = pci_physfn(vf_dev);
> +	int val, ret;
> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	if (val < 0)
> +		return -EINVAL;
> +
> +	device_lock(&pdev->dev);
> +	if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
> +		ret = -EOPNOTSUPP;
> +		goto err_pdev;
> +	}
> +
> +	device_lock(&vf_dev->dev);
> +	if (vf_dev->driver) {
> +		/*
> +		 * Driver already probed this VF and configured itself
> +		 * based on previously configured (or default) MSI-X vector
> +		 * count. It is too late to change this field for this
> +		 * specific VF.
> +		 */
> +		ret = -EBUSY;
> +		goto err_dev;
> +	}
> +
> +	ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);
> +
> +err_dev:
> +	device_unlock(&vf_dev->dev);
> +err_pdev:
> +	device_unlock(&pdev->dev);
> +	return ret ? : count;
> +}
> +static DEVICE_ATTR_WO(sriov_vf_msix_count);
> +
> +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> +					struct device_attribute *attr,
> +					char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	u32 vf_total_msix;
> +
> +	device_lock(dev);
> +	if (!pdev->driver || !pdev->driver->sriov_get_vf_total_msix) {
> +		device_unlock(dev);
> +		return -EOPNOTSUPP;
> +	}
> +	vf_total_msix = pdev->driver->sriov_get_vf_total_msix(pdev);
> +	device_unlock(dev);
> +
> +	return sysfs_emit(buf, "%u\n", vf_total_msix);
> +}
> +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> +#endif
> +
> +static const struct attribute *sriov_pf_dev_attrs[] = {
> +#ifdef CONFIG_PCI_MSI
> +	&dev_attr_sriov_vf_total_msix.attr,
> +#endif
> +	NULL,
> +};
> +
> +static const struct attribute *sriov_vf_dev_attrs[] = {
> +#ifdef CONFIG_PCI_MSI
> +	&dev_attr_sriov_vf_msix_count.attr,
> +#endif
> +	NULL,
> +};
> +
> +/*
> + * The PF can change the specific properties of associated VFs. Such
> + * functionality is usually known after PF probed and PCI sysfs files
> + * were already created.

s/The PF can/The PF may be able to/

> + * The function below is driven by such PF. It adds sysfs files to already
> + * existing PF/VF sysfs device hierarchies.

  pci_enable_vf_overlay() and pci_disable_vf_overlay() should be
  called by PF drivers that support changing the number of MSI-X
  vectors assigned to their VFs.

> + */
> +int pci_enable_vf_overlay(struct pci_dev *dev)
> +{
> +	struct pci_dev *virtfn;
> +	int id, ret;
> +
> +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> +		return 0;
> +
> +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);

But I still don't like the fact that we're calling
sysfs_create_files() and sysfs_remove_files() directly.  It adds
complication and opportunities for errors.

I don't see the advantage of creating these files only when the PF
driver supports this.  The management tools have to deal with
sriov_vf_total_msix == 0 and sriov_vf_msix_count == 0 anyway.
Having the sysfs files not be present at all might be slightly
prettier to the person running "ls", but I'm not sure the code
complication is worth that.

I see a hint that Alex might have requested this "only visible when PF
driver supports it" functionality, but I don't see that email on
linux-pci, so I missed the background.

It's true that we have a clump of "sriov_*" sysfs files and this makes
the clump a little bigger.  I wish we had put them all inside an "iov"
directory to begin with, but that's water under the bridge.

> +	if (ret)
> +		return ret;
> +
> +	for (id = 0; id < dev->sriov->num_VFs; id++) {
> +		virtfn = pci_get_domain_bus_and_slot(
> +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> +			pci_iov_virtfn_devfn(dev, id));
> +
> +		if (!virtfn)
> +			continue;
> +
> +		ret = sysfs_create_files(&virtfn->dev.kobj,
> +					 sriov_vf_dev_attrs);
> +		if (ret)
> +			goto out;
> +	}
> +	return 0;
> +
> +out:
> +	while (id--) {
> +		virtfn = pci_get_domain_bus_and_slot(
> +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> +			pci_iov_virtfn_devfn(dev, id));
> +
> +		if (!virtfn)
> +			continue;
> +
> +		sysfs_remove_files(&virtfn->dev.kobj, sriov_vf_dev_attrs);
> +	}
> +	sysfs_remove_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_enable_vf_overlay);
> +
> +void pci_disable_vf_overlay(struct pci_dev *dev)
> +{
> +	struct pci_dev *virtfn;
> +	int id;
> +
> +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> +		return;
> +
> +	id = dev->sriov->num_VFs;
> +	while (id--) {
> +		virtfn = pci_get_domain_bus_and_slot(
> +			pci_domain_nr(dev->bus), pci_iov_virtfn_bus(dev, id),
> +			pci_iov_virtfn_devfn(dev, id));
> +
> +		if (!virtfn)
> +			continue;
> +
> +		sysfs_remove_files(&virtfn->dev.kobj, sriov_vf_dev_attrs);
> +	}
> +	sysfs_remove_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> +}
> +EXPORT_SYMBOL_GPL(pci_disable_vf_overlay);
> +
>  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>  {
>  	int i;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index b32126d26997..732611937574 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -856,6 +856,11 @@ struct module;
>   *		e.g. drivers/net/e100.c.
>   * @sriov_configure: Optional driver callback to allow configuration of
>   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> + *              to configure via sysfs "sriov_vf_msix_count" entry. This will
> + *              change MSI-X Table Size in their Message Control registers.

s/Driver callback/PF driver callback/
s/in their/in VF/

> + * @sriov_get_vf_total_msix: Total number of MSI-X veectors to distribute
> + *              to the VFs

s/Total number/PF driver callback to get the total number/
s/veectors/vectors/
s/to distribute/available for distribution/

>   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
>   * @groups:	Sysfs attribute groups.
>   * @driver:	Driver model structure.
> @@ -871,6 +876,8 @@ struct pci_driver {
>  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>  	void (*shutdown)(struct pci_dev *dev);
>  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
> +	u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
>  	const struct pci_error_handlers *err_handler;
>  	const struct attribute_group **groups;
>  	struct device_driver	driver;
> @@ -2059,6 +2066,9 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar);
>  int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>  int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> 
> +int pci_enable_vf_overlay(struct pci_dev *dev);
> +void pci_disable_vf_overlay(struct pci_dev *dev);
> +
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
> 
> @@ -2100,6 +2110,8 @@ static inline int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>  }
>  static inline void pci_iov_remove_virtfn(struct pci_dev *dev,
>  					 int id) { }
> +static inline int pci_enable_vf_overlay(struct pci_dev *dev) { return 0; }
> +static inline void pci_disable_vf_overlay(struct pci_dev *dev) { }
>  static inline void pci_disable_sriov(struct pci_dev *dev) { }
>  static inline int pci_num_vf(struct pci_dev *dev) { return 0; }
>  static inline int pci_vfs_assigned(struct pci_dev *dev)
> --
> 2.29.2
> 


* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-15 21:01   ` Bjorn Helgaas
@ 2021-02-16  7:33     ` Leon Romanovsky
  2021-02-16 16:12       ` Bjorn Helgaas
  0 siblings, 1 reply; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-16  7:33 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Extend PCI sysfs interface with a new callback that allows configuration
> > of the number of MSI-X vectors for specific SR-IOV VF. This is needed
> > to optimize the performance of VFs devices by allocating the number of
> > vectors based on the administrator knowledge of the intended use of the VF.
> >
> > This function is applicable for SR-IOV VF because such devices allocate
> > their MSI-X table before they will run on the VMs and HW can't guess the
> > right number of vectors, so some devices allocate them statically and equally.
>
> This commit log should be clear that this functionality is motivated
> by *mlx5* behavior.  The description above makes it sound like this is
> generic PCI spec behavior, and it is not.
>
> It may be a reasonable design that conforms to the spec, and we hope
> the model will be usable by other designs, but it is not required by
> the spec and AFAIK there is nothing in the spec you can point to as
> background for this.
>
> So don't *remove* the text you have above, but please *add* some
> preceding background information about how mlx5 works.
>
> > 1) The newly added /sys/bus/pci/devices/.../sriov_vf_msix_count
> > file will be seen for the VFs and it is writable as long as a driver is not
> > bound to the VF.
>
>   This adds /sys/bus/pci/devices/.../sriov_vf_msix_count for VF
>   devices and is writable ...
>
> > The values accepted are:
> >  * > 0 - this will be number reported by the Table Size in the VF's MSI-X Message
> >          Control register
> >  * < 0 - not valid
> >  * = 0 - will reset to the device default value
>
>   = 0 - will reset to a device-specific default value
>
> > 2) In order to make management easy, provide new read-only sysfs file that
> > returns a total number of possible to configure MSI-X vectors.
>
>   For PF devices, this adds a read-only
>   /sys/bus/pci/devices/.../sriov_vf_total_msix file that contains the
>   total number of MSI-X vectors available for distribution among VFs.
>
> Just as in sysfs-bus-pci, this file should be listed first, because
> you must read it before you can use vf_msix_count.

No problem, I'll change it; just remember that we are talking about the
commit message, because in the Documentation file the order is exactly as
you requested.

>
> > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> >   = 0 - feature is not supported
> >   > 0 - total number of MSI-X vectors available for distribution among the VFs
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-pci |  28 +++++
> >  drivers/pci/iov.c                       | 153 ++++++++++++++++++++++++
> >  include/linux/pci.h                     |  12 ++
> >  3 files changed, 193 insertions(+)

<...>

> > + */
> > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > +{
> > +	struct pci_dev *virtfn;
> > +	int id, ret;
> > +
> > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > +		return 0;
> > +
> > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
>
> But I still don't like the fact that we're calling
> sysfs_create_files() and sysfs_remove_files() directly.  It adds
> complication and opportunities for errors.

It is no different from any other code that we have in the kernel.
Let's be concrete: can you point to errors in this code that I
should fix?

>
> I don't see the advantage of creating these files only when the PF
> driver supports this.  The management tools have to deal with
> sriov_vf_total_msix == 0 and sriov_vf_msix_count == 0 anyway.
> Having the sysfs files not be present at all might be slightly
> prettier to the person running "ls", but I'm not sure the code
> complication is worth that.

It is more than "ls". Right now sriov_numvfs is visible regardless of
the driver, even if the driver doesn't implement ".sriov_configure",
which IMHO is bad. We didn't want to repeat that.

Right now, we have many devices that supports SR-IOV, but small amount
of them are capable to rewrite their VF MSI-X table siz. We don't want
"to punish" and clatter their sysfs.

>
> I see a hint that Alex might have requested this "only visible when PF
> driver supports it" functionality, but I don't see that email on
> linux-pci, so I missed the background.

First version of this patch had static files solution.
https://lore.kernel.org/linux-pci/20210103082440.34994-2-leon@kernel.org/#Z30drivers:pci:iov.c

Thanks

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-16  7:33     ` Leon Romanovsky
@ 2021-02-16 16:12       ` Bjorn Helgaas
  2021-02-16 19:58         ` Leon Romanovsky
  2021-02-16 20:37         ` Jason Gunthorpe
  0 siblings, 2 replies; 32+ messages in thread
From: Bjorn Helgaas @ 2021-02-16 16:12 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

Proposed subject:

  PCI/IOV: Add dynamic MSI-X vector assignment sysfs interface

On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote:
> On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>

Here's a draft of the sort of thing I'm looking for here:

  A typical cloud provider SR-IOV use case is to create many VFs for
  use by guest VMs.  The VFs may not be assigned to a VM until a
  customer requests a VM of a certain size, e.g., number of CPUs.  A
  VF may need MSI-X vectors proportional to the number of CPUs in the
  VM, but there is no standard way to change the number of MSI-X
  vectors supported by a VF.

  Some Mellanox ConnectX devices support dynamic assignment of MSI-X
  vectors to SR-IOV VFs.  This can be done by the PF driver after VFs
  are enabled, and it can be done without affecting VFs that are
  already in use.  The hardware supports a limited pool of MSI-X
  vectors that can be assigned to the PF or to individual VFs.  This
  is device-specific behavior that requires support in the PF driver.

  Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a
  writable "sriov_vf_msix_count" file for each VF.  Management
  software may use these to learn how many MSI-X vectors are available
  and to dynamically assign them to VFs before the VFs are passed
  through to a VM.

  If the PF driver implements the ->sriov_get_vf_total_msix()
  callback, "sriov_vf_total_msix" contains the total number of MSI-X
  vectors available for distribution among VFs.

  If no driver is bound to the VF, writing "N" to
  "sriov_vf_msix_count" uses the PF driver ->sriov_set_msix_vec_count()
  callback to assign "N" MSI-X vectors to the VF.  When a VF driver
  subsequently reads the MSI-X Message Control register, it will see
  the new Table Size "N".
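
(Illustration only, not text for the log: the wiring on the PF driver side
might look roughly like the sketch below.  The "foo" names and the firmware
helper are made up, and the exact callback signatures/placement may differ
in the final series.)

#include <linux/pci.h>

struct foo_dev {
	u32 vf_total_msix;	/* size of the dynamic MSI-X pool */
};

/* hypothetical device-specific command that resizes one VF's MSI-X table */
int foo_fw_set_vf_msix(struct foo_dev *fdev, struct pci_dev *vf, int count);

static u32 foo_sriov_get_vf_total_msix(struct pci_dev *pf)
{
	struct foo_dev *fdev = pci_get_drvdata(pf);

	return fdev->vf_total_msix;	/* reported via sriov_vf_total_msix */
}

static int foo_sriov_set_msix_vec_count(struct pci_dev *vf, int count)
{
	struct foo_dev *fdev = pci_get_drvdata(pci_physfn(vf));

	return foo_fw_set_vf_msix(fdev, vf, count);
}

static struct pci_driver foo_driver = {
	.name			  = "foo",
	.sriov_get_vf_total_msix  = foo_sriov_get_vf_total_msix,
	.sriov_set_msix_vec_count = foo_sriov_set_msix_vec_count,
};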

> > > Extend PCI sysfs interface with a new callback that allows configuration
> > > of the number of MSI-X vectors for specific SR-IOV VF. This is needed
> > > to optimize the performance of VFs devices by allocating the number of
> > > vectors based on the administrator knowledge of the intended use of the VF.
> > >
> > > This function is applicable for SR-IOV VF because such devices allocate
> > > their MSI-X table before they will run on the VMs and HW can't guess the
> > > right number of vectors, so some devices allocate them statically and equally.
> >
> > This commit log should be clear that this functionality is motivated
> > by *mlx5* behavior.  The description above makes it sound like this is
> > generic PCI spec behavior, and it is not.
> >
> > It may be a reasonable design that conforms to the spec, and we hope
> > the model will be usable by other designs, but it is not required by
> > the spec and AFAIK there is nothing in the spec you can point to as
> > background for this.
> >
> > So don't *remove* the text you have above, but please *add* some
> > preceding background information about how mlx5 works.
> >
> > > 1) The newly added /sys/bus/pci/devices/.../sriov_vf_msix_count
> > > file will be seen for the VFs and it is writable as long as a driver is not
> > > bound to the VF.
> >
> >   This adds /sys/bus/pci/devices/.../sriov_vf_msix_count for VF
> >   devices and is writable ...
> >
> > > The values accepted are:
> > >  * > 0 - this will be number reported by the Table Size in the VF's MSI-X Message
> > >          Control register
> > >  * < 0 - not valid
> > >  * = 0 - will reset to the device default value
> >
> >   = 0 - will reset to a device-specific default value
> >
> > > 2) In order to make management easy, provide new read-only sysfs file that
> > > returns a total number of possible to configure MSI-X vectors.
> >
> >   For PF devices, this adds a read-only
> >   /sys/bus/pci/devices/.../sriov_vf_total_msix file that contains the
> >   total number of MSI-X vectors available for distribution among VFs.
> >
> > Just as in sysfs-bus-pci, this file should be listed first, because
> > you must read it before you can use vf_msix_count.
> 
> No problem, I'll change it; just remember that we are talking about the commit
> message, because in the Documentation file the order is exactly as you request.

Yes, I noticed that, thank you!  It will be good to have them in the
same order in both the commit log and the Documentation file.  I think
it will make more sense to readers.

> > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > >   = 0 - feature is not supported
> > >   > 0 - total number of MSI-X vectors available for distribution among the VFs
> > >
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-pci |  28 +++++
> > >  drivers/pci/iov.c                       | 153 ++++++++++++++++++++++++
> > >  include/linux/pci.h                     |  12 ++
> > >  3 files changed, 193 insertions(+)
> 
> <...>
> 
> > > + */
> > > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > > +{
> > > +	struct pci_dev *virtfn;
> > > +	int id, ret;
> > > +
> > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > +		return 0;
> > > +
> > > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> >
> > But I still don't like the fact that we're calling
> > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > complication and opportunities for errors.
> 
> It is not different from any other code that we have in the kernel.

It *is* different.  There is a general rule that drivers should not
call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
still true that callers of sysfs_create_files() are very special, and
I'd prefer not to add another one.

> Let's be concrete, can you point to the errors in this code that I
> should fix?

I'm not saying there are current errors; I'm saying the additional
code makes errors possible in future code.  For example, we hope that
other drivers can use these sysfs interfaces, and it's possible they
may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay()
correctly.

Or there may be races in device addition/removal.  We have current
issues in this area, e.g., [2], and they're fairly subtle.  I'm not
saying your patches have these issues; only that extra code makes more
chances for mistakes and it's more work to validate it.

> > I don't see the advantage of creating these files only when the PF
> > driver supports this.  The management tools have to deal with
> > sriov_vf_total_msix == 0 and sriov_vf_msix_count == 0 anyway.
> > Having the sysfs files not be present at all might be slightly
> > prettier to the person running "ls", but I'm not sure the code
> > complication is worth that.
> 
> It is more than "ls"; right now sriov_numvfs is visible without relation
> to the driver, even if the driver doesn't implement ".sriov_configure", which
> IMHO is bad. We didn't want to repeat that.
> 
> Right now, we have many devices that support SR-IOV, but only a small number
> of them are capable of rewriting their VF MSI-X table size. We don't want
> "to punish" them and clutter their sysfs.

I agree, it's clutter, but at least it's just cosmetic clutter (but
I'm willing to hear discussion about why it's more than cosmetic; see
below).

From the management software point of view, I don't think it matters.
That software already needs to deal with files that don't exist (on
old kernels) and files that contain zero (feature not supported or no
vectors are available).

From my point of view, pci_enable_vf_overlay() or
pci_disable_vfs_overlay() are also clutter, at least compared to
static sysfs attributes.

> > I see a hint that Alex might have requested this "only visible when PF
> > driver supports it" functionality, but I don't see that email on
> > linux-pci, so I missed the background.
> 
> First version of this patch had static files solution.
> https://lore.kernel.org/linux-pci/20210103082440.34994-2-leon@kernel.org/#Z30drivers:pci:iov.c

Thanks for the pointer to the patch.  Can you point me to the
discussion about why we should use the "only visible when PF driver
supports it" model?

Bjorn

[1] https://lore.kernel.org/linux-pci/YBmG7qgIDYIveDfX@kroah.com/
[2] https://lore.kernel.org/linux-pci/20200716110423.xtfyb3n6tn5ixedh@pali/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-16 16:12       ` Bjorn Helgaas
@ 2021-02-16 19:58         ` Leon Romanovsky
  2021-02-17 18:02           ` Bjorn Helgaas
  2021-02-16 20:37         ` Jason Gunthorpe
  1 sibling, 1 reply; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-16 19:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote:
> Proposed subject:
>
>   PCI/IOV: Add dynamic MSI-X vector assignment sysfs interface
>
> On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote:
> > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
>
> Here's a draft of the sort of thing I'm looking for here:
>
>   A typical cloud provider SR-IOV use case is to create many VFs for
>   use by guest VMs.  The VFs may not be assigned to a VM until a
>   customer requests a VM of a certain size, e.g., number of CPUs.  A
>   VF may need MSI-X vectors proportional to the number of CPUs in the
>   VM, but there is no standard way to change the number of MSI-X
>   vectors supported by a VF.
>
>   Some Mellanox ConnectX devices support dynamic assignment of MSI-X
>   vectors to SR-IOV VFs.  This can be done by the PF driver after VFs
>   are enabled, and it can be done without affecting VFs that are
>   already in use.  The hardware supports a limited pool of MSI-X
>   vectors that can be assigned to the PF or to individual VFs.  This
>   is device-specific behavior that requires support in the PF driver.
>
>   Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a
>   writable "sriov_vf_msix_count" file for each VF.  Management
>   software may use these to learn how many MSI-X vectors are available
>   and to dynamically assign them to VFs before the VFs are passed
>   through to a VM.
>
>   If the PF driver implements the ->sriov_get_vf_total_msix()
>   callback, "sriov_vf_total_msix" contains the total number of MSI-X
>   vectors available for distribution among VFs.
>
>   If no driver is bound to the VF, writing "N" to
>   "sriov_vf_msix_count" uses the PF driver ->sriov_set_msix_vec_count()
>   callback to assign "N" MSI-X vectors to the VF.  When a VF driver
>   subsequently reads the MSI-X Message Control register, it will see
>   the new Table Size "N".

It is extremely helpful, I will copy/paste it to the commit message. Thanks.

>

<...>

> > > > + */
> > > > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > > > +{
> > > > +	struct pci_dev *virtfn;
> > > > +	int id, ret;
> > > > +
> > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > +		return 0;
> > > > +
> > > > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> > >
> > > But I still don't like the fact that we're calling
> > > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > > complication and opportunities for errors.
> >
> > It is not different from any other code that we have in the kernel.
>
> It *is* different.  There is a general rule that drivers should not
> call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
> still true that callers of sysfs_create_files() are very special, and
> I'd prefer not to add another one.

PCI for me is a bus, and bus is the right place to manage sysfs.
But it doesn't matter, we understand each other positions.

>
> > Let's be concrete, can you point to the errors in this code that I
> > should fix?
>
> I'm not saying there are current errors; I'm saying the additional
> code makes errors possible in future code.  For example, we hope that
> other drivers can use these sysfs interfaces, and it's possible they
> may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay()
> correctly.

If not, we will fix it; we just need to ensure that the sysfs name won't
change, everything else is easy to change.

>
> Or there may be races in device addition/removal.  We have current
> issues in this area, e.g., [2], and they're fairly subtle.  I'm not
> saying your patches have these issues; only that extra code makes more
> chances for mistakes and it's more work to validate it.
>
> > > I don't see the advantage of creating these files only when the PF
> > > driver supports this.  The management tools have to deal with
> > > sriov_vf_total_msix == 0 and sriov_vf_msix_count == 0 anyway.
> > > Having the sysfs files not be present at all might be slightly
> > > prettier to the person running "ls", but I'm not sure the code
> > > complication is worth that.
> >
> > It is more than "ls"; right now sriov_numvfs is visible without relation
> > to the driver, even if the driver doesn't implement ".sriov_configure", which
> > IMHO is bad. We didn't want to repeat that.
> >
> > Right now, we have many devices that support SR-IOV, but only a small number
> > of them are capable of rewriting their VF MSI-X table size. We don't want
> > "to punish" them and clutter their sysfs.
>
> I agree, it's clutter, but at least it's just cosmetic clutter (but
> I'm willing to hear discussion about why it's more than cosmetic; see
> below).

It is more than cosmetic and IMHO it is related to the driver role.
This feature is advertised, managed and configured by PF. It is very
natural request that the PF will view/hide those sysfs files.

>
> From the management software point of view, I don't think it matters.
> That software already needs to deal with files that don't exist (on
> old kernels) and files that contain zero (feature not supported or no
> vectors are available).

Right, in v0, I used static approach.

>
> From my point of view, pci_enable_vf_overlay() or
> pci_disable_vfs_overlay() are also clutter, at least compared to
> static sysfs attributes.
>
> > > I see a hint that Alex might have requested this "only visible when PF
> > > driver supports it" functionality, but I don't see that email on
> > > linux-pci, so I missed the background.
> >
> > First version of this patch had static files solution.
> > https://lore.kernel.org/linux-pci/20210103082440.34994-2-leon@kernel.org/#Z30drivers:pci:iov.c
>
> Thanks for the pointer to the patch.  Can you point me to the
> discussion about why we should use the "only visible when PF driver
> supports it" model?

It is hard to pinpoint specific sentence, this discussion is spread
across many emails and I implemented it in v4.

See this request from Alex:
https://lore.kernel.org/linux-pci/20210114170543.143cce49@omen.home.shazbot.org/
and this is my acknowledge:
https://lore.kernel.org/linux-pci/20210116082331.GL944463@unreal/

BTW, I asked more than once how these sysfs knobs should be handled in the PCI/core.

Thanks

>
> Bjorn
>
> [1] https://lore.kernel.org/linux-pci/YBmG7qgIDYIveDfX@kroah.com/
> [2] https://lore.kernel.org/linux-pci/20200716110423.xtfyb3n6tn5ixedh@pali/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-16 16:12       ` Bjorn Helgaas
  2021-02-16 19:58         ` Leon Romanovsky
@ 2021-02-16 20:37         ` Jason Gunthorpe
  1 sibling, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2021-02-16 20:37 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote:
> > >
> > > But I still don't like the fact that we're calling
> > > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > > complication and opportunities for errors.
> > 
> > It is not different from any other code that we have in the kernel.
> 
> It *is* different.  There is a general rule that drivers should not
> call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
> still true that callers of sysfs_create_files() are very special, and
> I'd prefer not to add another one.

I think the point of [1] is that people should be setting up their sysfs in
the struct device attribute groups/etc before doing device_add() and
allowing the driver core to handle everything. This can be done in
a lot of cases, eg we have examples of building a dynamic list of
attributes.

In other cases, calling wrappers like device_create_file() introduces
a bit more type safety, so adding a device_create_files() would be
trivial enough
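
Something along these lines, purely as an untested sketch (there is no
device_create_files() helper in the tree today):

#include <linux/device.h>

static int device_create_files(struct device *dev,
			       const struct device_attribute **attrs)
{
	int i, err;

	for (i = 0; attrs[i]; i++) {
		err = device_create_file(dev, attrs[i]);
		if (err)
			goto undo;
	}
	return 0;

undo:
	while (--i >= 0)
		device_remove_file(dev, attrs[i]);
	return err;
}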

Other places in PCI are using sysfs_create_group() (and there are over
400 calls to this function in all sorts of device drivers):

drivers/pci/msi.c:      ret = sysfs_create_groups(&pdev->dev.kobj, msi_irq_groups);
drivers/pci/p2pdma.c:   error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
drivers/pci/pci-label.c:        return sysfs_create_group(&pdev->dev.kobj, &smbios_attr_group);
drivers/pci/pci-label.c:        return sysfs_create_group(&pdev->dev.kobj, &acpi_attr_group);

For post-driver_add() stuff, maybe this should do the same, a
"sriov_vf/" group?

And a minor cleanup would change these to use device_create_bin_file():

drivers/pci/pci-sysfs.c:        retval = sysfs_create_bin_file(&pdev->dev.kobj, res_attr);
drivers/pci/pci-sysfs.c:                retval = sysfs_create_bin_file(&pdev->dev.kobj, &pcie_config_attr);
drivers/pci/pci-sysfs.c:                retval = sysfs_create_bin_file(&pdev->dev.kobj, &pci_config_attr);
drivers/pci/pci-sysfs.c:                retval = sysfs_create_bin_file(&pdev->dev.kobj, attr);
drivers/pci/vpd.c:      retval = sysfs_create_bin_file(&dev->dev.kobj, attr);

I haven't worked out why pci_create_firmware_label_files() and all of
this needs to be after device_add() though.. Would be slick to put
that in the normal attribute list - we've got some examples of dynamic
pre-device_add() attribute lists in the tree (eg tpm, rdma) that work
nicely.
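
The pre-device_add() shape is basically this (hypothetical "foo" names,
just to show the pattern):

#include <linux/device.h>
#include <linux/sysfs.h>

static ssize_t foo_show(struct device *dev, struct device_attribute *attr,
			char *buf)
{
	return sysfs_emit(buf, "42\n");
}
static DEVICE_ATTR_RO(foo);

static struct attribute *foo_attrs[] = {
	&dev_attr_foo.attr,
	NULL,
};
ATTRIBUTE_GROUPS(foo);

static int foo_device_register(struct device *dev)
{
	/* attach before device_add(); the core creates/removes the files */
	dev->groups = foo_groups;
	return device_add(dev);
}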

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-16 19:58         ` Leon Romanovsky
@ 2021-02-17 18:02           ` Bjorn Helgaas
  2021-02-17 19:25             ` Jason Gunthorpe
  2021-02-18 10:15             ` Leon Romanovsky
  0 siblings, 2 replies; 32+ messages in thread
From: Bjorn Helgaas @ 2021-02-17 18:02 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman,
	Krzysztof Wilczyński

[+cc Greg in case he wants to chime in on the sysfs discussion.
TL;DR: we're trying to add/remove sysfs files when a PCI driver that
supports certain callbacks binds or unbinds; series at
https://lore.kernel.org/r/20210209133445.700225-1-leon@kernel.org]

On Tue, Feb 16, 2021 at 09:58:25PM +0200, Leon Romanovsky wrote:
> On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote:
> > On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote:
> > > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> > > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>

> > > > > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > > > > +{
> > > > > +	struct pci_dev *virtfn;
> > > > > +	int id, ret;
> > > > > +
> > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > +		return 0;
> > > > > +
> > > > > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> > > >
> > > > But I still don't like the fact that we're calling
> > > > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > > > complication and opportunities for errors.
> > >
> > > It is not different from any other code that we have in the kernel.
> >
> > It *is* different.  There is a general rule that drivers should not
> > call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
> > still true that callers of sysfs_create_files() are very special, and
> > I'd prefer not to add another one.
> 
> PCI for me is a bus, and bus is the right place to manage sysfs.
> But it doesn't matter, we understand each other positions.
> 
> > > Let's be concrete, can you point to the errors in this code that I
> > > should fix?
> >
> > I'm not saying there are current errors; I'm saying the additional
> > code makes errors possible in future code.  For example, we hope that
> > other drivers can use these sysfs interfaces, and it's possible they
> > may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay()
> > correctly.
> 
> > If not, we will fix it; we just need to ensure that the sysfs name won't
> > change, everything else is easy to change.
> 
> > Or there may be races in device addition/removal.  We have current
> > issues in this area, e.g., [2], and they're fairly subtle.  I'm not
> > saying your patches have these issues; only that extra code makes more
> > chances for mistakes and it's more work to validate it.
> >
> > > > I don't see the advantage of creating these files only when
> > > > the PF driver supports this.  The management tools have to
> > > > deal with sriov_vf_total_msix == 0 and sriov_vf_msix_count ==
> > > > 0 anyway.  Having the sysfs files not be present at all might
> > > > be slightly prettier to the person running "ls", but I'm not
> > > > sure the code complication is worth that.
> > >
> > > > It is more than "ls"; right now sriov_numvfs is visible without
> > > > relation to the driver, even if the driver doesn't implement
> > > > ".sriov_configure", which IMHO is bad. We didn't want to repeat that.
> > > >
> > > > Right now, we have many devices that support SR-IOV, but only a small
> > > > number of them are capable of rewriting their VF MSI-X table size.
> > > > We don't want "to punish" them and clutter their sysfs.
> >
> > I agree, it's clutter, but at least it's just cosmetic clutter
> > (but I'm willing to hear discussion about why it's more than
> > cosmetic; see below).
> 
> It is more than cosmetic and IMHO it is related to the driver role.
> This feature is advertised, managed and configured by PF. It is very
> natural request that the PF will view/hide those sysfs files.

Agreed, it's natural if the PF driver adds/removes those files.  But I
don't think it's *essential*, and they *could* be static because of
this:

> > From the management software point of view, I don't think it matters.
> > That software already needs to deal with files that don't exist (on
> > old kernels) and files that contain zero (feature not supported or no
> > vectors are available).

I wonder if sysfs_update_group() would let us have our cake and eat
it, too?  Maybe we could define these files as static attributes and
call sysfs_update_group() when the PF driver binds or unbinds?

Makes me wonder if the device core could call sysfs_update_group()
when binding/unbinding drivers.  But there are only a few existing
callers, and it looks like none of them are for the bind/unbind
situation, so maybe that would be pointless.
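
Roughly what I'm imagining, as an untested sketch (the attribute and the
hook that calls it are hypothetical; sysfs_update_group() and .is_visible
are the only existing machinery, and the callback name is the one from the
draft above):

#include <linux/pci.h>

/* assume a DEVICE_ATTR_RO(sriov_vf_total_msix) is defined as in the series */
static struct attribute *sriov_pf_msix_attrs[] = {
	&dev_attr_sriov_vf_total_msix.attr,
	NULL,
};

static umode_t sriov_pf_msix_visible(struct kobject *kobj,
				     struct attribute *a, int n)
{
	struct device *dev = kobj_to_dev(kobj);
	struct pci_dev *pdev = to_pci_dev(dev);

	if (pdev->is_physfn && dev->driver &&
	    to_pci_driver(dev->driver)->sriov_get_vf_total_msix)
		return a->mode;
	return 0;
}

static const struct attribute_group sriov_pf_msix_group = {
	.attrs	    = sriov_pf_msix_attrs,
	.is_visible = sriov_pf_msix_visible,
};

/* called from the (hand-waved) bind/unbind hook */
void pci_sriov_msix_update_visibility(struct pci_dev *pdev)
{
	sysfs_update_group(&pdev->dev.kobj, &sriov_pf_msix_group);
}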

> > From my point of view, pci_enable_vf_overlay() or
> > pci_disable_vfs_overlay() are also clutter, at least compared to
> > static sysfs attributes.
> >
> > > > I see a hint that Alex might have requested this "only visible when PF
> > > > driver supports it" functionality, but I don't see that email on
> > > > linux-pci, so I missed the background.
> > >
> > > First version of this patch had static files solution.
> > > https://lore.kernel.org/linux-pci/20210103082440.34994-2-leon@kernel.org/#Z30drivers:pci:iov.c
> >
> > Thanks for the pointer to the patch.  Can you point me to the
> > discussion about why we should use the "only visible when PF driver
> > supports it" model?
> 
> It is hard to pinpoint specific sentence, this discussion is spread
> across many emails and I implemented it in v4.
> 
> See this request from Alex:
> https://lore.kernel.org/linux-pci/20210114170543.143cce49@omen.home.shazbot.org/
> and this is my acknowledge:
> https://lore.kernel.org/linux-pci/20210116082331.GL944463@unreal/
> 
> BTW, I asked more than once how these sysfs knobs should be handled
> in the PCI/core.

Thanks for the pointers.  This is the first instance I can think of
where we want to create PCI core sysfs files based on a driver
binding, so there really isn't a precedent.

> > [1] https://lore.kernel.org/linux-pci/YBmG7qgIDYIveDfX@kroah.com/
> > [2] https://lore.kernel.org/linux-pci/20200716110423.xtfyb3n6tn5ixedh@pali/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-17 18:02           ` Bjorn Helgaas
@ 2021-02-17 19:25             ` Jason Gunthorpe
  2021-02-17 20:28               ` Bjorn Helgaas
  2021-02-18 10:15             ` Leon Romanovsky
  1 sibling, 1 reply; 32+ messages in thread
From: Jason Gunthorpe @ 2021-02-17 19:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman,
	Krzysztof Wilczyński

On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:

> > BTW, I asked more than once how these sysfs knobs should be handled
> > in the PCI/core.
> 
> Thanks for the pointers.  This is the first instance I can think of
> where we want to create PCI core sysfs files based on a driver
> binding, so there really isn't a precedent.

The MSI stuff does it today, doesn't it? eg:

virtblk_probe (this is a driver bind)
  init_vq
   virtio_find_vqs
    vp_modern_find_vqs
     vp_find_vqs
      vp_find_vqs_msix
       vp_request_msix_vectors
        pci_alloc_irq_vectors_affinity
         __pci_enable_msi_range
          msi_capability_init
	   populate_msi_sysfs
	    	ret = sysfs_create_groups(&pdev->dev.kobj, msi_irq_groups);

And the sysfs is removed during pci_disable_msi(), also called by the
driver

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-17 19:25             ` Jason Gunthorpe
@ 2021-02-17 20:28               ` Bjorn Helgaas
  2021-02-17 23:52                 ` Jason Gunthorpe
  0 siblings, 1 reply; 32+ messages in thread
From: Bjorn Helgaas @ 2021-02-17 20:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman,
	Krzysztof Wilczyński

On Wed, Feb 17, 2021 at 03:25:22PM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:
> 
> > > BTW, I asked more than once how these sysfs knobs should be handled
> > > in the PCI/core.
> > 
> > Thanks for the pointers.  This is the first instance I can think of
> > where we want to create PCI core sysfs files based on a driver
> > binding, so there really isn't a precedent.
> 
> The MSI stuff does it today, doesn't it? eg:
> 
> virtblk_probe (this is a driver bind)
>   init_vq
>    virtio_find_vqs
>     vp_modern_find_vqs
>      vp_find_vqs
>       vp_find_vqs_msix
>        vp_request_msix_vectors
>         pci_alloc_irq_vectors_affinity
>          __pci_enable_msi_range
>           msi_capability_init
> 	   populate_msi_sysfs
> 	    	ret = sysfs_create_groups(&pdev->dev.kobj, msi_irq_groups);
> 
> And the sysfs is removed during pci_disable_msi(), also called by the
> driver

Yes, you're right, I didn't notice that one.

I'm not quite convinced that we clean up correctly in all cases --
pci_disable_msix(), pci_disable_msi(), pci_free_irq_vectors(),
pcim_release(), etc are called by several drivers, but in my quick
look I didn't see a guaranteed-to-be-called path to the cleanup during
driver unbind.  I probably just missed it.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-17 20:28               ` Bjorn Helgaas
@ 2021-02-17 23:52                 ` Jason Gunthorpe
  0 siblings, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2021-02-17 23:52 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman,
	Krzysztof Wilczyński

On Wed, Feb 17, 2021 at 02:28:35PM -0600, Bjorn Helgaas wrote:
> On Wed, Feb 17, 2021 at 03:25:22PM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:
> > 
> > > > BTW, I asked more than once how these sysfs knobs should be handled
> > > > in the PCI/core.
> > > 
> > > Thanks for the pointers.  This is the first instance I can think of
> > > where we want to create PCI core sysfs files based on a driver
> > > binding, so there really isn't a precedent.
> > 
> > The MSI stuff does it today, doesn't it? eg:
> > 
> > virtblk_probe (this is a driver bind)
> >   init_vq
> >    virtio_find_vqs
> >     vp_modern_find_vqs
> >      vp_find_vqs
> >       vp_find_vqs_msix
> >        vp_request_msix_vectors
> >         pci_alloc_irq_vectors_affinity
> >          __pci_enable_msi_range
> >           msi_capability_init
> > 	   populate_msi_sysfs
> > 	    	ret = sysfs_create_groups(&pdev->dev.kobj, msi_irq_groups);
> > 
> > And the sysfs is removed during pci_disable_msi(), also called by the
> > driver
> 
> Yes, you're right, I didn't notice that one.
> 
> I'm not quite convinced that we clean up correctly in all cases --
> pci_disable_msix(), pci_disable_msi(), pci_free_irq_vectors(),
> pcim_release(), etc are called by several drivers, but in my quick
> look I didn't see a guaranteed-to-be-called path to the cleanup during
> driver unbind.  I probably just missed it.
 
I think the contract is the driver has to pair the msi enable with the
msi disable on its own? It is very similar to what is happening here.
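
i.e. the usual pairing, sketched with a made-up "foo" driver (the MSI sysfs
entries created under msi_capability_init() go away on the disable side):

#include <linux/pci.h>

static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int nvec;

	nvec = pci_alloc_irq_vectors(pdev, 1, 8, PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (nvec < 0)
		return nvec;

	/* ... request_irq(pci_irq_vector(pdev, i), ...) for each vector ... */
	return 0;
}

static void foo_remove(struct pci_dev *pdev)
{
	/* ... free_irq() for each vector ... */
	pci_free_irq_vectors(pdev);	/* also tears down the msi_irqs/ sysfs */
}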

Probably there are bugs in drivers on error paths, but there are
always bugs in drivers on error paths..

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-17 18:02           ` Bjorn Helgaas
  2021-02-17 19:25             ` Jason Gunthorpe
@ 2021-02-18 10:15             ` Leon Romanovsky
  2021-02-18 22:39               ` Bjorn Helgaas
  1 sibling, 1 reply; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-18 10:15 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman,
	Krzysztof Wilczyński

On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:
> [+cc Greg in case he wants to chime in on the sysfs discussion.
> TL;DR: we're trying to add/remove sysfs files when a PCI driver that
> supports certain callbacks binds or unbinds; series at
> https://lore.kernel.org/r/20210209133445.700225-1-leon@kernel.org]
>
> On Tue, Feb 16, 2021 at 09:58:25PM +0200, Leon Romanovsky wrote:
> > On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote:
> > > On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote:
> > > > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > > > > > From: Leon Romanovsky <leonro@nvidia.com>
>
> > > > > > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > > > > > +{
> > > > > > +	struct pci_dev *virtfn;
> > > > > > +	int id, ret;
> > > > > > +
> > > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> > > > >
> > > > > But I still don't like the fact that we're calling
> > > > > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > > > > complication and opportunities for errors.
> > > >
> > > > It is not different from any other code that we have in the kernel.
> > >
> > > It *is* different.  There is a general rule that drivers should not
> > > call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
> > > still true that callers of sysfs_create_files() are very special, and
> > > I'd prefer not to add another one.
> >
> > PCI for me is a bus, and bus is the right place to manage sysfs.
> > But it doesn't matter, we understand each other positions.
> >
> > > > Let's be concrete, can you point to the errors in this code that I
> > > > should fix?
> > >
> > > I'm not saying there are current errors; I'm saying the additional
> > > code makes errors possible in future code.  For example, we hope that
> > > other drivers can use these sysfs interfaces, and it's possible they
> > > may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay()
> > > correctly.
> >
> > If not, we will fix it; we just need to ensure that the sysfs name won't
> > change, everything else is easy to change.
> >
> > > Or there may be races in device addition/removal.  We have current
> > > issues in this area, e.g., [2], and they're fairly subtle.  I'm not
> > > saying your patches have these issues; only that extra code makes more
> > > chances for mistakes and it's more work to validate it.
> > >
> > > > > I don't see the advantage of creating these files only when
> > > > > the PF driver supports this.  The management tools have to
> > > > > deal with sriov_vf_total_msix == 0 and sriov_vf_msix_count ==
> > > > > 0 anyway.  Having the sysfs files not be present at all might
> > > > > be slightly prettier to the person running "ls", but I'm not
> > > > > sure the code complication is worth that.
> > > >
> > > > It is more than "ls"; right now sriov_numvfs is visible without
> > > > relation to the driver, even if the driver doesn't implement
> > > > ".sriov_configure", which IMHO is bad. We didn't want to repeat that.
> > > >
> > > > Right now, we have many devices that support SR-IOV, but only a small
> > > > number of them are capable of rewriting their VF MSI-X table size.
> > > > We don't want "to punish" them and clutter their sysfs.
> > >
> > > I agree, it's clutter, but at least it's just cosmetic clutter
> > > (but I'm willing to hear discussion about why it's more than
> > > cosmetic; see below).
> >
> > It is more than cosmetic and IMHO it is related to the driver role.
> > This feature is advertised, managed and configured by PF. It is very
> > natural request that the PF will view/hide those sysfs files.
>
> Agreed, it's natural if the PF driver adds/removes those files.  But I
> don't think it's *essential*, and they *could* be static because of
> this:
>
> > > From the management software point of view, I don't think it matters.
> > > That software already needs to deal with files that don't exist (on
> > > old kernels) and files that contain zero (feature not supported or no
> > > vectors are available).
>
> I wonder if sysfs_update_group() would let us have our cake and eat
> it, too?  Maybe we could define these files as static attributes and
> call sysfs_update_group() when the PF driver binds or unbinds?
>
> Makes me wonder if the device core could call sysfs_update_group()
> when binding/unbinding drivers.  But there are only a few existing
> callers, and it looks like none of them are for the bind/unbind
> situation, so maybe that would be pointless.

Also, it will not be an easy task to do it in driver/core. Our attributes need to be
visible if the driver is bound -> we will call sysfs_update_group() after the
->bind() callback. It means that in unwind, we will call sysfs_update_group() before
->unbind() while the driver is still bound. So checking whether the driver
provides is_supported() or not won't be possible.

So I tried something similar in the bus/pci code and it was super hacky -
SR-IOV code in the generic PCI path.

BTW, I found the sentence which sent me to do the directory layout.
https://lore.kernel.org/linux-pci/20210110150727.1965295-1-leon@kernel.org/T/#u
-------------------------------------------------------------------------------
> In addition you could probably even create a directory on the PF with
> the new control you had added for getting the master count as well as
> look at adding symlinks to the VF files so that you could manage all
> of the resources in one spot. That would result in the controls being
> nicely organized and easy to use.

Thanks for your inputs.

I'll try offline different variants and will post v4 soon.
---------------------------------------------------------------------------------
>
> > > From my point of view, pci_enable_vf_overlay() or
> > > pci_disable_vfs_overlay() are also clutter, at least compared to
> > > static sysfs attributes.
> > >
> > > > > I see a hint that Alex might have requested this "only visible when PF
> > > > > driver supports it" functionality, but I don't see that email on
> > > > > linux-pci, so I missed the background.
> > > >
> > > > First version of this patch had static files solution.
> > > > https://lore.kernel.org/linux-pci/20210103082440.34994-2-leon@kernel.org/#Z30drivers:pci:iov.c
> > >
> > > Thanks for the pointer to the patch.  Can you point me to the
> > > discussion about why we should use the "only visible when PF driver
> > > supports it" model?
> >
> > It is hard to pinpoint specific sentence, this discussion is spread
> > across many emails and I implemented it in v4.
> >
> > See this request from Alex:
> > https://lore.kernel.org/linux-pci/20210114170543.143cce49@omen.home.shazbot.org/
> > and this is my acknowledge:
> > https://lore.kernel.org/linux-pci/20210116082331.GL944463@unreal/
> >
> > BTW, I asked more than once how these sysfs knobs should be handled
> > in the PCI/core.
>
> Thanks for the pointers.  This is the first instance I can think of
> where we want to create PCI core sysfs files based on a driver
> binding, so there really isn't a precedent.

It is always nice to be first :).

>
> > > [1] https://lore.kernel.org/linux-pci/YBmG7qgIDYIveDfX@kroah.com/
> > > [2] https://lore.kernel.org/linux-pci/20200716110423.xtfyb3n6tn5ixedh@pali/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-18 10:15             ` Leon Romanovsky
@ 2021-02-18 22:39               ` Bjorn Helgaas
  2021-02-19  7:52                 ` Leon Romanovsky
  0 siblings, 1 reply; 32+ messages in thread
From: Bjorn Helgaas @ 2021-02-18 22:39 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman,
	Krzysztof Wilczyński

On Thu, Feb 18, 2021 at 12:15:51PM +0200, Leon Romanovsky wrote:
> On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:
> > [+cc Greg in case he wants to chime in on the sysfs discussion.
> > TL;DR: we're trying to add/remove sysfs files when a PCI driver that
> > supports certain callbacks binds or unbinds; series at
> > https://lore.kernel.org/r/20210209133445.700225-1-leon@kernel.org]
> >
> > On Tue, Feb 16, 2021 at 09:58:25PM +0200, Leon Romanovsky wrote:
> > > On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote:
> > > > On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote:
> > > > > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> > > > > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > > > > > > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > > > > > > +{
> > > > > > > +	struct pci_dev *virtfn;
> > > > > > > +	int id, ret;
> > > > > > > +
> > > > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > > > +		return 0;
> > > > > > > +
> > > > > > > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> > > > > >
> > > > > > But I still don't like the fact that we're calling
> > > > > > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > > > > > complication and opportunities for errors.
> > > > >
> > > > > It is not different from any other code that we have in the kernel.
> > > >
> > > > It *is* different.  There is a general rule that drivers should not
> > > > call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
> > > > still true that callers of sysfs_create_files() are very special, and
> > > > I'd prefer not to add another one.
> > >
> > > PCI for me is a bus, and bus is the right place to manage sysfs.
> > > But it doesn't matter, we understand each other positions.
> > >
> > > > > Let's be concrete, can you point to the errors in this code that I
> > > > > should fix?
> > > >
> > > > I'm not saying there are current errors; I'm saying the additional
> > > > code makes errors possible in future code.  For example, we hope that
> > > > other drivers can use these sysfs interfaces, and it's possible they
> > > > may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay()
> > > > correctly.
> > >
> > > If not, we will fix it; we just need to ensure that the sysfs name won't
> > > change, everything else is easy to change.
> > >
> > > > Or there may be races in device addition/removal.  We have current
> > > > issues in this area, e.g., [2], and they're fairly subtle.  I'm not
> > > > saying your patches have these issues; only that extra code makes more
> > > > chances for mistakes and it's more work to validate it.
> > > >
> > > > > > I don't see the advantage of creating these files only when
> > > > > > the PF driver supports this.  The management tools have to
> > > > > > deal with sriov_vf_total_msix == 0 and sriov_vf_msix_count ==
> > > > > > 0 anyway.  Having the sysfs files not be present at all might
> > > > > > be slightly prettier to the person running "ls", but I'm not
> > > > > > sure the code complication is worth that.
> > > > >
> > > > > It is more than "ls"; right now sriov_numvfs is visible without
> > > > > relation to the driver, even if the driver doesn't implement
> > > > > ".sriov_configure", which IMHO is bad. We didn't want to repeat that.
> > > > >
> > > > > Right now, we have many devices that support SR-IOV, but only a small
> > > > > number of them are capable of rewriting their VF MSI-X table size.
> > > > > We don't want "to punish" them and clutter their sysfs.
> > > >
> > > > I agree, it's clutter, but at least it's just cosmetic clutter
> > > > (but I'm willing to hear discussion about why it's more than
> > > > cosmetic; see below).
> > >
> > > It is more than cosmetic and IMHO it is related to the driver role.
> > > This feature is advertised, managed and configured by PF. It is very
> > > natural request that the PF will view/hide those sysfs files.
> >
> > Agreed, it's natural if the PF driver adds/removes those files.  But I
> > don't think it's *essential*, and they *could* be static because of
> > this:
> >
> > > > From the management software point of view, I don't think it matters.
> > > > That software already needs to deal with files that don't exist (on
> > > > old kernels) and files that contain zero (feature not supported or no
> > > > vectors are available).
> >
> > I wonder if sysfs_update_group() would let us have our cake and eat
> > it, too?  Maybe we could define these files as static attributes and
> > call sysfs_update_group() when the PF driver binds or unbinds?
> >
> > Makes me wonder if the device core could call sysfs_update_group()
> > when binding/unbinding drivers.  But there are only a few existing
> > callers, and it looks like none of them are for the bind/unbind
> > situation, so maybe that would be pointless.
> 
> Also, it will not be an easy task to do it in driver/core. Our
> attributes need to be visible if the driver is bound -> we will call
> sysfs_update_group() after the ->bind() callback. It means that in
> unwind, we will call sysfs_update_group() before ->unbind() while the
> driver is still bound. So checking whether the driver provides
> is_supported() or not won't be possible.

Poking around some more, I found .dev_groups, which might be
applicable?  The test patch below applies to v5.11 and makes the "bh"
file visible in devices bound to the uhci_hcd driver if the function
number is odd.

This thread has more details and some samples:
https://lore.kernel.org/lkml/20190731124349.4474-1-gregkh@linuxfoundation.org/

On qemu, with 00:1a.[012] and 00:1d.[012] set up as uhci_hcd devices:

  root@ubuntu:~# ls /sys/bus/pci/drivers/uhci_hcd
  0000:00:1a.0  0000:00:1a.2  0000:00:1d.1  bind    new_id     uevent
  0000:00:1a.1  0000:00:1d.0  0000:00:1d.2  module  remove_id  unbind
  root@ubuntu:~# grep . /sys/devices/pci0000:00/0000:00:*/bh /dev/null
  /sys/devices/pci0000:00/0000:00:1a.1/bh:hi bjorn
  /sys/devices/pci0000:00/0000:00:1d.1/bh:hi bjorn

diff --git a/drivers/usb/host/uhci-pci.c b/drivers/usb/host/uhci-pci.c
index 9b88745d247f..17ea5bf0dab0 100644
--- a/drivers/usb/host/uhci-pci.c
+++ b/drivers/usb/host/uhci-pci.c
@@ -297,6 +297,38 @@ static int uhci_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	return usb_hcd_pci_probe(dev, id, &uhci_driver);
 }
 
+static ssize_t bh_show(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	return snprintf(buf, PAGE_SIZE, "hi bjorn\n");
+}
+static DEVICE_ATTR_RO(bh);
+
+static umode_t bh_is_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *pdev = to_pci_dev(dev);
+	umode_t mode = (PCI_FUNC(pdev->devfn) % 2) ? 0444 : 0;
+
+	dev_info(dev, "%s mode %o\n", __func__, mode);
+	return mode;
+}
+
+static struct attribute *bh_attrs[] = {
+	&dev_attr_bh.attr,
+	NULL,
+};
+
+static const struct attribute_group bh_group = {
+	.attrs = bh_attrs,
+	.is_visible = bh_is_visible,
+};
+
+static const struct attribute_group *bh_groups[] = {
+	&bh_group,
+	NULL
+};
+
 static struct pci_driver uhci_pci_driver = {
 	.name =		hcd_name,
 	.id_table =	uhci_pci_ids,
@@ -307,7 +339,8 @@ static struct pci_driver uhci_pci_driver = {
 
 #ifdef CONFIG_PM
 	.driver =	{
-		.pm =	&usb_hcd_pci_pm_ops
+		.pm =	&usb_hcd_pci_pm_ops,
+		.dev_groups = bh_groups,
 	},
 #endif
 };

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-18 22:39               ` Bjorn Helgaas
@ 2021-02-19  7:52                 ` Leon Romanovsky
  2021-02-19  8:20                   ` Greg Kroah-Hartman
  0 siblings, 1 reply; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-19  7:52 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman,
	Krzysztof Wilczyński

On Thu, Feb 18, 2021 at 04:39:50PM -0600, Bjorn Helgaas wrote:
> On Thu, Feb 18, 2021 at 12:15:51PM +0200, Leon Romanovsky wrote:
> > On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:
> > > [+cc Greg in case he wants to chime in on the sysfs discussion.
> > > TL;DR: we're trying to add/remove sysfs files when a PCI driver that
> > > supports certain callbacks binds or unbinds; series at
> > > https://lore.kernel.org/r/20210209133445.700225-1-leon@kernel.org]
> > >
> > > On Tue, Feb 16, 2021 at 09:58:25PM +0200, Leon Romanovsky wrote:
> > > > On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote:
> > > > > > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> > > > > > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > > > > > > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > > > > > > > +{
> > > > > > > > +	struct pci_dev *virtfn;
> > > > > > > > +	int id, ret;
> > > > > > > > +
> > > > > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > > > > +		return 0;
> > > > > > > > +
> > > > > > > > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> > > > > > >
> > > > > > > But I still don't like the fact that we're calling
> > > > > > > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > > > > > > complication and opportunities for errors.
> > > > > >
> > > > > > It is not different from any other code that we have in the kernel.
> > > > >
> > > > > It *is* different.  There is a general rule that drivers should not
> > > > > call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
> > > > > still true that callers of sysfs_create_files() are very special, and
> > > > > I'd prefer not to add another one.
> > > >
> > > > PCI for me is a bus, and bus is the right place to manage sysfs.
> > > > But it doesn't matter, we understand each other positions.
> > > >
> > > > > > Let's be concrete, can you point to the errors in this code that I
> > > > > > should fix?
> > > > >
> > > > > I'm not saying there are current errors; I'm saying the additional
> > > > > code makes errors possible in future code.  For example, we hope that
> > > > > other drivers can use these sysfs interfaces, and it's possible they
> > > > > may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay()
> > > > > correctly.
> > > >
> > > > If not, we will fix it; we just need to ensure that the sysfs name won't
> > > > change, everything else is easy to change.
> > > >
> > > > > Or there may be races in device addition/removal.  We have current
> > > > > issues in this area, e.g., [2], and they're fairly subtle.  I'm not
> > > > > saying your patches have these issues; only that extra code makes more
> > > > > chances for mistakes and it's more work to validate it.
> > > > >
> > > > > > > I don't see the advantage of creating these files only when
> > > > > > > the PF driver supports this.  The management tools have to
> > > > > > > deal with sriov_vf_total_msix == 0 and sriov_vf_msix_count ==
> > > > > > > 0 anyway.  Having the sysfs files not be present at all might
> > > > > > > be slightly prettier to the person running "ls", but I'm not
> > > > > > > sure the code complication is worth that.
> > > > > >
> > > > > > It is more than "ls"; right now sriov_numvfs is visible without
> > > > > > relation to the driver, even if the driver doesn't implement
> > > > > > ".sriov_configure", which IMHO is bad. We didn't want to repeat that.
> > > > > >
> > > > > > Right now, we have many devices that support SR-IOV, but only a small
> > > > > > number of them are capable of rewriting their VF MSI-X table size.
> > > > > > We don't want "to punish" them and clutter their sysfs.
> > > > >
> > > > > I agree, it's clutter, but at least it's just cosmetic clutter
> > > > > (but I'm willing to hear discussion about why it's more than
> > > > > cosmetic; see below).
> > > >
> > > > It is more than cosmetic and IMHO it is related to the driver role.
> > > > This feature is advertised, managed and configured by PF. It is very
> > > > natural request that the PF will view/hide those sysfs files.
> > >
> > > Agreed, it's natural if the PF driver adds/removes those files.  But I
> > > don't think it's *essential*, and they *could* be static because of
> > > this:
> > >
> > > > > From the management software point of view, I don't think it matters.
> > > > > That software already needs to deal with files that don't exist (on
> > > > > old kernels) and files that contain zero (feature not supported or no
> > > > > vectors are available).
> > >
> > > I wonder if sysfs_update_group() would let us have our cake and eat
> > > it, too?  Maybe we could define these files as static attributes and
> > > call sysfs_update_group() when the PF driver binds or unbinds?
> > >
> > > Makes me wonder if the device core could call sysfs_update_group()
> > > when binding/unbinding drivers.  But there are only a few existing
> > > callers, and it looks like none of them are for the bind/unbind
> > > situation, so maybe that would be pointless.
> >
> > Also, it will not be an easy task to do it in driver/core. Our
> > attributes need to be visible if the driver is bound -> we will call
> > sysfs_update_group() after the ->bind() callback. It means that in
> > unwind, we will call sysfs_update_group() before ->unbind() while the
> > driver is still bound. So checking whether the driver provides
> > is_supported() or not won't be possible.
>
> Poking around some more, I found .dev_groups, which might be
> applicable?  The test patch below applies to v5.11 and makes the "bh"
> file visible in devices bound to the uhci_hcd driver if the function
> number is odd.

This solution can be applicable for generic drivers where we can afford
to have custom sysfs files per driver. In our case, we are talking
about hardware device drivers. Both RDMA and netdev are against allowing
such drivers to create their own sysfs. It will be a real nightmare to
have different names/layouts/outputs for the same functionality.

This .dev_groups approach moves responsibility over sysfs to the drivers,
and that is a no-go for us.

Another problem with this approach is the addition of VFs: not only will
every driver start to manage its own sysfs, but it will also need to iterate
over the PCI bus or internal lists to find the VFs, because we want to create
.set_msix_vec on the VFs after the PF is bound.

So instead of one controlled place, we will find ourselves with many
genius implementations of the same thing in the drivers.

Bjorn, we really do a standard enable/disable flow with our overlay thing.

Thanks

>
> This thread has more details and some samples:
> https://lore.kernel.org/lkml/20190731124349.4474-1-gregkh@linuxfoundation.org/
>
> On qemu, with 00:1a.[012] and 00:1d.[012] set up as uhci_hcd devices:
>
>   root@ubuntu:~# ls /sys/bus/pci/drivers/uhci_hcd
>   0000:00:1a.0  0000:00:1a.2  0000:00:1d.1  bind    new_id     uevent
>   0000:00:1a.1  0000:00:1d.0  0000:00:1d.2  module  remove_id  unbind
>   root@ubuntu:~# grep . /sys/devices/pci0000:00/0000:00:*/bh /dev/null
>   /sys/devices/pci0000:00/0000:00:1a.1/bh:hi bjorn
>   /sys/devices/pci0000:00/0000:00:1d.1/bh:hi bjorn
>
> diff --git a/drivers/usb/host/uhci-pci.c b/drivers/usb/host/uhci-pci.c
> index 9b88745d247f..17ea5bf0dab0 100644
> --- a/drivers/usb/host/uhci-pci.c
> +++ b/drivers/usb/host/uhci-pci.c
> @@ -297,6 +297,38 @@ static int uhci_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
>  	return usb_hcd_pci_probe(dev, id, &uhci_driver);
>  }
>
> +static ssize_t bh_show(struct device *dev, struct device_attribute *attr,
> +			char *buf)
> +{
> +	return snprintf(buf, PAGE_SIZE, "hi bjorn\n");
> +}
> +static DEVICE_ATTR_RO(bh);
> +
> +static umode_t bh_is_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	umode_t mode = (PCI_FUNC(pdev->devfn) % 2) ? 0444 : 0;
> +
> +	dev_info(dev, "%s mode %o\n", __func__, mode);
> +	return mode;
> +}
> +
> +static struct attribute *bh_attrs[] = {
> +	&dev_attr_bh.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group bh_group = {
> +	.attrs = bh_attrs,
> +	.is_visible = bh_is_visible,
> +};
> +
> +static const struct attribute_group *bh_groups[] = {
> +	&bh_group,
> +	NULL
> +};
> +
>  static struct pci_driver uhci_pci_driver = {
>  	.name =		hcd_name,
>  	.id_table =	uhci_pci_ids,
> @@ -307,7 +339,8 @@ static struct pci_driver uhci_pci_driver = {
>
>  #ifdef CONFIG_PM
>  	.driver =	{
> -		.pm =	&usb_hcd_pci_pm_ops
> +		.pm =	&usb_hcd_pci_pm_ops,
> +		.dev_groups = bh_groups,
>  	},
>  #endif
>  };

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-19  7:52                 ` Leon Romanovsky
@ 2021-02-19  8:20                   ` Greg Kroah-Hartman
  2021-02-19 16:58                     ` Leon Romanovsky
  2021-02-20 19:06                     ` Bjorn Helgaas
  0 siblings, 2 replies; 32+ messages in thread
From: Greg Kroah-Hartman @ 2021-02-19  8:20 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Krzysztof Wilczyński

On Fri, Feb 19, 2021 at 09:52:12AM +0200, Leon Romanovsky wrote:
> On Thu, Feb 18, 2021 at 04:39:50PM -0600, Bjorn Helgaas wrote:
> > On Thu, Feb 18, 2021 at 12:15:51PM +0200, Leon Romanovsky wrote:
> > > On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:
> > > > [+cc Greg in case he wants to chime in on the sysfs discussion.
> > > > TL;DR: we're trying to add/remove sysfs files when a PCI driver that
> > > > supports certain callbacks binds or unbinds; series at
> > > > https://lore.kernel.org/r/20210209133445.700225-1-leon@kernel.org]
> > > >
> > > > On Tue, Feb 16, 2021 at 09:58:25PM +0200, Leon Romanovsky wrote:
> > > > > On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote:
> > > > > > On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote:
> > > > > > > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> > > > > > > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > > > > > > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > > > > > > > > +{
> > > > > > > > > +	struct pci_dev *virtfn;
> > > > > > > > > +	int id, ret;
> > > > > > > > > +
> > > > > > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > > > > > +		return 0;
> > > > > > > > > +
> > > > > > > > > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> > > > > > > >
> > > > > > > > But I still don't like the fact that we're calling
> > > > > > > > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > > > > > > > complication and opportunities for errors.
> > > > > > >
> > > > > > > It is not different from any other code that we have in the kernel.
> > > > > >
> > > > > > It *is* different.  There is a general rule that drivers should not
> > > > > > call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
> > > > > > still true that callers of sysfs_create_files() are very special, and
> > > > > > I'd prefer not to add another one.
> > > > >
> > > > > PCI for me is a bus, and bus is the right place to manage sysfs.
> > > > > But it doesn't matter, we understand each other positions.
> > > > >
> > > > > > > Let's be concrete, can you point to the errors in this code that I
> > > > > > > should fix?
> > > > > >
> > > > > > I'm not saying there are current errors; I'm saying the additional
> > > > > > code makes errors possible in future code.  For example, we hope that
> > > > > > other drivers can use these sysfs interfaces, and it's possible they
> > > > > > may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay()
> > > > > > correctly.
> > > > >
> > > > > If not, we will fix, we just need is to ensure that sysfs name won't
> > > > > change, everything else is easy to change.
> > > > >
> > > > > > Or there may be races in device addition/removal.  We have current
> > > > > > issues in this area, e.g., [2], and they're fairly subtle.  I'm not
> > > > > > saying your patches have these issues; only that extra code makes more
> > > > > > chances for mistakes and it's more work to validate it.
> > > > > >
> > > > > > > > I don't see the advantage of creating these files only when
> > > > > > > > the PF driver supports this.  The management tools have to
> > > > > > > > deal with sriov_vf_total_msix == 0 and sriov_vf_msix_count ==
> > > > > > > > 0 anyway.  Having the sysfs files not be present at all might
> > > > > > > > be slightly prettier to the person running "ls", but I'm not
> > > > > > > > sure the code complication is worth that.
> > > > > > >
> > > > > > > It is more than "ls", right now sriov_numvfs is visible without
> > > > > > > relation to the driver, even if driver doesn't implement
> > > > > > > ".sriov_configure", which IMHO bad. We didn't want to repeat.
> > > > > > >
> > > > > > > Right now, we have many devices that supports SR-IOV, but small
> > > > > > > amount of them are capable to rewrite their VF MSI-X table siz.
> > > > > > > We don't want "to punish" and clatter their sysfs.
> > > > > >
> > > > > > I agree, it's clutter, but at least it's just cosmetic clutter
> > > > > > (but I'm willing to hear discussion about why it's more than
> > > > > > cosmetic; see below).
> > > > >
> > > > > It is more than cosmetic and IMHO it is related to the driver role.
> > > > > This feature is advertised, managed and configured by PF. It is very
> > > > > natural request that the PF will view/hide those sysfs files.
> > > >
> > > > Agreed, it's natural if the PF driver adds/removes those files.  But I
> > > > don't think it's *essential*, and they *could* be static because of
> > > > this:
> > > >
> > > > > > From the management software point of view, I don't think it matters.
> > > > > > That software already needs to deal with files that don't exist (on
> > > > > > old kernels) and files that contain zero (feature not supported or no
> > > > > > vectors are available).
> > > >
> > > > I wonder if sysfs_update_group() would let us have our cake and eat
> > > > it, too?  Maybe we could define these files as static attributes and
> > > > call sysfs_update_group() when the PF driver binds or unbinds?
> > > >
> > > > Makes me wonder if the device core could call sysfs_update_group()
> > > > when binding/unbinding drivers.  But there are only a few existing
> > > > callers, and it looks like none of them are for the bind/unbind
> > > > situation, so maybe that would be pointless.
> > >
> > > Also it will be not an easy task to do it in driver/core. Our
> > > attributes need to be visible if driver is bound -> we will call to
> > > sysfs_update_group() after ->bind() callback. It means that in
> > > uwind, we will call to sysfs_update_group() before ->unbind() and
> > > the driver will be still bound. So the check is is_supported() for
> > > driver exists/or not won't be possible.
> >
> > Poking around some more, I found .dev_groups, which might be
> > applicable?  The test patch below applies to v5.11 and makes the "bh"
> > file visible in devices bound to the uhci_hcd driver if the function
> > number is odd.
> 
> This solution can be applicable for generic drivers where we can afford
> to have custom sysfs files for this driver. In our case, we are talking
> about hardware device driver. Both RDMA and netdev are against allowing
> for such drivers to create their own sysfs. It will be real nightmare to
> have different names/layout/output for the same functionality.
> 
> This .dev_groups moves responsibility over sysfs to the drivers and it
> is no-go for us.

But it _is_ the driver's responsibility for sysfs files, right?

If not, what exactly are you trying to do here, as I am very confused.

> Another problem with this approach is addition of VFs, not only every
> driver will start to manage its own sysfs, but it will need to iterate
> over PCI bus or internal lists to find VFs, because we want to create
> .set_msix_vec on VFs after PF is bound.

What?  I don't understand at all.

> So instead of one, controlled place, we will find ourselves with many
> genius implementations of the same thing in the drivers.

Same _what_ thing?

> Bjorn, we really do standard enable/disable flow with out overlay thing.

Ok, can you step back and try to explain what problem you are trying to
solve first, before getting bogged down in odd details?  I find it
highly unlikely that this is something "unique", but I could be wrong as
I do not understand what you are wanting to do here at all.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-19  8:20                   ` Greg Kroah-Hartman
@ 2021-02-19 16:58                     ` Leon Romanovsky
  2021-02-20 19:06                     ` Bjorn Helgaas
  1 sibling, 0 replies; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-19 16:58 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Krzysztof Wilczyński

On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
> On Fri, Feb 19, 2021 at 09:52:12AM +0200, Leon Romanovsky wrote:
> > On Thu, Feb 18, 2021 at 04:39:50PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Feb 18, 2021 at 12:15:51PM +0200, Leon Romanovsky wrote:
> > > > On Wed, Feb 17, 2021 at 12:02:39PM -0600, Bjorn Helgaas wrote:
> > > > > [+cc Greg in case he wants to chime in on the sysfs discussion.
> > > > > TL;DR: we're trying to add/remove sysfs files when a PCI driver that
> > > > > supports certain callbacks binds or unbinds; series at
> > > > > https://lore.kernel.org/r/20210209133445.700225-1-leon@kernel.org]
> > > > >
> > > > > On Tue, Feb 16, 2021 at 09:58:25PM +0200, Leon Romanovsky wrote:
> > > > > > On Tue, Feb 16, 2021 at 10:12:12AM -0600, Bjorn Helgaas wrote:
> > > > > > > On Tue, Feb 16, 2021 at 09:33:44AM +0200, Leon Romanovsky wrote:
> > > > > > > > On Mon, Feb 15, 2021 at 03:01:06PM -0600, Bjorn Helgaas wrote:
> > > > > > > > > On Tue, Feb 09, 2021 at 03:34:42PM +0200, Leon Romanovsky wrote:
> > > > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > > > > > > +int pci_enable_vf_overlay(struct pci_dev *dev)
> > > > > > > > > > +{
> > > > > > > > > > +	struct pci_dev *virtfn;
> > > > > > > > > > +	int id, ret;
> > > > > > > > > > +
> > > > > > > > > > +	if (!dev->is_physfn || !dev->sriov->num_VFs)
> > > > > > > > > > +		return 0;
> > > > > > > > > > +
> > > > > > > > > > +	ret = sysfs_create_files(&dev->dev.kobj, sriov_pf_dev_attrs);
> > > > > > > > >
> > > > > > > > > But I still don't like the fact that we're calling
> > > > > > > > > sysfs_create_files() and sysfs_remove_files() directly.  It makes
> > > > > > > > > complication and opportunities for errors.
> > > > > > > >
> > > > > > > > It is not different from any other code that we have in the kernel.
> > > > > > >
> > > > > > > It *is* different.  There is a general rule that drivers should not
> > > > > > > call sysfs_* [1].  The PCI core is arguably not a "driver," but it is
> > > > > > > still true that callers of sysfs_create_files() are very special, and
> > > > > > > I'd prefer not to add another one.
> > > > > >
> > > > > > PCI for me is a bus, and bus is the right place to manage sysfs.
> > > > > > But it doesn't matter, we understand each other positions.
> > > > > >
> > > > > > > > Let's be concrete, can you point to the errors in this code that I
> > > > > > > > should fix?
> > > > > > >
> > > > > > > I'm not saying there are current errors; I'm saying the additional
> > > > > > > code makes errors possible in future code.  For example, we hope that
> > > > > > > other drivers can use these sysfs interfaces, and it's possible they
> > > > > > > may not call pci_enable_vf_overlay() or pci_disable_vfs_overlay()
> > > > > > > correctly.
> > > > > >
> > > > > > If not, we will fix, we just need is to ensure that sysfs name won't
> > > > > > change, everything else is easy to change.
> > > > > >
> > > > > > > Or there may be races in device addition/removal.  We have current
> > > > > > > issues in this area, e.g., [2], and they're fairly subtle.  I'm not
> > > > > > > saying your patches have these issues; only that extra code makes more
> > > > > > > chances for mistakes and it's more work to validate it.
> > > > > > >
> > > > > > > > > I don't see the advantage of creating these files only when
> > > > > > > > > the PF driver supports this.  The management tools have to
> > > > > > > > > deal with sriov_vf_total_msix == 0 and sriov_vf_msix_count ==
> > > > > > > > > 0 anyway.  Having the sysfs files not be present at all might
> > > > > > > > > be slightly prettier to the person running "ls", but I'm not
> > > > > > > > > sure the code complication is worth that.
> > > > > > > >
> > > > > > > > It is more than "ls", right now sriov_numvfs is visible without
> > > > > > > > relation to the driver, even if driver doesn't implement
> > > > > > > > ".sriov_configure", which IMHO bad. We didn't want to repeat.
> > > > > > > >
> > > > > > > > Right now, we have many devices that supports SR-IOV, but small
> > > > > > > > amount of them are capable to rewrite their VF MSI-X table siz.
> > > > > > > > We don't want "to punish" and clatter their sysfs.
> > > > > > >
> > > > > > > I agree, it's clutter, but at least it's just cosmetic clutter
> > > > > > > (but I'm willing to hear discussion about why it's more than
> > > > > > > cosmetic; see below).
> > > > > >
> > > > > > It is more than cosmetic and IMHO it is related to the driver role.
> > > > > > This feature is advertised, managed and configured by PF. It is very
> > > > > > natural request that the PF will view/hide those sysfs files.
> > > > >
> > > > > Agreed, it's natural if the PF driver adds/removes those files.  But I
> > > > > don't think it's *essential*, and they *could* be static because of
> > > > > this:
> > > > >
> > > > > > > From the management software point of view, I don't think it matters.
> > > > > > > That software already needs to deal with files that don't exist (on
> > > > > > > old kernels) and files that contain zero (feature not supported or no
> > > > > > > vectors are available).
> > > > >
> > > > > I wonder if sysfs_update_group() would let us have our cake and eat
> > > > > it, too?  Maybe we could define these files as static attributes and
> > > > > call sysfs_update_group() when the PF driver binds or unbinds?
> > > > >
> > > > > Makes me wonder if the device core could call sysfs_update_group()
> > > > > when binding/unbinding drivers.  But there are only a few existing
> > > > > callers, and it looks like none of them are for the bind/unbind
> > > > > situation, so maybe that would be pointless.
> > > >
> > > > Also it will be not an easy task to do it in driver/core. Our
> > > > attributes need to be visible if driver is bound -> we will call to
> > > > sysfs_update_group() after ->bind() callback. It means that in
> > > > uwind, we will call to sysfs_update_group() before ->unbind() and
> > > > the driver will be still bound. So the check is is_supported() for
> > > > driver exists/or not won't be possible.
> > >
> > > Poking around some more, I found .dev_groups, which might be
> > > applicable?  The test patch below applies to v5.11 and makes the "bh"
> > > file visible in devices bound to the uhci_hcd driver if the function
> > > number is odd.
> >
> > This solution can be applicable for generic drivers where we can afford
> > to have custom sysfs files for this driver. In our case, we are talking
> > about hardware device driver. Both RDMA and netdev are against allowing
> > for such drivers to create their own sysfs. It will be real nightmare to
> > have different names/layout/output for the same functionality.
> >
> > This .dev_groups moves responsibility over sysfs to the drivers and it
> > is no-go for us.
>
> But it _is_ the driver's responsibility for sysfs files, right?

It depends on how you define "responsibility". Direct creation/deletion of
sysfs files is prohibited in the netdev and RDMA subsystems. We want to
provide our users and the stack a uniform way of interacting with the system.

It is super painful to manage a large fleet of NICs and/or HCAs if every
device driver provides something different for the same feature.

>
> If not, what exactly are you trying to do here, as I am very confused.

https://lore.kernel.org/linux-rdma/20210216203726.GH4247@nvidia.com/T/#m899d883c8a10d95959ac0cd2833762f93729b8ef
Please see more details below.

>
> > Another problem with this approach is addition of VFs, not only every
> > driver will start to manage its own sysfs, but it will need to iterate
> > over PCI bus or internal lists to find VFs, because we want to create
> > .set_msix_vec on VFs after PF is bound.
>
> What?  I don't understand at all.
>
> > So instead of one, controlled place, we will find ourselves with many
> > genius implementations of the same thing in the drivers.
>
> Same _what_ thing?

This thread is part of a conversation with Bjorn, where he is looking for
a way to avoid creating these sysfs files in the PCI core.
https://lore.kernel.org/linux-rdma/20210216203726.GH4247@nvidia.com/T/#madc000cf04b5246b450f7183a1d80abdf408a949

https://lore.kernel.org/linux-rdma/20210216203726.GH4247@nvidia.com/T/#Z2e.:..:20210209133445.700225-2-leon::40kernel.org:0drivers:pci:iov.c

>
> > Bjorn, we really do standard enable/disable flow with out overlay thing.
>
> Ok, can you step back and try to explain what problem you are trying to
> solve first, before getting bogged down in odd details?  I find it
> highly unlikely that this is something "unique", but I could be wrong as
> I do not understand what you are wanting to do here at all.

I don't know if you are familiar with SR-IOV concepts; if yes, just skip
the following paragraph.

SR-IOV capable devices have two types of hardware functions that are
visible as PCI devices: physical functions (PFs) and virtual functions
(VFs). Both types have a PCI BDF and a driver which probes them during
initialization. The PF has extra properties and it is the one that
creates (spawns) new VFs (everything according to the PCI-SIG
specification).

This series adds new sysfs files to VFs which are not bound yet (no driver
attached) while the PF driver is loaded. The change to the VFs needs to be
done before their driver is loaded, because the MSI-X table vector size
(the property which we are changing) is used very early in the
initialization sequence.
https://lore.kernel.org/linux-rdma/20210216203726.GH4247@nvidia.com/T/#m899d883c8a10d95959ac0cd2833762f93729b8ef

We have two different flows for supported devices:
1. PF starts and initiates VFs.
2. PF starts and connects to already existing VFs.

So there is nothing "unique" here, as long as this logic is handled by the PCI core.
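
As an illustration only, the PF driver side of those two flows could look
like the sketch below; apart from pci_enable_vf_overlay() and
pci_disable_vfs_overlay(), everything here is a made-up driver name:

/* Flow 1: the PF driver spawns the VFs via .sriov_configure() and then
 * asks the PCI core to expose the overlay files on them.
 */
static int mydrv_sriov_configure(struct pci_dev *pdev, int num_vfs)
{
	int ret;

	if (num_vfs == 0) {
		pci_disable_vfs_overlay(pdev);
		pci_disable_sriov(pdev);
		return 0;
	}

	ret = pci_enable_sriov(pdev, num_vfs);
	if (ret)
		return ret;

	ret = pci_enable_vf_overlay(pdev);
	if (ret) {
		pci_disable_sriov(pdev);
		return ret;
	}
	return num_vfs;
}

/* Flow 2: the VFs already exist (e.g. PF driver reload); the PF driver
 * only enables the overlay when it binds.  pci_enable_vf_overlay() is a
 * no-op when there are no VFs yet.
 */
static int mydrv_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* ... regular device init ... */
	return pci_enable_vf_overlay(pdev);
}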

Thanks

>
> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-19  8:20                   ` Greg Kroah-Hartman
  2021-02-19 16:58                     ` Leon Romanovsky
@ 2021-02-20 19:06                     ` Bjorn Helgaas
  2021-02-21  6:59                       ` Leon Romanovsky
  2021-02-21 13:00                       ` Greg Kroah-Hartman
  1 sibling, 2 replies; 32+ messages in thread
From: Bjorn Helgaas @ 2021-02-20 19:06 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Krzysztof Wilczyński

On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:

> Ok, can you step back and try to explain what problem you are trying to
> solve first, before getting bogged down in odd details?  I find it
> highly unlikely that this is something "unique", but I could be wrong as
> I do not understand what you are wanting to do here at all.

We want to add two new sysfs files:

  sriov_vf_total_msix, for PF devices
  sriov_vf_msix_count, for VF devices associated with the PF

AFAICT it is *acceptable* if they are both present always.  But it
would be *ideal* if they were only present when a driver that
implements the ->sriov_get_vf_total_msix() callback is bound to the
PF.
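
For the *ideal* case, a static attribute group with an .is_visible()
hook could key off the bound PF driver.  A sketch only, assuming the
callback lives in struct pci_driver, that the DEVICE_ATTR_RO()
definition exists elsewhere, and ignoring locking against a concurrent
unbind:

static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj,
					  struct attribute *a, int n)
{
	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));

	/* show the file only on PFs whose bound driver has the callback */
	if (!pdev->is_physfn)
		return 0;
	if (!pdev->driver || !pdev->driver->sriov_get_vf_total_msix)
		return 0;

	return a->mode;
}

static struct attribute *sriov_pf_dev_attrs[] = {
	&dev_attr_sriov_vf_total_msix.attr,
	NULL,
};

static const struct attribute_group sriov_pf_dev_attr_group = {
	.attrs = sriov_pf_dev_attrs,
	.is_visible = sriov_pf_attrs_are_visible,
};

The catch is that .is_visible() is only evaluated when the group is
created (or when sysfs_update_group() is called), so something would
still have to call sysfs_update_group() at bind/unbind time -- which is
the sysfs_update_group() question from earlier in the thread.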

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-20 19:06                     ` Bjorn Helgaas
@ 2021-02-21  6:59                       ` Leon Romanovsky
  2021-02-23 21:07                         ` Bjorn Helgaas
  2021-02-21 13:00                       ` Greg Kroah-Hartman
  1 sibling, 1 reply; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-21  6:59 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Greg Kroah-Hartman, Bjorn Helgaas, Saeed Mahameed,
	Jason Gunthorpe, Alexander Duyck, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson,
	David S . Miller, Krzysztof Wilczyński

On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
> On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
>
> > Ok, can you step back and try to explain what problem you are trying to
> > solve first, before getting bogged down in odd details?  I find it
> > highly unlikely that this is something "unique", but I could be wrong as
> > I do not understand what you are wanting to do here at all.
>
> We want to add two new sysfs files:
>
>   sriov_vf_total_msix, for PF devices
>   sriov_vf_msix_count, for VF devices associated with the PF
>
> AFAICT it is *acceptable* if they are both present always.  But it
> would be *ideal* if they were only present when a driver that
> implements the ->sriov_get_vf_total_msix() callback is bound to the
> PF.

BTW, we already have all possible combinations: static, static with a
folder, with and without the "sriov_" prefix, dynamic with and without
folders on the VFs.

I need to know which version will get an Acked-by, and that is the
version I will resubmit.

Thanks

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-20 19:06                     ` Bjorn Helgaas
  2021-02-21  6:59                       ` Leon Romanovsky
@ 2021-02-21 13:00                       ` Greg Kroah-Hartman
  2021-02-21 13:55                         ` Leon Romanovsky
  1 sibling, 1 reply; 32+ messages in thread
From: Greg Kroah-Hartman @ 2021-02-21 13:00 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Krzysztof Wilczyński

On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
> On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
> 
> > Ok, can you step back and try to explain what problem you are trying to
> > solve first, before getting bogged down in odd details?  I find it
> > highly unlikely that this is something "unique", but I could be wrong as
> > I do not understand what you are wanting to do here at all.
> 
> We want to add two new sysfs files:
> 
>   sriov_vf_total_msix, for PF devices
>   sriov_vf_msix_count, for VF devices associated with the PF
> 
> AFAICT it is *acceptable* if they are both present always.  But it
> would be *ideal* if they were only present when a driver that
> implements the ->sriov_get_vf_total_msix() callback is bound to the
> PF.

Ok, so in the pci bus probe function, if the driver that successfully
binds to the device is of this type, then create the sysfs files.

The driver core will properly emit a KOBJ_BIND message when the driver
is bound to the device, so userspace knows it is now safe to rescan the
device to see any new attributes.
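
For completeness, picking those bind events up from userspace is only a
few lines of libudev; a sketch, assuming a libudev dependency is
acceptable and with error handling trimmed:

#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <libudev.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
	struct pollfd pfd;

	udev_monitor_filter_add_match_subsystem_devtype(mon, "pci", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	for (;;) {
		struct udev_device *dev;
		const char *action;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		dev = udev_monitor_receive_device(mon);
		if (!dev)
			continue;
		action = udev_device_get_action(dev);
		/* ACTION=bind / ACTION=unbind are the driver-core uevents */
		if (action && !strcmp(action, "bind"))
			printf("rescan attributes under %s\n",
			       udev_device_get_syspath(dev));
		udev_device_unref(dev);
	}
}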

Here's some horrible pseudo-patch for where this probably should be
done:


diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index ec44a79e951a..5a854a5e3977 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -307,8 +307,14 @@ static long local_pci_probe(void *_ddi)
 	pm_runtime_get_sync(dev);
 	pci_dev->driver = pci_drv;
 	rc = pci_drv->probe(pci_dev, ddi->id);
-	if (!rc)
+	if (!rc) {
+		/* If PF or FV driver was bound, let's add some more sysfs files */
+		if (pci_drv->is_pf)
+			device_add_groups(pci_dev->dev, pf_groups);
+		if (pci_drv->is_fv)
+			device_add_groups(pci_dev->dev, fv_groups);
 		return rc;
+	}
 	if (rc < 0) {
 		pci_dev->driver = NULL;
 		pm_runtime_put_sync(dev);




Add some proper error handling if device_add_groups() fails, and then do
the same thing to remove the sysfs files when the device is unbound from
the driver, and you should be good to go.
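
The unbind side would be the mirror image -- again pseudo-code only,
hooked into pci_device_remove() with the same made-up flags and groups:

 	struct pci_dev *pci_dev = to_pci_dev(dev);
 	struct pci_driver *drv = pci_dev->driver;
 
 	if (drv) {
+		/* drop the extra sysfs files before the driver goes away */
+		if (drv->is_pf)
+			device_remove_groups(dev, pf_groups);
+		if (drv->is_fv)
+			device_remove_groups(dev, fv_groups);
 		if (drv->remove) {
 			pm_runtime_get_sync(dev);
 			drv->remove(pci_dev);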

Or is this what you all are talking about already and I'm just totally
confused?

thanks,

greg k-h

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-21 13:00                       ` Greg Kroah-Hartman
@ 2021-02-21 13:55                         ` Leon Romanovsky
  2021-02-21 15:01                           ` Greg Kroah-Hartman
  0 siblings, 1 reply; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-21 13:55 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Krzysztof Wilczyński

On Sun, Feb 21, 2021 at 02:00:41PM +0100, Greg Kroah-Hartman wrote:
> On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
> > On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
> >
> > > Ok, can you step back and try to explain what problem you are trying to
> > > solve first, before getting bogged down in odd details?  I find it
> > > highly unlikely that this is something "unique", but I could be wrong as
> > > I do not understand what you are wanting to do here at all.
> >
> > We want to add two new sysfs files:
> >
> >   sriov_vf_total_msix, for PF devices
> >   sriov_vf_msix_count, for VF devices associated with the PF
> >
> > AFAICT it is *acceptable* if they are both present always.  But it
> > would be *ideal* if they were only present when a driver that
> > implements the ->sriov_get_vf_total_msix() callback is bound to the
> > PF.
>
> Ok, so in the pci bus probe function, if the driver that successfully
> binds to the device is of this type, then create the sysfs files.
>
> The driver core will properly emit a KOBJ_BIND message when the driver
> is bound to the device, so userspace knows it is now safe to rescan the
> device to see any new attributes.
>
> Here's some horrible pseudo-patch for where this probably should be
> done:
>
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index ec44a79e951a..5a854a5e3977 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -307,8 +307,14 @@ static long local_pci_probe(void *_ddi)
>  	pm_runtime_get_sync(dev);
>  	pci_dev->driver = pci_drv;
>  	rc = pci_drv->probe(pci_dev, ddi->id);
> -	if (!rc)
> +	if (!rc) {
> +		/* If PF or FV driver was bound, let's add some more sysfs files */
> +		if (pci_drv->is_pf)
> +			device_add_groups(pci_dev->dev, pf_groups);
> +		if (pci_drv->is_fv)
> +			device_add_groups(pci_dev->dev, fv_groups);
>  		return rc;
> +	}
>  	if (rc < 0) {
>  		pci_dev->driver = NULL;
>  		pm_runtime_put_sync(dev);
>
>
>
>
> Add some proper error handling if device_add_groups() fails, and then do
> the same thing to remove the sysfs files when the device is unbound from
> the driver, and you should be good to go.
>
> Or is this what you all are talking about already and I'm just totally
> confused?

There are two different things here. First, we need to add the sysfs files
for the VFs when the PF driver binds, not when the VFs bind.

In your pseudo code, it will look:
  	rc = pci_drv->probe(pci_dev, ddi->id);
 -	if (!rc)
 +	if (!rc) {
 +		/* If PF or FV driver was bound, let's add some more sysfs files */
 +		if (pci_drv->is_pf) {
 +                      int i = 0;
 +			device_add_groups(pci_dev->dev, pf_groups);
 +                      for (i; i < pci_dev->totalVF; i++) {
 +                              struct pci_device vf_dev = find_vf_device(pci_dev, i);
 +
 +				device_add_groups(vf_dev->dev, fv_groups);
 +                      }
 +              }
  		return rc;

Second, the code proposed by me does exactly that, but with a driver
callback that the PF calls during init/uninit.

Thanks

>
> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-21 13:55                         ` Leon Romanovsky
@ 2021-02-21 15:01                           ` Greg Kroah-Hartman
  2021-02-21 15:30                             ` Leon Romanovsky
  0 siblings, 1 reply; 32+ messages in thread
From: Greg Kroah-Hartman @ 2021-02-21 15:01 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Krzysztof Wilczyński

On Sun, Feb 21, 2021 at 03:55:18PM +0200, Leon Romanovsky wrote:
> On Sun, Feb 21, 2021 at 02:00:41PM +0100, Greg Kroah-Hartman wrote:
> > On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
> > > On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
> > >
> > > > Ok, can you step back and try to explain what problem you are trying to
> > > > solve first, before getting bogged down in odd details?  I find it
> > > > highly unlikely that this is something "unique", but I could be wrong as
> > > > I do not understand what you are wanting to do here at all.
> > >
> > > We want to add two new sysfs files:
> > >
> > >   sriov_vf_total_msix, for PF devices
> > >   sriov_vf_msix_count, for VF devices associated with the PF
> > >
> > > AFAICT it is *acceptable* if they are both present always.  But it
> > > would be *ideal* if they were only present when a driver that
> > > implements the ->sriov_get_vf_total_msix() callback is bound to the
> > > PF.
> >
> > Ok, so in the pci bus probe function, if the driver that successfully
> > binds to the device is of this type, then create the sysfs files.
> >
> > The driver core will properly emit a KOBJ_BIND message when the driver
> > is bound to the device, so userspace knows it is now safe to rescan the
> > device to see any new attributes.
> >
> > Here's some horrible pseudo-patch for where this probably should be
> > done:
> >
> >
> > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > index ec44a79e951a..5a854a5e3977 100644
> > --- a/drivers/pci/pci-driver.c
> > +++ b/drivers/pci/pci-driver.c
> > @@ -307,8 +307,14 @@ static long local_pci_probe(void *_ddi)
> >  	pm_runtime_get_sync(dev);
> >  	pci_dev->driver = pci_drv;
> >  	rc = pci_drv->probe(pci_dev, ddi->id);
> > -	if (!rc)
> > +	if (!rc) {
> > +		/* If PF or FV driver was bound, let's add some more sysfs files */
> > +		if (pci_drv->is_pf)
> > +			device_add_groups(pci_dev->dev, pf_groups);
> > +		if (pci_drv->is_fv)
> > +			device_add_groups(pci_dev->dev, fv_groups);
> >  		return rc;
> > +	}
> >  	if (rc < 0) {
> >  		pci_dev->driver = NULL;
> >  		pm_runtime_put_sync(dev);
> >
> >
> >
> >
> > Add some proper error handling if device_add_groups() fails, and then do
> > the same thing to remove the sysfs files when the device is unbound from
> > the driver, and you should be good to go.
> >
> > Or is this what you all are talking about already and I'm just totally
> > confused?
> 
> There are two different things here. First we need to add sysfs files
> for VF as the event of PF driver bind, not for the VF binds.
> 
> In your pseudo code, it will look:
>   	rc = pci_drv->probe(pci_dev, ddi->id);
>  -	if (!rc)
>  +	if (!rc) {
>  +		/* If PF or FV driver was bound, let's add some more sysfs files */
>  +		if (pci_drv->is_pf) {
>  +                      int i = 0;
>  +			device_add_groups(pci_dev->dev, pf_groups);
>  +                      for (i; i < pci_dev->totalVF; i++) {
>  +                              struct pci_device vf_dev = find_vf_device(pci_dev, i);
>  +
>  +				device_add_groups(vf_dev->dev, fv_groups);

Hahaha, no.

You are randomly adding new sysfs files to a _DIFFERENT_ device than the
one that is currently undergoing the probe() call?  That's crazy.  And
will break userspace.

Why would you want that?  The device should ONLY change when the device
that controls it has a driver bound/unbound to it, that should NEVER
cause random other devices on the bus to change state or sysfs files.

>  +                      }
>  +              }
>   		return rc;
> 
> Second, the code proposed by me does that but with driver callback that
> PF calls during init/uninit.

That works too, but really, why not just have the pci core do it for
you?  That way you do not have to go and modify each and every PCI
driver to get this type of support.  PCI core things belong in the PCI
core, not in each individual driver.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-21 15:01                           ` Greg Kroah-Hartman
@ 2021-02-21 15:30                             ` Leon Romanovsky
  0 siblings, 0 replies; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-21 15:30 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Krzysztof Wilczyński

On Sun, Feb 21, 2021 at 04:01:32PM +0100, Greg Kroah-Hartman wrote:
> On Sun, Feb 21, 2021 at 03:55:18PM +0200, Leon Romanovsky wrote:
> > On Sun, Feb 21, 2021 at 02:00:41PM +0100, Greg Kroah-Hartman wrote:
> > > On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
> > > > On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
> > > >
> > > > > Ok, can you step back and try to explain what problem you are trying to
> > > > > solve first, before getting bogged down in odd details?  I find it
> > > > > highly unlikely that this is something "unique", but I could be wrong as
> > > > > I do not understand what you are wanting to do here at all.
> > > >
> > > > We want to add two new sysfs files:
> > > >
> > > >   sriov_vf_total_msix, for PF devices
> > > >   sriov_vf_msix_count, for VF devices associated with the PF
> > > >
> > > > AFAICT it is *acceptable* if they are both present always.  But it
> > > > would be *ideal* if they were only present when a driver that
> > > > implements the ->sriov_get_vf_total_msix() callback is bound to the
> > > > PF.
> > >
> > > Ok, so in the pci bus probe function, if the driver that successfully
> > > binds to the device is of this type, then create the sysfs files.
> > >
> > > The driver core will properly emit a KOBJ_BIND message when the driver
> > > is bound to the device, so userspace knows it is now safe to rescan the
> > > device to see any new attributes.
> > >
> > > Here's some horrible pseudo-patch for where this probably should be
> > > done:
> > >
> > >
> > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > > index ec44a79e951a..5a854a5e3977 100644
> > > --- a/drivers/pci/pci-driver.c
> > > +++ b/drivers/pci/pci-driver.c
> > > @@ -307,8 +307,14 @@ static long local_pci_probe(void *_ddi)
> > >  	pm_runtime_get_sync(dev);
> > >  	pci_dev->driver = pci_drv;
> > >  	rc = pci_drv->probe(pci_dev, ddi->id);
> > > -	if (!rc)
> > > +	if (!rc) {
> > > +		/* If PF or FV driver was bound, let's add some more sysfs files */
> > > +		if (pci_drv->is_pf)
> > > +			device_add_groups(pci_dev->dev, pf_groups);
> > > +		if (pci_drv->is_fv)
> > > +			device_add_groups(pci_dev->dev, fv_groups);
> > >  		return rc;
> > > +	}
> > >  	if (rc < 0) {
> > >  		pci_dev->driver = NULL;
> > >  		pm_runtime_put_sync(dev);
> > >
> > >
> > >
> > >
> > > Add some proper error handling if device_add_groups() fails, and then do
> > > the same thing to remove the sysfs files when the device is unbound from
> > > the driver, and you should be good to go.
> > >
> > > Or is this what you all are talking about already and I'm just totally
> > > confused?
> >
> > There are two different things here. First we need to add sysfs files
> > for VF as the event of PF driver bind, not for the VF binds.
> >
> > In your pseudo code, it will look:
> >   	rc = pci_drv->probe(pci_dev, ddi->id);
> >  -	if (!rc)
> >  +	if (!rc) {
> >  +		/* If PF or FV driver was bound, let's add some more sysfs files */
> >  +		if (pci_drv->is_pf) {
> >  +                      int i = 0;
> >  +			device_add_groups(pci_dev->dev, pf_groups);
> >  +                      for (i; i < pci_dev->totalVF; i++) {
> >  +                              struct pci_device vf_dev = find_vf_device(pci_dev, i);
> >  +
> >  +				device_add_groups(vf_dev->dev, fv_groups);
>
> Hahaha, no.
>
> You are randomly adding new sysfs files to a _DIFFERENT_ device than the
> one that is currently undergoing the probe() call?  That's crazy.  And
> will break userspace.

It is more complex than a _DIFFERENT_ device; we are talking about SR-IOV
capable devices and their VFs, which are created by the connected PF function.

And the VF MUST NOT be probed; we check for that and protect this flow.
To summarize: the PF must be bound to a driver, the VF mustn't.

>
> Why would you want that?  The device should ONLY change when the device
> that controls it has a driver bound/unbound to it, that should NEVER
> cause random other devices on the bus to change state or sysfs files.

Greg, I don't know if you are familiar with the SR-IOV concepts which I
explained before, but we are not talking about random devices. The PF device
owns its VFs; this is visible on the bus through the virtfn symlinks, and
even the PF driver iterates over those VFs during its probe.

The PF driver is the one that starts and stops those VF devices.

>
> >  +                      }
> >  +              }
> >   		return rc;
> >
> > Second, the code proposed by me does that but with driver callback that
> > PF calls during init/uninit.
>
> That works too, but really, why not just have the pci core do it for
> you?  That way you do not have to go and modify each and every PCI
> driver to get this type of support.  PCI core things belong in the PCI
> core, not in each individual driver.

There are not many drivers which support this specific configuration
flow. It needs to be an SR-IOV capable device with the ability to
overwrite a specific PCI default through the PF device for its VF
devices. For now, we are talking about one device on the market, and I
can imagine that an extra 2-3 vendors in the world will support this
flow.

During review, I was asked to create an API that controls those sysfs
files only for specific devices that explicitly acknowledge support.

Greg, please take a look on the cover letter and Documentation.
https://lore.kernel.org/linux-rdma/20210216203726.GH4247@nvidia.com/T/#m899d883c8a10d95959ac0cd2833762f93729b8ef

Thanks

>
> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-21  6:59                       ` Leon Romanovsky
@ 2021-02-23 21:07                         ` Bjorn Helgaas
  2021-02-24  8:09                           ` Greg Kroah-Hartman
  2021-02-24  9:53                           ` Leon Romanovsky
  0 siblings, 2 replies; 32+ messages in thread
From: Bjorn Helgaas @ 2021-02-23 21:07 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Greg Kroah-Hartman, Bjorn Helgaas, Saeed Mahameed,
	Jason Gunthorpe, Alexander Duyck, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson,
	David S . Miller, Krzysztof Wilczyński

On Sun, Feb 21, 2021 at 08:59:18AM +0200, Leon Romanovsky wrote:
> On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
> > On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
> >
> > > Ok, can you step back and try to explain what problem you are trying to
> > > solve first, before getting bogged down in odd details?  I find it
> > > highly unlikely that this is something "unique", but I could be wrong as
> > > I do not understand what you are wanting to do here at all.
> >
> > We want to add two new sysfs files:
> >
> >   sriov_vf_total_msix, for PF devices
> >   sriov_vf_msix_count, for VF devices associated with the PF
> >
> > AFAICT it is *acceptable* if they are both present always.  But it
> > would be *ideal* if they were only present when a driver that
> > implements the ->sriov_get_vf_total_msix() callback is bound to the
> > PF.
> 
> BTW, we already have all possible combinations: static, static with
> folder, with and without "sriov_" prefix, dynamic with and without
> folders on VFs.
> 
> I need to know on which version I'll get Acked-by and that version I
> will resubmit.

I propose that you make static attributes for both files, so
"sriov_vf_total_msix" is visible for *every* PF in the system and
"sriov_vf_msix_count" is visible for *every* VF in the system.

The PF "sriov_vf_total_msix" show function can return zero if there's
no PF driver or it doesn't support ->sriov_get_vf_total_msix().
(Incidentally, I think the documentation should mention that when it
*is* supported, the contents of this file are *constant*, i.e., it
does not decrease as vectors are assigned to VFs.)

The "sriov_vf_msix_count" set function can ignore writes if there's no
PF driver or it doesn't support ->sriov_get_vf_total_msix(), or if a
VF driver is bound.

Any userspace software must be able to deal with those scenarios
anyway, so I don't think the mere presence or absence of the files is
a meaningful signal to that software.

If we figure out a way to make the files visible only when the
appropriate driver is bound, that might be nice and could always be
done later.  But I don't think it's essential.
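
Concretely, the show/set bodies could look something like the sketch
below.  It is only a sketch: it assumes a matching
->sriov_set_msix_vec_count() driver callback in struct pci_driver, and
the error codes for the unsupported cases are placeholders:

static ssize_t sriov_vf_total_msix_show(struct device *dev,
					struct device_attribute *attr,
					char *buf)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	u32 vf_total_msix = 0;

	device_lock(dev);
	if (pdev->driver && pdev->driver->sriov_get_vf_total_msix)
		vf_total_msix = pdev->driver->sriov_get_vf_total_msix(pdev);
	device_unlock(dev);

	return sysfs_emit(buf, "%u\n", vf_total_msix);
}
static DEVICE_ATTR_RO(sriov_vf_total_msix);

static ssize_t sriov_vf_msix_count_store(struct device *dev,
					 struct device_attribute *attr,
					 const char *buf, size_t count)
{
	struct pci_dev *vf_dev = to_pci_dev(dev);
	struct pci_dev *pdev = pci_physfn(vf_dev);
	int val, ret;

	ret = kstrtoint(buf, 0, &val);
	if (ret)
		return ret;

	device_lock(&pdev->dev);
	if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
		/* no PF driver bound, or it does not support the feature */
		ret = -EOPNOTSUPP;
		goto unlock_pf;
	}

	device_lock(dev);
	if (vf_dev->driver) {
		/* a VF driver is already bound; changing the size is unsafe */
		ret = -EBUSY;
		goto unlock_vf;
	}

	ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);

unlock_vf:
	device_unlock(dev);
unlock_pf:
	device_unlock(&pdev->dev);
	return ret ? ret : count;
}
static DEVICE_ATTR_WO(sriov_vf_msix_count);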

Bjorn

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-23 21:07                         ` Bjorn Helgaas
@ 2021-02-24  8:09                           ` Greg Kroah-Hartman
  2021-02-24 21:37                             ` Don Dutile
  2021-02-24  9:53                           ` Leon Romanovsky
  1 sibling, 1 reply; 32+ messages in thread
From: Greg Kroah-Hartman @ 2021-02-24  8:09 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Krzysztof Wilczyński

On Tue, Feb 23, 2021 at 03:07:43PM -0600, Bjorn Helgaas wrote:
> On Sun, Feb 21, 2021 at 08:59:18AM +0200, Leon Romanovsky wrote:
> > On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
> > > On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
> > >
> > > > Ok, can you step back and try to explain what problem you are trying to
> > > > solve first, before getting bogged down in odd details?  I find it
> > > > highly unlikely that this is something "unique", but I could be wrong as
> > > > I do not understand what you are wanting to do here at all.
> > >
> > > We want to add two new sysfs files:
> > >
> > >   sriov_vf_total_msix, for PF devices
> > >   sriov_vf_msix_count, for VF devices associated with the PF
> > >
> > > AFAICT it is *acceptable* if they are both present always.  But it
> > > would be *ideal* if they were only present when a driver that
> > > implements the ->sriov_get_vf_total_msix() callback is bound to the
> > > PF.
> > 
> > BTW, we already have all possible combinations: static, static with
> > folder, with and without "sriov_" prefix, dynamic with and without
> > folders on VFs.
> > 
> > I need to know on which version I'll get Acked-by and that version I
> > will resubmit.
> 
> I propose that you make static attributes for both files, so
> "sriov_vf_total_msix" is visible for *every* PF in the system and
> "sriov_vf_msix_count" is visible for *every* VF in the system.
> 
> The PF "sriov_vf_total_msix" show function can return zero if there's
> no PF driver or it doesn't support ->sriov_get_vf_total_msix().
> (Incidentally, I think the documentation should mention that when it
> *is* supported, the contents of this file are *constant*, i.e., it
> does not decrease as vectors are assigned to VFs.)
> 
> The "sriov_vf_msix_count" set function can ignore writes if there's no
> PF driver or it doesn't support ->sriov_get_vf_total_msix(), or if a
> VF driver is bound.
> 
> Any userspace software must be able to deal with those scenarios
> anyway, so I don't think the mere presence or absence of the files is
> a meaningful signal to that software.

Hopefully, good luck with that!

> If we figure out a way to make the files visible only when the
> appropriate driver is bound, that might be nice and could always be
> done later.  But I don't think it's essential.

That seems reasonable, feel free to cc: me on the next patch series and
I'll try to review it, which should make more sense to me than this
email thread :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-23 21:07                         ` Bjorn Helgaas
  2021-02-24  8:09                           ` Greg Kroah-Hartman
@ 2021-02-24  9:53                           ` Leon Romanovsky
  2021-02-24 15:07                             ` Bjorn Helgaas
  1 sibling, 1 reply; 32+ messages in thread
From: Leon Romanovsky @ 2021-02-24  9:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Greg Kroah-Hartman, Bjorn Helgaas, Saeed Mahameed,
	Jason Gunthorpe, Alexander Duyck, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson,
	David S . Miller, Krzysztof Wilczyński

On Tue, Feb 23, 2021 at 03:07:43PM -0600, Bjorn Helgaas wrote:
> On Sun, Feb 21, 2021 at 08:59:18AM +0200, Leon Romanovsky wrote:
> > On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
> > > On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
> > >
> > > > Ok, can you step back and try to explain what problem you are trying to
> > > > solve first, before getting bogged down in odd details?  I find it
> > > > highly unlikely that this is something "unique", but I could be wrong as
> > > > I do not understand what you are wanting to do here at all.
> > >
> > > We want to add two new sysfs files:
> > >
> > >   sriov_vf_total_msix, for PF devices
> > >   sriov_vf_msix_count, for VF devices associated with the PF
> > >
> > > AFAICT it is *acceptable* if they are both present always.  But it
> > > would be *ideal* if they were only present when a driver that
> > > implements the ->sriov_get_vf_total_msix() callback is bound to the
> > > PF.
> >
> > BTW, we already have all possible combinations: static, static with
> > folder, with and without "sriov_" prefix, dynamic with and without
> > folders on VFs.
> >
> > I need to know on which version I'll get Acked-by and that version I
> > will resubmit.
>
> I propose that you make static attributes for both files, so
> "sriov_vf_total_msix" is visible for *every* PF in the system and
> "sriov_vf_msix_count" is visible for *every* VF in the system.

No problem, this is close to v0/v1.

>
> The PF "sriov_vf_total_msix" show function can return zero if there's
> no PF driver or it doesn't support ->sriov_get_vf_total_msix().
> (Incidentally, I think the documentation should mention that when it
> *is* supported, the contents of this file are *constant*, i.e., it
> does not decrease as vectors are assigned to VFs.)
>
> The "sriov_vf_msix_count" set function can ignore writes if there's no
> PF driver or it doesn't support ->sriov_get_vf_total_msix(), or if a
> VF driver is bound.

Just to be clear, why don't we return EINVAL/EOPNOTSUPP instead of
silently ignoring the write?

Thanks

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-24  9:53                           ` Leon Romanovsky
@ 2021-02-24 15:07                             ` Bjorn Helgaas
  0 siblings, 0 replies; 32+ messages in thread
From: Bjorn Helgaas @ 2021-02-24 15:07 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Greg Kroah-Hartman, Bjorn Helgaas, Saeed Mahameed,
	Jason Gunthorpe, Alexander Duyck, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson,
	David S . Miller, Krzysztof Wilczyński

On Wed, Feb 24, 2021 at 11:53:30AM +0200, Leon Romanovsky wrote:
> On Tue, Feb 23, 2021 at 03:07:43PM -0600, Bjorn Helgaas wrote:
> > On Sun, Feb 21, 2021 at 08:59:18AM +0200, Leon Romanovsky wrote:
> > > On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
> > > > On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
> > > >
> > > > > Ok, can you step back and try to explain what problem you are trying to
> > > > > solve first, before getting bogged down in odd details?  I find it
> > > > > highly unlikely that this is something "unique", but I could be wrong as
> > > > > I do not understand what you are wanting to do here at all.
> > > >
> > > > We want to add two new sysfs files:
> > > >
> > > >   sriov_vf_total_msix, for PF devices
> > > >   sriov_vf_msix_count, for VF devices associated with the PF
> > > >
> > > > AFAICT it is *acceptable* if they are both present always.  But it
> > > > would be *ideal* if they were only present when a driver that
> > > > implements the ->sriov_get_vf_total_msix() callback is bound to the
> > > > PF.
> > >
> > > BTW, we already have all possible combinations: static, static with
> > > folder, with and without "sriov_" prefix, dynamic with and without
> > > folders on VFs.
> > >
> > > I need to know on which version I'll get Acked-by and that version I
> > > will resubmit.
> >
> > I propose that you make static attributes for both files, so
> > "sriov_vf_total_msix" is visible for *every* PF in the system and
> > "sriov_vf_msix_count" is visible for *every* VF in the system.
> 
> No problem, this is close to v0/v1.
> 
> > The PF "sriov_vf_total_msix" show function can return zero if there's
> > no PF driver or it doesn't support ->sriov_get_vf_total_msix().
> > (Incidentally, I think the documentation should mention that when it
> > *is* supported, the contents of this file are *constant*, i.e., it
> > does not decrease as vectors are assigned to VFs.)
> >
> > The "sriov_vf_msix_count" set function can ignore writes if there's no
> > PF driver or it doesn't support ->sriov_get_vf_total_msix(), or if a
> > VF driver is bound.
> 
> Just to be clear, why don't we return EINVAL/EOPNOTSUPP instead of
> silently ignore?

Returning some error is fine.  I just meant that the reads/writes
would have no effect on the PCI core or the device driver.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
  2021-02-24  8:09                           ` Greg Kroah-Hartman
@ 2021-02-24 21:37                             ` Don Dutile
  0 siblings, 0 replies; 32+ messages in thread
From: Don Dutile @ 2021-02-24 21:37 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Bjorn Helgaas
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Alex Williamson, David S . Miller, Krzysztof Wilczyński

On 2/24/21 3:09 AM, Greg Kroah-Hartman wrote:
> On Tue, Feb 23, 2021 at 03:07:43PM -0600, Bjorn Helgaas wrote:
>> On Sun, Feb 21, 2021 at 08:59:18AM +0200, Leon Romanovsky wrote:
>>> On Sat, Feb 20, 2021 at 01:06:00PM -0600, Bjorn Helgaas wrote:
>>>> On Fri, Feb 19, 2021 at 09:20:18AM +0100, Greg Kroah-Hartman wrote:
>>>>
>>>>> Ok, can you step back and try to explain what problem you are trying to
>>>>> solve first, before getting bogged down in odd details?  I find it
>>>>> highly unlikely that this is something "unique", but I could be wrong as
>>>>> I do not understand what you are wanting to do here at all.
>>>> We want to add two new sysfs files:
>>>>
>>>>    sriov_vf_total_msix, for PF devices
>>>>    sriov_vf_msix_count, for VF devices associated with the PF
>>>>
>>>> AFAICT it is *acceptable* if they are both present always.  But it
>>>> would be *ideal* if they were only present when a driver that
>>>> implements the ->sriov_get_vf_total_msix() callback is bound to the
>>>> PF.
>>> BTW, we already have all possible combinations: static, static with
>>> folder, with and without "sriov_" prefix, dynamic with and without
>>> folders on VFs.
>>>
>>> I need to know on which version I'll get Acked-by and that version I
>>> will resubmit.
>> I propose that you make static attributes for both files, so
>> "sriov_vf_total_msix" is visible for *every* PF in the system and
>> "sriov_vf_msix_count" is visible for *every* VF in the system.
>>
>> The PF "sriov_vf_total_msix" show function can return zero if there's
>> no PF driver or it doesn't support ->sriov_get_vf_total_msix().
>> (Incidentally, I think the documentation should mention that when it
>> *is* supported, the contents of this file are *constant*, i.e., it
>> does not decrease as vectors are assigned to VFs.)
>>
>> The "sriov_vf_msix_count" set function can ignore writes if there's no
>> PF driver or it doesn't support ->sriov_get_vf_total_msix(), or if a
>> VF driver is bound.
>>
>> Any userspace software must be able to deal with those scenarios
>> anyway, so I don't think the mere presence or absence of the files is
>> a meaningful signal to that software.
> Hopefully, good luck with that!
Management sw is used to dealing with optional sysfs files.
libvirt does that now with the VF files for a PF -- not all PFs have VFs.
The VF files are only created if a VF ext-cfg-hdr exists.

So, as Bjorn said, mgmt sw related to optionally tuning PCIe devices is
designed to check for file existence.

>> If we figure out a way to make the files visible only when the
>> appropriate driver is bound, that might be nice and could always be
>> done later.  But I don't think it's essential.
> That seems reasonable, feel free to cc: me on the next patch series and
> I'll try to review it, which should make more sense to me than this
> email thread :)
>
> thanks,
>
> greg k-h
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-02-24 21:39 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-09 13:34 [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
2021-02-09 13:34 ` [PATCH mlx5-next v6 1/4] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs Leon Romanovsky
2021-02-15 21:01   ` Bjorn Helgaas
2021-02-16  7:33     ` Leon Romanovsky
2021-02-16 16:12       ` Bjorn Helgaas
2021-02-16 19:58         ` Leon Romanovsky
2021-02-17 18:02           ` Bjorn Helgaas
2021-02-17 19:25             ` Jason Gunthorpe
2021-02-17 20:28               ` Bjorn Helgaas
2021-02-17 23:52                 ` Jason Gunthorpe
2021-02-18 10:15             ` Leon Romanovsky
2021-02-18 22:39               ` Bjorn Helgaas
2021-02-19  7:52                 ` Leon Romanovsky
2021-02-19  8:20                   ` Greg Kroah-Hartman
2021-02-19 16:58                     ` Leon Romanovsky
2021-02-20 19:06                     ` Bjorn Helgaas
2021-02-21  6:59                       ` Leon Romanovsky
2021-02-23 21:07                         ` Bjorn Helgaas
2021-02-24  8:09                           ` Greg Kroah-Hartman
2021-02-24 21:37                             ` Don Dutile
2021-02-24  9:53                           ` Leon Romanovsky
2021-02-24 15:07                             ` Bjorn Helgaas
2021-02-21 13:00                       ` Greg Kroah-Hartman
2021-02-21 13:55                         ` Leon Romanovsky
2021-02-21 15:01                           ` Greg Kroah-Hartman
2021-02-21 15:30                             ` Leon Romanovsky
2021-02-16 20:37         ` Jason Gunthorpe
2021-02-09 13:34 ` [PATCH mlx5-next v6 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
2021-02-09 13:34 ` [PATCH mlx5-next v6 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
2021-02-09 13:34 ` [PATCH mlx5-next v6 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky
2021-02-09 21:06 ` [PATCH mlx5-next v6 0/4] Dynamically assign MSI-X vectors count Alexander Duyck
2021-02-14  5:24   ` Leon Romanovsky
