linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
@ 2021-03-01  7:55 Leon Romanovsky
  2021-03-01  7:55 ` [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs Leon Romanovsky
                   ` (6 more replies)
  0 siblings, 7 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-01  7:55 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed, Alexander Duyck
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

From: Leon Romanovsky <leonro@nvidia.com>

@Alexander Duyck, please update me if I can add your ROB tag again
to the series, because you liked v6 more.

Thanks

---------------------------------------------------------------------------------
Changelog
v7:
 * Rebase on top v5.12-rc1
 * More english fixes
 * Returned to static sysfs creation model as was implemented in v0/v1.
v6: https://lore.kernel.org/linux-pci/20210209133445.700225-1-leon@kernel.org
 * Patch 1:
   * English fixes
   * Moved pci_vf_set_msix_vec_count() from msi.c to iov.c
   * Embedded pci_vf_set_msix_vec_count() into sriov_vf_msix_count_store
   * Deleted sriov_vf_msix_count_show
   * Deleted vfs_overlay folder
   * Renamed functions *_vfs_overlay_* to be *_vf_overlay_*
   * Deleted is_supported and attribute_group because it confused people more than
     it gave advantage.
   * Changed vf_total_msix to be callback
 * Patch 3:
   * Fixed english as suggested by Bjorn
   * Added more explanations to the commit message
 * Patch 4:
   * Protected enable/disable with capability check
v5: https://lore.kernel.org/linux-pci/20210126085730.1165673-1-leon@kernel.org
 * Patch 1:
  * Added forgotten "inline" keyword when declaring empty functions.
v4: https://lore.kernel.org/linux-pci/20210124131119.558563-1-leon@kernel.org
 * Used sysfs_emit() instead of sprintf() in new sysfs entries.
 * Changed EXPORT_SYMBOL to be EXPORT_SYMBOL_GPL for pci_iov_virtfn_devfn().
 * Rewrote sysfs registration code to be driven by PF that wants to enable VF
   overlay instead of creating to all SR-IOV devices.
 * Grouped all such functionality under new "vfs_overlay" folder.
 * Combined two PCI patches into one.
v3: https://lore.kernel.org/linux-pci/20210117081548.1278992-1-leon@kernel.org
 * Renamed pci_set_msix_vec_count to be pci_vf_set_msix_vec_count.
 * Added VF msix_cap check to hide sysfs entry if device doesn't support msix.
 * Changed "-" to be ":" in the mlx5 patch to silence CI warnings about missing
   kdoc description.
 * Split differently error print in mlx5 driver to avoid checkpatch warning.
v2: https://lore.kernel.org/linux-pci/20210114103140.866141-1-leon@kernel.org
 * Patch 1:
  * Renamed vf_msix_vec sysfs knob to be sriov_vf_msix_count
  * Added PF and VF device locks during set MSI-X call to protect from parallel
    driver bind/unbind operations.
  * Removed extra checks when reading sriov_vf_msix, because users will
    be able to distinguish between supported/not supported by looking on
    sriov_vf_total_msix count.
  * Changed all occurrences of "numb" to be "count"
  * Changed returned error from EOPNOTSUPP to be EBUSY if user tries to set
    MSI-X count after driver already bound to the VF.
  * Added extra comment in pci_set_msix_vec_count() to emphasize that driver
    should not be bound.
 * Patch 2:
  * Changed vf_total_msix from int to be u32 and updated function signatures
    accordingly.
  * Improved patch title
v1: https://lore.kernel.org/linux-pci/20210110150727.1965295-1-leon@kernel.org
 * Improved wording and commit messages of first PCI patch
 * Added extra PCI patch to provide total number of MSI-X vectors
 * Prohibited read of vf_msix_vec sysfs file if driver doesn't support write
 * Removed extra function definition in pci.h
v0: https://lore.kernel.org/linux-pci/20210103082440.34994-1-leon@kernel.org

--------------------------------------------------------------------
Hi,

The number of MSI-X vectors is PCI property visible through lspci, that
field is read-only and configured by the device.

The static assignment of an amount of MSI-X vectors doesn't allow utilize
the newly created VF because it is not known to the device the future load
and configuration where that VF will be used.

The VFs are created on the hypervisor and forwarded to the VMs that have
different properties (for example number of CPUs).

To overcome the inefficiency in the spread of such MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors, before VF is initialized and bounded to the driver.

Before this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

Configuration script:
1. Start fresh
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
modprobe -q -r mlx5_ib mlx5_core
2. Ensure that driver doesn't run and it is safe to change MSI-X
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_drivers_autoprobe
3. Load driver for the PF
modprobe mlx5_core
4. Configure one of the VFs with new number
echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
echo 21 > /sys/bus/pci/devices/0000\:08\:00.2/sriov_vf_msix_count

After this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
        Capabilities: [9c] MSI-X: Enable- Count=21 Masked-

Thanks

Leon Romanovsky (4):
  PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
  net/mlx5: Add dynamic MSI-X capabilities bits
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks

 Documentation/ABI/testing/sysfs-bus-pci       |  29 +++++
 .../net/ethernet/mellanox/mlx5/core/main.c    |   6 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  12 +++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  73 +++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  48 ++++++++-
 drivers/pci/iov.c                             | 102 ++++++++++++++++--
 drivers/pci/pci-sysfs.c                       |   3 +-
 drivers/pci/pci.h                             |   3 +-
 include/linux/mlx5/mlx5_ifc.h                 |  11 +-
 include/linux/pci.h                           |   8 ++
 10 files changed, 284 insertions(+), 11 deletions(-)

--
2.29.2


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
  2021-03-01  7:55 [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-03-01  7:55 ` Leon Romanovsky
  2021-03-01  8:14   ` Greg Kroah-Hartman
  2021-03-01  7:55 ` [PATCH mlx5-next v7 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-01  7:55 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

From: Leon Romanovsky <leonro@nvidia.com>

A typical cloud provider SR-IOV use case is to create many VFs for use by
guest VMs. The VFs may not be assigned to a VM until a customer requests a
VM of a certain size, e.g., number of CPUs. A VF may need MSI-X vectors
proportional to the number of CPUs in the VM, but there is no standard way
to change the number of MSI-X vectors supported by a VF.

Some Mellanox ConnectX devices support dynamic assignment of MSI-X vectors
to SR-IOV VFs. This can be done by the PF driver after VFs are enabled,
and it can be done without affecting VFs that are already in use. The
hardware supports a limited pool of MSI-X vectors that can be assigned to
the PF or to individual VFs.  This is device-specific behavior that
requires support in the PF driver.

Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a writable
"sriov_vf_msix_count" file for each VF. Management software may use these
to learn how many MSI-X vectors are available and to dynamically assign
them to VFs before the VFs are passed through to a VM.

If the PF driver implements the ->sriov_get_vf_total_msix() callback,
"sriov_vf_total_msix" contains the total number of MSI-X vectors available
for distribution among VFs.

If no driver is bound to the VF, writing "N" to "sriov_vf_msix_count" uses
the PF driver ->sriov_set_msix_vec_count() callback to assign "N" MSI-X
vectors to the VF.  When a VF driver subsequently reads the MSI-X Message
Control register, it will see the new Table Size "N".

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/ABI/testing/sysfs-bus-pci |  29 +++++++
 drivers/pci/iov.c                       | 102 ++++++++++++++++++++++--
 drivers/pci/pci-sysfs.c                 |   3 +-
 drivers/pci/pci.h                       |   3 +-
 include/linux/pci.h                     |   8 ++
 5 files changed, 137 insertions(+), 8 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 25c9c39770c6..ebabd0d2ae88 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -375,3 +375,32 @@ Description:
 		The value comes from the PCI kernel device state and can be one
 		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
 		The file is read only.
+
+What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
+Date:		January 2021
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with a SR-IOV PF.  It contains the
+		total number of MSI-X vectors available for assignment to
+		all VFs associated with PF.  It will zero if the device
+		doesn't support this functionality. For supported devices,
+		the value will be constant and won't be changed after MSI-X
+		vectors assignment.
+
+What:		/sys/bus/pci/devices/.../sriov_vf_msix_count
+Date:		January 2021
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with a SR-IOV VF.  It allows
+		configuration of the number of MSI-X vectors for the VF.  This
+		allows devices that have a global pool of MSI-X vectors to
+		optimally divide them between VFs based on VF usage.
+
+		The values accepted are:
+		 * > 0 - this number will be reported as the Table Size in the
+			 VF's MSI-X capability
+		 * < 0 - not valid
+		 * = 0 - will reset to the device default value
+
+		The file is writable if the PF is bound to a driver that
+		implements ->sriov_set_msix_vec_count().
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4afd4ee4f7f0..9bf6f52ad4d8 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * vf_id) & 0xff;
 }
+EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn);

 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
@@ -157,6 +158,92 @@ int pci_iov_sysfs_link(struct pci_dev *dev,
 	return rc;
 }

+#ifdef CONFIG_PCI_MSI
+static ssize_t sriov_vf_msix_count_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct pci_dev *vf_dev = to_pci_dev(dev);
+	struct pci_dev *pdev = pci_physfn(vf_dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val < 0)
+		return -EINVAL;
+
+	device_lock(&pdev->dev);
+	if (!pdev->driver || !pdev->driver->sriov_set_msix_vec_count) {
+		ret = -EOPNOTSUPP;
+		goto err_pdev;
+	}
+
+	device_lock(&vf_dev->dev);
+	if (vf_dev->driver) {
+		/*
+		 * A driver is already attached to this VF and has configured
+		 * itself based on the current MSI-X vector count. Changing
+		 * the vector size could mess up the driver, so block it.
+		 */
+		ret = -EBUSY;
+		goto err_dev;
+	}
+
+	ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);
+
+err_dev:
+	device_unlock(&vf_dev->dev);
+err_pdev:
+	device_unlock(&pdev->dev);
+	return ret ? : count;
+}
+static DEVICE_ATTR_WO(sriov_vf_msix_count);
+
+static ssize_t sriov_vf_total_msix_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	u32 vf_total_msix = 0;
+
+	device_lock(dev);
+	if (!pdev->driver || !pdev->driver->sriov_get_vf_total_msix)
+		goto unlock;
+
+	vf_total_msix = pdev->driver->sriov_get_vf_total_msix(pdev);
+unlock:
+	device_unlock(dev);
+	return sysfs_emit(buf, "%u\n", vf_total_msix);
+}
+static DEVICE_ATTR_RO(sriov_vf_total_msix);
+#endif
+
+static struct attribute *sriov_vf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_msix_count.attr,
+#endif
+	NULL,
+};
+
+static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (!pdev->is_virtfn)
+		return 0;
+
+	return a->mode;
+}
+
+const struct attribute_group sriov_vf_dev_attr_group = {
+	.attrs = sriov_vf_dev_attrs,
+	.is_visible = sriov_vf_attrs_are_visible,
+};
+
 int pci_iov_add_virtfn(struct pci_dev *dev, int id)
 {
 	int i;
@@ -400,18 +487,21 @@ static DEVICE_ATTR_RO(sriov_stride);
 static DEVICE_ATTR_RO(sriov_vf_device);
 static DEVICE_ATTR_RW(sriov_drivers_autoprobe);

-static struct attribute *sriov_dev_attrs[] = {
+static struct attribute *sriov_pf_dev_attrs[] = {
 	&dev_attr_sriov_totalvfs.attr,
 	&dev_attr_sriov_numvfs.attr,
 	&dev_attr_sriov_offset.attr,
 	&dev_attr_sriov_stride.attr,
 	&dev_attr_sriov_vf_device.attr,
 	&dev_attr_sriov_drivers_autoprobe.attr,
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_total_msix.attr,
+#endif
 	NULL,
 };

-static umode_t sriov_attrs_are_visible(struct kobject *kobj,
-				       struct attribute *a, int n)
+static umode_t sriov_pf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
 {
 	struct device *dev = kobj_to_dev(kobj);

@@ -421,9 +511,9 @@ static umode_t sriov_attrs_are_visible(struct kobject *kobj,
 	return a->mode;
 }

-const struct attribute_group sriov_dev_attr_group = {
-	.attrs = sriov_dev_attrs,
-	.is_visible = sriov_attrs_are_visible,
+const struct attribute_group sriov_pf_dev_attr_group = {
+	.attrs = sriov_pf_dev_attrs,
+	.is_visible = sriov_pf_attrs_are_visible,
 };

 int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index f8afd54ca3e1..a6b8fbbba6d2 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1567,7 +1567,8 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
 	&pci_dev_attr_group,
 	&pci_dev_hp_attr_group,
 #ifdef CONFIG_PCI_IOV
-	&sriov_dev_attr_group,
+	&sriov_pf_dev_attr_group,
+	&sriov_vf_dev_attr_group,
 #endif
 	&pci_bridge_attr_group,
 	&pcie_dev_attr_group,
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index ef7c4661314f..afb87b917f07 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -501,7 +501,8 @@ void pci_iov_update_resource(struct pci_dev *dev, int resno);
 resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
 void pci_restore_iov_state(struct pci_dev *dev);
 int pci_iov_bus_range(struct pci_bus *bus);
-extern const struct attribute_group sriov_dev_attr_group;
+extern const struct attribute_group sriov_pf_dev_attr_group;
+extern const struct attribute_group sriov_vf_dev_attr_group;
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 86c799c97b77..9b575a676888 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -856,6 +856,12 @@ struct module;
  *		e.g. drivers/net/e100.c.
  * @sriov_configure: Optional driver callback to allow configuration of
  *		number of VFs to enable via sysfs "sriov_numvfs" file.
+ * @sriov_set_msix_vec_count: PF Driver callback to change number of MSI-X
+ *              vectors on a VF. Triggered via sysfs "sriov_vf_msix_count".
+ *              This will change MSI-X Table Size in the VF Message Control
+ *              registers.
+ * @sriov_get_vf_total_msix: PF driver callback to get the total number of
+ *              MSI-X vectors available for distribution to the VFs.
  * @err_handler: See Documentation/PCI/pci-error-recovery.rst
  * @groups:	Sysfs attribute groups.
  * @driver:	Driver model structure.
@@ -871,6 +877,8 @@ struct pci_driver {
 	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
 	void (*shutdown)(struct pci_dev *dev);
 	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
+	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
+	u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
 	const struct pci_error_handlers *err_handler;
 	const struct attribute_group **groups;
 	struct device_driver	driver;
--
2.29.2


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH mlx5-next v7 2/4] net/mlx5: Add dynamic MSI-X capabilities bits
  2021-03-01  7:55 [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-03-01  7:55 ` [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs Leon Romanovsky
@ 2021-03-01  7:55 ` Leon Romanovsky
  2021-03-01  7:55 ` [PATCH mlx5-next v7 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-01  7:55 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

From: Leon Romanovsky <leonro@nvidia.com>

These new fields declare the number of MSI-X vectors that is possible to
allocate on the VF through PF configuration.

Value must be in range defined by min_dynamic_vf_msix_table_size and
max_dynamic_vf_msix_table_size.

The driver should continue to query its MSI-X table through PCI
configuration header.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 60b97556ee14..4215bb6a6b52 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1685,7 +1685,16 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8	   reserved_at_6e0[0x10];
 	u8	   sf_base_id[0x10];

-	u8	   reserved_at_700[0x80];
+	u8	   reserved_at_700[0x8];
+	u8	   num_total_dynamic_vf_msix[0x18];
+	u8	   reserved_at_720[0x14];
+	u8	   dynamic_msix_table_size[0xc];
+	u8	   reserved_at_740[0xc];
+	u8	   min_dynamic_vf_msix_table_size[0x4];
+	u8	   reserved_at_750[0x4];
+	u8	   max_dynamic_vf_msix_table_size[0xc];
+
+	u8	   reserved_at_760[0x20];
 	u8	   vhca_tunnel_commands[0x40];
 	u8	   reserved_at_7c0[0x40];
 };
--
2.29.2


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH mlx5-next v7 3/4] net/mlx5: Dynamically assign MSI-X vectors count
  2021-03-01  7:55 [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-03-01  7:55 ` [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs Leon Romanovsky
  2021-03-01  7:55 ` [PATCH mlx5-next v7 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
@ 2021-03-01  7:55 ` Leon Romanovsky
  2021-03-01  7:55 ` [PATCH mlx5-next v7 4/4] net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks Leon Romanovsky
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-01  7:55 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

From: Leon Romanovsky <leonro@nvidia.com>

The number of MSI-X vectors is a PCI property visible through lspci. The
field is read-only and configured by the device. The mlx5 devices work in
a static or dynamic assignment mode.

Static assignment means that all newly created VFs have a preset number of
MSI-X vectors determined by device configuration parameters. This can
result in some VFs having too many or too few MSI-X vectors. Till now this
has been the only means of fine-tuning the MSI-X vector count and it was
acceptable for small numbers of VFs.

With dynamic assignment the inefficiency of having a fixed number of MSI-X
vectors can be avoided with each VF having exactly the required
vectors. Userspace will provide this information while provisioning the VF
for use, based on the intended use. For instance if being used with a VM,
the MSI-X vector count might be matched to the CPU count of the VM.

For compatibility mlx5 continues to start up with MSI-X vector assignment,
but the kernel can now access a larger dynamic vector pool and assign more
vectors to created VFs.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    |  4 +
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  5 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 73 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 13 +++-
 4 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index c568896cfb23..0489712865b7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -571,6 +571,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)

 	mlx5_vhca_state_cap_handle(dev, set_hca_cap);

+	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
+		MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix,
+			 MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix));
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index efe403c7e354..f0aed664dd35 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -174,6 +174,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
 int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
+
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
+			    int msix_vec_count);
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
+
 struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index a61e09aff152..19e3e978267e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -61,6 +61,79 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev *dev, int vecidx)
 	return &irq_table->irq[vecidx];
 }

+/**
+ * mlx5_get_default_msix_vec_count - Get the default number of MSI-X vectors
+ *                                   to be ssigned to each VF.
+ * @dev: PF to work on
+ * @num_vfs: Number of enabled VFs
+ */
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
+{
+	int num_vf_msix, min_msix, max_msix;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	/* Limit maximum number of MSI-X vectors so the default configuration
+	 * has some available in the pool. This will allow the user to increase
+	 * the number of vectors in a VF without having to first size-down other
+	 * VFs.
+	 */
+	return max(min(num_vf_msix / num_vfs, max_msix / 2), min_msix);
+}
+
+/**
+ * mlx5_set_msix_vec_count - Set dynamically allocated MSI-X on the VF
+ * @dev: PF to work on
+ * @function_id: Internal PCI VF function IDd
+ * @msix_vec_count: Number of MSI-X vectors to set
+ */
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
+			    int msix_vec_count)
+{
+	int sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
+	int num_vf_msix, min_msix, max_msix;
+	void *hca_cap, *cap;
+	int ret;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	if (!MLX5_CAP_GEN(dev, vport_group_manager) || !mlx5_core_is_pf(dev))
+		return -EOPNOTSUPP;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	if (msix_vec_count < min_msix)
+		return -EINVAL;
+
+	if (msix_vec_count > max_msix)
+		return -EOVERFLOW;
+
+	hca_cap = kzalloc(sz, GFP_KERNEL);
+	if (!hca_cap)
+		return -ENOMEM;
+
+	cap = MLX5_ADDR_OF(set_hca_cap_in, hca_cap, capability);
+	MLX5_SET(cmd_hca_cap, cap, dynamic_msix_table_size, msix_vec_count);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, opcode, MLX5_CMD_OP_SET_HCA_CAP);
+	MLX5_SET(set_hca_cap_in, hca_cap, other_function, 1);
+	MLX5_SET(set_hca_cap_in, hca_cap, function_id, function_id);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1);
+	ret = mlx5_cmd_exec_in(dev, set_hca_cap, hca_cap);
+	kfree(hca_cap);
+	return ret;
+}
+
 int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 3094d20297a9..f0ec86a1c8a6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -71,8 +71,7 @@ static int sriov_restore_guids(struct mlx5_core_dev *dev, int vf)
 static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 {
 	struct mlx5_core_sriov *sriov = &dev->priv.sriov;
-	int err;
-	int vf;
+	int err, vf, num_msix_count;

 	if (!MLX5_ESWITCH_MANAGER(dev))
 		goto enable_vfs_hca;
@@ -85,12 +84,22 @@ static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 	}

 enable_vfs_hca:
+	num_msix_count = mlx5_get_default_msix_vec_count(dev, num_vfs);
 	for (vf = 0; vf < num_vfs; vf++) {
 		err = mlx5_core_enable_hca(dev, vf + 1);
 		if (err) {
 			mlx5_core_warn(dev, "failed to enable VF %d (%d)\n", vf, err);
 			continue;
 		}
+
+		err = mlx5_set_msix_vec_count(dev, vf + 1, num_msix_count);
+		if (err) {
+			mlx5_core_warn(dev,
+				       "failed to set MSI-X vector counts VF %d, err %d\n",
+				       vf, err);
+			continue;
+		}
+
 		sriov->vfs_ctx[vf].enabled = 1;
 		if (MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_IB) {
 			err = sriov_restore_guids(dev, vf);
--
2.29.2


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH mlx5-next v7 4/4] net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks
  2021-03-01  7:55 [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (2 preceding siblings ...)
  2021-03-01  7:55 ` [PATCH mlx5-next v7 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-03-01  7:55 ` Leon Romanovsky
  2021-03-07  8:11 ` [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-01  7:55 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

From: Leon Romanovsky <leonro@nvidia.com>

The mlx5 implementation executes a firmware command on the PF to change
the configuration of the selected VF.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    |  2 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  7 ++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 35 +++++++++++++++++++
 3 files changed, 44 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 0489712865b7..edca6bc87639 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1700,6 +1700,8 @@ static struct pci_driver mlx5_core_driver = {
 	.shutdown	= shutdown,
 	.err_handler	= &mlx5_err_handler,
 	.sriov_configure   = mlx5_core_sriov_configure,
+	.sriov_get_vf_total_msix = mlx5_sriov_get_vf_total_msix,
+	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
 };

 static void mlx5_core_verify_params(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index f0aed664dd35..99007f2d0424 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -140,6 +140,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev);
 int mlx5_sriov_attach(struct mlx5_core_dev *dev);
 void mlx5_sriov_detach(struct mlx5_core_dev *dev);
 int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs);
+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count);
 int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_create_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy,
@@ -278,4 +279,10 @@ int mlx5_load_one(struct mlx5_core_dev *dev, bool boot);
 int mlx5_vport_get_other_func_cap(struct mlx5_core_dev *dev, u16 function_id, void *out);

 void mlx5_events_work_enqueue(struct mlx5_core_dev *dev, struct work_struct *work);
+static inline u32 mlx5_sriov_get_vf_total_msix(struct pci_dev *pdev)
+{
+	struct mlx5_core_dev *dev = pci_get_drvdata(pdev);
+
+	return MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+}
 #endif /* __MLX5_CORE_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index f0ec86a1c8a6..2338989d4403 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -187,6 +187,41 @@ int mlx5_core_sriov_configure(struct pci_dev *pdev, int num_vfs)
 	return err ? err : num_vfs;
 }

+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
+{
+	struct pci_dev *pf = pci_physfn(vf);
+	struct mlx5_core_sriov *sriov;
+	struct mlx5_core_dev *dev;
+	int num_vf_msix, id;
+
+	dev = pci_get_drvdata(pf);
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return -EOPNOTSUPP;
+
+	if (!msix_vec_count)
+		msix_vec_count =
+			mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf));
+
+	sriov = &dev->priv.sriov;
+
+	/* Reversed translation of PCI VF function number to the internal
+	 * function_id, which exists in the name of virtfn symlink.
+	 */
+	for (id = 0; id < pci_num_vf(pf); id++) {
+		if (!sriov->vfs_ctx[id].enabled)
+			continue;
+
+		if (vf->devfn == pci_iov_virtfn_devfn(pf, id))
+			break;
+	}
+
+	if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled)
+		return -EINVAL;
+
+	return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count);
+}
+
 int mlx5_sriov_attach(struct mlx5_core_dev *dev)
 {
 	if (!mlx5_core_is_pf(dev) || !pci_num_vf(dev->pdev))
--
2.29.2


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
  2021-03-01  7:55 ` [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs Leon Romanovsky
@ 2021-03-01  8:14   ` Greg Kroah-Hartman
  2021-03-01  8:32     ` Leon Romanovsky
  0 siblings, 1 reply; 74+ messages in thread
From: Greg Kroah-Hartman @ 2021-03-01  8:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Alexander Duyck, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Don Dutile, Alex Williamson, David S . Miller

On Mon, Mar 01, 2021 at 09:55:21AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> A typical cloud provider SR-IOV use case is to create many VFs for use by
> guest VMs. The VFs may not be assigned to a VM until a customer requests a
> VM of a certain size, e.g., number of CPUs. A VF may need MSI-X vectors
> proportional to the number of CPUs in the VM, but there is no standard way
> to change the number of MSI-X vectors supported by a VF.
> 
> Some Mellanox ConnectX devices support dynamic assignment of MSI-X vectors
> to SR-IOV VFs. This can be done by the PF driver after VFs are enabled,
> and it can be done without affecting VFs that are already in use. The
> hardware supports a limited pool of MSI-X vectors that can be assigned to
> the PF or to individual VFs.  This is device-specific behavior that
> requires support in the PF driver.
> 
> Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a writable
> "sriov_vf_msix_count" file for each VF. Management software may use these
> to learn how many MSI-X vectors are available and to dynamically assign
> them to VFs before the VFs are passed through to a VM.
> 
> If the PF driver implements the ->sriov_get_vf_total_msix() callback,
> "sriov_vf_total_msix" contains the total number of MSI-X vectors available
> for distribution among VFs.
> 
> If no driver is bound to the VF, writing "N" to "sriov_vf_msix_count" uses
> the PF driver ->sriov_set_msix_vec_count() callback to assign "N" MSI-X
> vectors to the VF.  When a VF driver subsequently reads the MSI-X Message
> Control register, it will see the new Table Size "N".
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci |  29 +++++++
>  drivers/pci/iov.c                       | 102 ++++++++++++++++++++++--
>  drivers/pci/pci-sysfs.c                 |   3 +-
>  drivers/pci/pci.h                       |   3 +-
>  include/linux/pci.h                     |   8 ++
>  5 files changed, 137 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 25c9c39770c6..ebabd0d2ae88 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -375,3 +375,32 @@ Description:
>  		The value comes from the PCI kernel device state and can be one
>  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>  		The file is read only.
> +
> +What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
> +Date:		January 2021
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with a SR-IOV PF.  It contains the
> +		total number of MSI-X vectors available for assignment to
> +		all VFs associated with PF.  It will zero if the device

"The value will be zero if the device..."

And definition of "VF" and PF" are where in this file?

Otherwise looks semi-sane to me, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
  2021-03-01  8:14   ` Greg Kroah-Hartman
@ 2021-03-01  8:32     ` Leon Romanovsky
  2021-03-01  8:37       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-01  8:32 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Mon, Mar 01, 2021 at 09:14:42AM +0100, Greg Kroah-Hartman wrote:
> On Mon, Mar 01, 2021 at 09:55:21AM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > A typical cloud provider SR-IOV use case is to create many VFs for use by
> > guest VMs. The VFs may not be assigned to a VM until a customer requests a
> > VM of a certain size, e.g., number of CPUs. A VF may need MSI-X vectors
> > proportional to the number of CPUs in the VM, but there is no standard way
> > to change the number of MSI-X vectors supported by a VF.
> >
> > Some Mellanox ConnectX devices support dynamic assignment of MSI-X vectors
> > to SR-IOV VFs. This can be done by the PF driver after VFs are enabled,
> > and it can be done without affecting VFs that are already in use. The
> > hardware supports a limited pool of MSI-X vectors that can be assigned to
> > the PF or to individual VFs.  This is device-specific behavior that
> > requires support in the PF driver.
> >
> > Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a writable
> > "sriov_vf_msix_count" file for each VF. Management software may use these
> > to learn how many MSI-X vectors are available and to dynamically assign
> > them to VFs before the VFs are passed through to a VM.
> >
> > If the PF driver implements the ->sriov_get_vf_total_msix() callback,
> > "sriov_vf_total_msix" contains the total number of MSI-X vectors available
> > for distribution among VFs.
> >
> > If no driver is bound to the VF, writing "N" to "sriov_vf_msix_count" uses
> > the PF driver ->sriov_set_msix_vec_count() callback to assign "N" MSI-X
> > vectors to the VF.  When a VF driver subsequently reads the MSI-X Message
> > Control register, it will see the new Table Size "N".
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-pci |  29 +++++++
> >  drivers/pci/iov.c                       | 102 ++++++++++++++++++++++--
> >  drivers/pci/pci-sysfs.c                 |   3 +-
> >  drivers/pci/pci.h                       |   3 +-
> >  include/linux/pci.h                     |   8 ++
> >  5 files changed, 137 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > index 25c9c39770c6..ebabd0d2ae88 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > @@ -375,3 +375,32 @@ Description:
> >  		The value comes from the PCI kernel device state and can be one
> >  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> >  		The file is read only.
> > +
> > +What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
> > +Date:		January 2021
> > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > +Description:
> > +		This file is associated with a SR-IOV PF.  It contains the
> > +		total number of MSI-X vectors available for assignment to
> > +		all VFs associated with PF.  It will zero if the device
>
> "The value will be zero if the device..."

Thanks, will fix when apply or resend if more fixes will be needed.

>
> And definition of "VF" and PF" are where in this file?

They come from the PCI spec. It is part of SR-IOV lingo.

>
> Otherwise looks semi-sane to me, thanks.
>
> greg k-h

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
  2021-03-01  8:32     ` Leon Romanovsky
@ 2021-03-01  8:37       ` Greg Kroah-Hartman
  2021-03-01  8:53         ` Leon Romanovsky
  0 siblings, 1 reply; 74+ messages in thread
From: Greg Kroah-Hartman @ 2021-03-01  8:37 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Mon, Mar 01, 2021 at 10:32:09AM +0200, Leon Romanovsky wrote:
> On Mon, Mar 01, 2021 at 09:14:42AM +0100, Greg Kroah-Hartman wrote:
> > On Mon, Mar 01, 2021 at 09:55:21AM +0200, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > A typical cloud provider SR-IOV use case is to create many VFs for use by
> > > guest VMs. The VFs may not be assigned to a VM until a customer requests a
> > > VM of a certain size, e.g., number of CPUs. A VF may need MSI-X vectors
> > > proportional to the number of CPUs in the VM, but there is no standard way
> > > to change the number of MSI-X vectors supported by a VF.
> > >
> > > Some Mellanox ConnectX devices support dynamic assignment of MSI-X vectors
> > > to SR-IOV VFs. This can be done by the PF driver after VFs are enabled,
> > > and it can be done without affecting VFs that are already in use. The
> > > hardware supports a limited pool of MSI-X vectors that can be assigned to
> > > the PF or to individual VFs.  This is device-specific behavior that
> > > requires support in the PF driver.
> > >
> > > Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a writable
> > > "sriov_vf_msix_count" file for each VF. Management software may use these
> > > to learn how many MSI-X vectors are available and to dynamically assign
> > > them to VFs before the VFs are passed through to a VM.
> > >
> > > If the PF driver implements the ->sriov_get_vf_total_msix() callback,
> > > "sriov_vf_total_msix" contains the total number of MSI-X vectors available
> > > for distribution among VFs.
> > >
> > > If no driver is bound to the VF, writing "N" to "sriov_vf_msix_count" uses
> > > the PF driver ->sriov_set_msix_vec_count() callback to assign "N" MSI-X
> > > vectors to the VF.  When a VF driver subsequently reads the MSI-X Message
> > > Control register, it will see the new Table Size "N".
> > >
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-pci |  29 +++++++
> > >  drivers/pci/iov.c                       | 102 ++++++++++++++++++++++--
> > >  drivers/pci/pci-sysfs.c                 |   3 +-
> > >  drivers/pci/pci.h                       |   3 +-
> > >  include/linux/pci.h                     |   8 ++
> > >  5 files changed, 137 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > index 25c9c39770c6..ebabd0d2ae88 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > @@ -375,3 +375,32 @@ Description:
> > >  		The value comes from the PCI kernel device state and can be one
> > >  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> > >  		The file is read only.
> > > +
> > > +What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
> > > +Date:		January 2021
> > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > +Description:
> > > +		This file is associated with a SR-IOV PF.  It contains the
> > > +		total number of MSI-X vectors available for assignment to
> > > +		all VFs associated with PF.  It will zero if the device
> >
> > "The value will be zero if the device..."
> 
> Thanks, will fix when apply or resend if more fixes will be needed.
> 
> >
> > And definition of "VF" and PF" are where in this file?
> 
> They come from the PCI spec. It is part of SR-IOV lingo.

Yes, and the world does not have access to the PCI spec, so please
provide a hint as to what they are for those of us without access to
such things.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
  2021-03-01  8:37       ` Greg Kroah-Hartman
@ 2021-03-01  8:53         ` Leon Romanovsky
  0 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-01  8:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Alexander Duyck,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Don Dutile,
	Alex Williamson, David S . Miller

On Mon, Mar 01, 2021 at 09:37:17AM +0100, Greg Kroah-Hartman wrote:
> On Mon, Mar 01, 2021 at 10:32:09AM +0200, Leon Romanovsky wrote:
> > On Mon, Mar 01, 2021 at 09:14:42AM +0100, Greg Kroah-Hartman wrote:
> > > On Mon, Mar 01, 2021 at 09:55:21AM +0200, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > A typical cloud provider SR-IOV use case is to create many VFs for use by
> > > > guest VMs. The VFs may not be assigned to a VM until a customer requests a
> > > > VM of a certain size, e.g., number of CPUs. A VF may need MSI-X vectors
> > > > proportional to the number of CPUs in the VM, but there is no standard way
> > > > to change the number of MSI-X vectors supported by a VF.
> > > >
> > > > Some Mellanox ConnectX devices support dynamic assignment of MSI-X vectors
> > > > to SR-IOV VFs. This can be done by the PF driver after VFs are enabled,
> > > > and it can be done without affecting VFs that are already in use. The
> > > > hardware supports a limited pool of MSI-X vectors that can be assigned to
> > > > the PF or to individual VFs.  This is device-specific behavior that
> > > > requires support in the PF driver.
> > > >
> > > > Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a writable
> > > > "sriov_vf_msix_count" file for each VF. Management software may use these
> > > > to learn how many MSI-X vectors are available and to dynamically assign
> > > > them to VFs before the VFs are passed through to a VM.
> > > >
> > > > If the PF driver implements the ->sriov_get_vf_total_msix() callback,
> > > > "sriov_vf_total_msix" contains the total number of MSI-X vectors available
> > > > for distribution among VFs.
> > > >
> > > > If no driver is bound to the VF, writing "N" to "sriov_vf_msix_count" uses
> > > > the PF driver ->sriov_set_msix_vec_count() callback to assign "N" MSI-X
> > > > vectors to the VF.  When a VF driver subsequently reads the MSI-X Message
> > > > Control register, it will see the new Table Size "N".
> > > >
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > ---
> > > >  Documentation/ABI/testing/sysfs-bus-pci |  29 +++++++
> > > >  drivers/pci/iov.c                       | 102 ++++++++++++++++++++++--
> > > >  drivers/pci/pci-sysfs.c                 |   3 +-
> > > >  drivers/pci/pci.h                       |   3 +-
> > > >  include/linux/pci.h                     |   8 ++
> > > >  5 files changed, 137 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > index 25c9c39770c6..ebabd0d2ae88 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > @@ -375,3 +375,32 @@ Description:
> > > >  		The value comes from the PCI kernel device state and can be one
> > > >  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> > > >  		The file is read only.
> > > > +
> > > > +What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
> > > > +Date:		January 2021
> > > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > > +Description:
> > > > +		This file is associated with a SR-IOV PF.  It contains the
> > > > +		total number of MSI-X vectors available for assignment to
> > > > +		all VFs associated with PF.  It will zero if the device
> > >
> > > "The value will be zero if the device..."
> >
> > Thanks, will fix when apply or resend if more fixes will be needed.
> >
> > >
> > > And definition of "VF" and PF" are where in this file?
> >
> > They come from the PCI spec. It is part of SR-IOV lingo.
>
> Yes, and the world does not have access to the PCI spec, so please
> provide a hint as to what they are for those of us without access to
> such things.

No problem, I will replace PF to be "physical function" and VF to be
"virtual function".

Thanks

>
> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-01  7:55 [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (3 preceding siblings ...)
  2021-03-01  7:55 ` [PATCH mlx5-next v7 4/4] net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks Leon Romanovsky
@ 2021-03-07  8:11 ` Leon Romanovsky
  2021-03-07 18:55 ` Alexander Duyck
  2021-03-10  5:58 ` Leon Romanovsky
  6 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-07  8:11 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Saeed Mahameed, Alexander Duyck, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Mon, Mar 01, 2021 at 09:55:20AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> @Alexander Duyck, please update me if I can add your ROB tag again
> to the series, because you liked v6 more.
>
> Thanks
>
> ---------------------------------------------------------------------------------
> Changelog
> v7:
>  * Rebase on top v5.12-rc1
>  * More english fixes
>  * Returned to static sysfs creation model as was implemented in v0/v1.

Bjorn, are we good here?

Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-01  7:55 [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (4 preceding siblings ...)
  2021-03-07  8:11 ` [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-03-07 18:55 ` Alexander Duyck
  2021-03-07 19:19   ` Leon Romanovsky
  2021-03-10 19:09   ` Bjorn Helgaas
  2021-03-10  5:58 ` Leon Romanovsky
  6 siblings, 2 replies; 74+ messages in thread
From: Alexander Duyck @ 2021-03-07 18:55 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> From: Leon Romanovsky <leonro@nvidia.com>
>
> @Alexander Duyck, please update me if I can add your ROB tag again
> to the series, because you liked v6 more.
>
> Thanks
>
> ---------------------------------------------------------------------------------
> Changelog
> v7:
>  * Rebase on top v5.12-rc1
>  * More english fixes
>  * Returned to static sysfs creation model as was implemented in v0/v1.

Yeah, so I am not a fan of the series. The problem is there is only
one driver that supports this, all VFs are going to expose this sysfs,
and I don't know how likely it is that any others are going to
implement this functionality. I feel like you threw out all the
progress from v2-v6.

I really feel like the big issue is that this model is broken as you
have the VFs exposing sysfs interfaces that make use of the PFs to
actually implement. Greg's complaint was the PF pushing sysfs onto the
VFs. My complaint is VFs sysfs files operating on the PF. The trick is
to find a way to address both issues.

Maybe the compromise is to reach down into the IOV code and have it
register the sysfs interface at device creation time in something like
pci_iov_sysfs_link if the PF has the functionality present to support
it.

Also we might want to double check that the PF cannot be unbound while
the VF is present. I know for a while there it was possible to remove
the PF driver while the VF was present. The Mellanox drivers may not
allow it but it might not hurt to look at taking a reference against
the PF driver if you are allocating the VF MSI-X configuration sysfs
file.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-07 18:55 ` Alexander Duyck
@ 2021-03-07 19:19   ` Leon Romanovsky
  2021-03-08 16:33     ` Alexander Duyck
  2021-03-10 19:09   ` Bjorn Helgaas
  1 sibling, 1 reply; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-07 19:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > @Alexander Duyck, please update me if I can add your ROB tag again
> > to the series, because you liked v6 more.
> >
> > Thanks
> >
> > ---------------------------------------------------------------------------------
> > Changelog
> > v7:
> >  * Rebase on top v5.12-rc1
> >  * More english fixes
> >  * Returned to static sysfs creation model as was implemented in v0/v1.
>
> Yeah, so I am not a fan of the series. The problem is there is only
> one driver that supports this, all VFs are going to expose this sysfs,
> and I don't know how likely it is that any others are going to
> implement this functionality. I feel like you threw out all the
> progress from v2-v6.

I'm with you here and tried to present the rationale in v6 when had
a discussion with Bjorn, so it is unfair to say "you threw out".

Bjorn expressed his preference, and no one came forward to support v6.

>
> I really feel like the big issue is that this model is broken as you
> have the VFs exposing sysfs interfaces that make use of the PFs to
> actually implement. Greg's complaint was the PF pushing sysfs onto the
> VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> to find a way to address both issues.

It is hard to say something meaningful about Greg's complain, he was
added in the middle of the discussion without much chances to get full
picture.

>
> Maybe the compromise is to reach down into the IOV code and have it
> register the sysfs interface at device creation time in something like
> pci_iov_sysfs_link if the PF has the functionality present to support
> it.

IMHO, it adds nothing.

>
> Also we might want to double check that the PF cannot be unbound while
> the VF is present. I know for a while there it was possible to remove
> the PF driver while the VF was present. The Mellanox drivers may not
> allow it but it might not hurt to look at taking a reference against
> the PF driver if you are allocating the VF MSI-X configuration sysfs
> file.

Right now, we always allocate these sysfs without relation if PF
supports or not. The check is done during write() call to such sysfs
and at that phase we check the existence of the drivers. It greatly
simplifies creation phase.

Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-07 19:19   ` Leon Romanovsky
@ 2021-03-08 16:33     ` Alexander Duyck
  2021-03-08 19:20       ` Leon Romanovsky
  0 siblings, 1 reply; 74+ messages in thread
From: Alexander Duyck @ 2021-03-08 16:33 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Sun, Mar 7, 2021 at 11:19 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > to the series, because you liked v6 more.
> > >
> > > Thanks
> > >
> > > ---------------------------------------------------------------------------------
> > > Changelog
> > > v7:
> > >  * Rebase on top v5.12-rc1
> > >  * More english fixes
> > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> >
> > Yeah, so I am not a fan of the series. The problem is there is only
> > one driver that supports this, all VFs are going to expose this sysfs,
> > and I don't know how likely it is that any others are going to
> > implement this functionality. I feel like you threw out all the
> > progress from v2-v6.
>
> I'm with you here and tried to present the rationale in v6 when had
> a discussion with Bjorn, so it is unfair to say "you threw out".
>
> Bjorn expressed his preference, and no one came forward to support v6.

Sorry, it wasn't my intention to be accusatory. I'm just not a fan of
going back to where we were with v1.

With that said, if it is what Bjorn wants then you are probably better
off going with that. However if that is the direction we are going in
then you should probably focus on getting his Reviewed-by or Ack since
he will ultimately be the maintainer for the code.

> >
> > I really feel like the big issue is that this model is broken as you
> > have the VFs exposing sysfs interfaces that make use of the PFs to
> > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > to find a way to address both issues.
>
> It is hard to say something meaningful about Greg's complain, he was
> added in the middle of the discussion without much chances to get full
> picture.

Right, but what I am getting at is that the underlying problem is that
you either have sysfs being pushed onto a remote device, or sysfs that
is having to call into another device. It's not exactly something we
have had precedent for enabling before, and either perspective seems a
bit ugly.

> >
> > Maybe the compromise is to reach down into the IOV code and have it
> > register the sysfs interface at device creation time in something like
> > pci_iov_sysfs_link if the PF has the functionality present to support
> > it.
>
> IMHO, it adds nothing.

My thought was to reduce clutter. As I mentioned before with this
patch set we are enabling sysfs for functionality that is currently
only exposed by one device. I'm not sure it will be used by many
others or not. Having these sysfs interfaces instantiated at probe
time or at creation time in the case of VFs was preferable to me.

> >
> > Also we might want to double check that the PF cannot be unbound while
> > the VF is present. I know for a while there it was possible to remove
> > the PF driver while the VF was present. The Mellanox drivers may not
> > allow it but it might not hurt to look at taking a reference against
> > the PF driver if you are allocating the VF MSI-X configuration sysfs
> > file.
>
> Right now, we always allocate these sysfs without relation if PF
> supports or not. The check is done during write() call to such sysfs
> and at that phase we check the existence of the drivers. It greatly
> simplifies creation phase.

Yeah, I see that. From what I can tell the locking looks correct to
keep things from breaking. For what we have it is probably good enough
to keep things from causing any issues. My concern was more about
preventing the driver from reloading if we only exposed these
interfaces if the PF driver supported them.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-08 16:33     ` Alexander Duyck
@ 2021-03-08 19:20       ` Leon Romanovsky
  0 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-08 19:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Mon, Mar 08, 2021 at 08:33:03AM -0800, Alexander Duyck wrote:
> On Sun, Mar 7, 2021 at 11:19 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > to the series, because you liked v6 more.
> > > >
> > > > Thanks
> > > >
> > > > ---------------------------------------------------------------------------------
> > > > Changelog
> > > > v7:
> > > >  * Rebase on top v5.12-rc1
> > > >  * More english fixes
> > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> > >
> > > Yeah, so I am not a fan of the series. The problem is there is only
> > > one driver that supports this, all VFs are going to expose this sysfs,
> > > and I don't know how likely it is that any others are going to
> > > implement this functionality. I feel like you threw out all the
> > > progress from v2-v6.
> >
> > I'm with you here and tried to present the rationale in v6 when had
> > a discussion with Bjorn, so it is unfair to say "you threw out".
> >
> > Bjorn expressed his preference, and no one came forward to support v6.
>
> Sorry, it wasn't my intention to be accusatory. I'm just not a fan of
> going back to where we were with v1.
>
> With that said, if it is what Bjorn wants then you are probably better
> off going with that. However if that is the direction we are going in
> then you should probably focus on getting his Reviewed-by or Ack since
> he will ultimately be the maintainer for the code.

I hope that he will do it soon.

>
> > >
> > > I really feel like the big issue is that this model is broken as you
> > > have the VFs exposing sysfs interfaces that make use of the PFs to
> > > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > > to find a way to address both issues.
> >
> > It is hard to say something meaningful about Greg's complain, he was
> > added in the middle of the discussion without much chances to get full
> > picture.
>
> Right, but what I am getting at is that the underlying problem is that
> you either have sysfs being pushed onto a remote device, or sysfs that
> is having to call into another device. It's not exactly something we
> have had precedent for enabling before, and either perspective seems a
> bit ugly.

I don't agree with the word "ugly", but it is not the point. The point
is that this interface is backed by the ecosystem and must-to-be for the
right SR-IOV utilization.

>
> > >
> > > Maybe the compromise is to reach down into the IOV code and have it
> > > register the sysfs interface at device creation time in something like
> > > pci_iov_sysfs_link if the PF has the functionality present to support
> > > it.
> >
> > IMHO, it adds nothing.
>
> My thought was to reduce clutter. As I mentioned before with this
> patch set we are enabling sysfs for functionality that is currently
> only exposed by one device. I'm not sure it will be used by many
> others or not. Having these sysfs interfaces instantiated at probe
> time or at creation time in the case of VFs was preferable to me.

I said that in v6 to Bjorn, that I expect up to 2-3 vendors to support
this knob. There are not many devices in the market that are comparable
to the mlx5 both in their complexity and adoption.

Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-01  7:55 [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (5 preceding siblings ...)
  2021-03-07 18:55 ` Alexander Duyck
@ 2021-03-10  5:58 ` Leon Romanovsky
  6 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-10  5:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Saeed Mahameed, Alexander Duyck, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Mon, Mar 01, 2021 at 09:55:20AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>

<...>

> Leon Romanovsky (4):
>   PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs
>   net/mlx5: Add dynamic MSI-X capabilities bits
>   net/mlx5: Dynamically assign MSI-X vectors count
>   net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks

Bjorn,

This is gentle reminder, can we please progress with this series?

Thanks

>
>  Documentation/ABI/testing/sysfs-bus-pci       |  29 +++++
>  .../net/ethernet/mellanox/mlx5/core/main.c    |   6 ++
>  .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  12 +++
>  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  73 +++++++++++++
>  .../net/ethernet/mellanox/mlx5/core/sriov.c   |  48 ++++++++-
>  drivers/pci/iov.c                             | 102 ++++++++++++++++--
>  drivers/pci/pci-sysfs.c                       |   3 +-
>  drivers/pci/pci.h                             |   3 +-
>  include/linux/mlx5/mlx5_ifc.h                 |  11 +-
>  include/linux/pci.h                           |   8 ++
>  10 files changed, 284 insertions(+), 11 deletions(-)
>
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-07 18:55 ` Alexander Duyck
  2021-03-07 19:19   ` Leon Romanovsky
@ 2021-03-10 19:09   ` Bjorn Helgaas
  2021-03-10 20:10     ` Leon Romanovsky
  2021-03-10 23:34     ` Alexander Duyck
  1 sibling, 2 replies; 74+ messages in thread
From: Bjorn Helgaas @ 2021-03-10 19:09 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > @Alexander Duyck, please update me if I can add your ROB tag again
> > to the series, because you liked v6 more.
> >
> > Thanks
> >
> > ---------------------------------------------------------------------------------
> > Changelog
> > v7:
> >  * Rebase on top v5.12-rc1
> >  * More english fixes
> >  * Returned to static sysfs creation model as was implemented in v0/v1.
> 
> Yeah, so I am not a fan of the series. The problem is there is only
> one driver that supports this, all VFs are going to expose this sysfs,
> and I don't know how likely it is that any others are going to
> implement this functionality. I feel like you threw out all the
> progress from v2-v6.

pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
sysfs files regardless of whether the PF driver was bound.

> I really feel like the big issue is that this model is broken as you
> have the VFs exposing sysfs interfaces that make use of the PFs to
> actually implement. Greg's complaint was the PF pushing sysfs onto the
> VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> to find a way to address both issues.
> 
> Maybe the compromise is to reach down into the IOV code and have it
> register the sysfs interface at device creation time in something like
> pci_iov_sysfs_link if the PF has the functionality present to support
> it.

IIUC there are two questions on the table:

  1) Should the sysfs files be visible only when a PF driver that
     supports MSI-X vector assignment is bound?

     I think this is a cosmetic issue.  The presence of the file is
     not a reliable signal to management software; it must always
     tolerate files that don't exist (e.g., on old kernels) or files
     that are visible but don't work (e.g., vectors may be exhausted).

     If we start with the files always being visible, we should be
     able to add smarts later to expose them only when the PF driver
     is bound.

     My concerns with pci_enable_vf_overlay() are that it uses a
     little more sysfs internals than I'd like (although there are
     many callers of sysfs_create_files()) and it uses
     pci_get_domain_bus_and_slot(), which is generally a hack and
     creates refcounting hassles.  Speaking of which, isn't v6 missing
     a pci_dev_put() to match the pci_get_domain_bus_and_slot()?

  2) Should a VF sysfs file use the PF to implement this?

     Can you elaborate on your idea here?  I guess
     pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
     VF, and you're thinking we could also make a "virtfnX_msix_count"
     in the PF directory?  That's a really interesting idea.

> Also we might want to double check that the PF cannot be unbound while
> the VF is present. I know for a while there it was possible to remove
> the PF driver while the VF was present. The Mellanox drivers may not
> allow it but it might not hurt to look at taking a reference against
> the PF driver if you are allocating the VF MSI-X configuration sysfs
> file.

Unbinding the PF driver will either remove the *_msix_count files or
make them stop working.  Is that a problem?  I'm not sure we should
add a magic link that prevents driver unbinding.  Seems like it would
be hard for users to figure out why the driver can't be removed.

Bjorn

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-10 19:09   ` Bjorn Helgaas
@ 2021-03-10 20:10     ` Leon Romanovsky
  2021-03-10 20:21       ` Greg Kroah-Hartman
  2021-03-10 23:34     ` Alexander Duyck
  1 sibling, 1 reply; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-10 20:10 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Wed, Mar 10, 2021 at 01:09:06PM -0600, Bjorn Helgaas wrote:
> On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > to the series, because you liked v6 more.
> > >
> > > Thanks
> > >
> > > ---------------------------------------------------------------------------------
> > > Changelog
> > > v7:
> > >  * Rebase on top v5.12-rc1
> > >  * More english fixes
> > >  * Returned to static sysfs creation model as was implemented in v0/v1.

<...>

>   2) Should a VF sysfs file use the PF to implement this?
>
>      Can you elaborate on your idea here?  I guess
>      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
>      VF, and you're thinking we could also make a "virtfnX_msix_count"
>      in the PF directory?  That's a really interesting idea.

I want to remind that we are talking about mlx5 devices that support
upto 255 VFs and they indeed are used to their limits. So seeing 255
links of virtfnX_msix_count in the same directory looks too much unpleasant
to me.

Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-10 20:10     ` Leon Romanovsky
@ 2021-03-10 20:21       ` Greg Kroah-Hartman
  2021-03-11  8:37         ` Leon Romanovsky
  0 siblings, 1 reply; 74+ messages in thread
From: Greg Kroah-Hartman @ 2021-03-10 20:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Alexander Duyck, Bjorn Helgaas, Saeed Mahameed,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller

On Wed, Mar 10, 2021 at 10:10:41PM +0200, Leon Romanovsky wrote:
> On Wed, Mar 10, 2021 at 01:09:06PM -0600, Bjorn Helgaas wrote:
> > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > to the series, because you liked v6 more.
> > > >
> > > > Thanks
> > > >
> > > > ---------------------------------------------------------------------------------
> > > > Changelog
> > > > v7:
> > > >  * Rebase on top v5.12-rc1
> > > >  * More english fixes
> > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> 
> <...>
> 
> >   2) Should a VF sysfs file use the PF to implement this?
> >
> >      Can you elaborate on your idea here?  I guess
> >      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
> >      VF, and you're thinking we could also make a "virtfnX_msix_count"
> >      in the PF directory?  That's a really interesting idea.
> 
> I want to remind that we are talking about mlx5 devices that support
> upto 255 VFs and they indeed are used to their limits. So seeing 255
> links of virtfnX_msix_count in the same directory looks too much unpleasant
> to me.

255 files are nothing, if that's what the hardware supports, what is the
problem?  If it's "unpleasant", go complain to the hardware designers :)

greg k-h

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-10 19:09   ` Bjorn Helgaas
  2021-03-10 20:10     ` Leon Romanovsky
@ 2021-03-10 23:34     ` Alexander Duyck
  2021-03-11 18:17       ` Bjorn Helgaas
  1 sibling, 1 reply; 74+ messages in thread
From: Alexander Duyck @ 2021-03-10 23:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > to the series, because you liked v6 more.
> > >
> > > Thanks
> > >
> > > ---------------------------------------------------------------------------------
> > > Changelog
> > > v7:
> > >  * Rebase on top v5.12-rc1
> > >  * More english fixes
> > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> >
> > Yeah, so I am not a fan of the series. The problem is there is only
> > one driver that supports this, all VFs are going to expose this sysfs,
> > and I don't know how likely it is that any others are going to
> > implement this functionality. I feel like you threw out all the
> > progress from v2-v6.
>
> pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
> sysfs files regardless of whether the PF driver was bound.
>
> > I really feel like the big issue is that this model is broken as you
> > have the VFs exposing sysfs interfaces that make use of the PFs to
> > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > to find a way to address both issues.
> >
> > Maybe the compromise is to reach down into the IOV code and have it
> > register the sysfs interface at device creation time in something like
> > pci_iov_sysfs_link if the PF has the functionality present to support
> > it.
>
> IIUC there are two questions on the table:
>
>   1) Should the sysfs files be visible only when a PF driver that
>      supports MSI-X vector assignment is bound?
>
>      I think this is a cosmetic issue.  The presence of the file is
>      not a reliable signal to management software; it must always
>      tolerate files that don't exist (e.g., on old kernels) or files
>      that are visible but don't work (e.g., vectors may be exhausted).
>
>      If we start with the files always being visible, we should be
>      able to add smarts later to expose them only when the PF driver
>      is bound.
>
>      My concerns with pci_enable_vf_overlay() are that it uses a
>      little more sysfs internals than I'd like (although there are
>      many callers of sysfs_create_files()) and it uses
>      pci_get_domain_bus_and_slot(), which is generally a hack and
>      creates refcounting hassles.  Speaking of which, isn't v6 missing
>      a pci_dev_put() to match the pci_get_domain_bus_and_slot()?

I'm not so much worried about management software as the fact that
this is a vendor specific implementation detail that is shaping how
the kernel interfaces are meant to work. Other than the mlx5 I don't
know if there are any other vendors really onboard with this sort of
solution.

In addition it still feels rather hacky to be modifying read-only PCIe
configuration space on the fly via a backdoor provided by the PF. It
almost feels like this should be some sort of quirk rather than a
standard feature for an SR-IOV VF.

>   2) Should a VF sysfs file use the PF to implement this?
>
>      Can you elaborate on your idea here?  I guess
>      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
>      VF, and you're thinking we could also make a "virtfnX_msix_count"
>      in the PF directory?  That's a really interesting idea.

I would honestly be more comfortable if the PF owned these files
instead of the VFs. One of the things I didn't like about this back
during the V1/2 days was the fact that it gave the impression that
MSI-X count was something that is meant to be edited. Since then I
think at least the naming was changed so that it implies that this is
only possible due to SR-IOV.

I also didn't like that it makes the VFs feel like they are port
representors rather than being actual PCIe devices. Having
functionality that only works when the VF driver is not loaded just
feels off. The VF sysfs directory feels like it is being used as a
subdirectory of the PF rather than being a device on its own.

> > Also we might want to double check that the PF cannot be unbound while
> > the VF is present. I know for a while there it was possible to remove
> > the PF driver while the VF was present. The Mellanox drivers may not
> > allow it but it might not hurt to look at taking a reference against
> > the PF driver if you are allocating the VF MSI-X configuration sysfs
> > file.
>
> Unbinding the PF driver will either remove the *_msix_count files or
> make them stop working.  Is that a problem?  I'm not sure we should
> add a magic link that prevents driver unbinding.  Seems like it would
> be hard for users to figure out why the driver can't be removed.

I checked it again, it will make the *_msix_count files stop working.
In order to guarantee it cannot happen in the middle of things though
we are sitting on the device locks for both the PF and the VF. I'm not
a fan of having to hold 2 locks while we perform a firmware operation
for one device, but I couldn't find anything where we would deadlock
so it should be fine.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-10 20:21       ` Greg Kroah-Hartman
@ 2021-03-11  8:37         ` Leon Romanovsky
  0 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-11  8:37 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Bjorn Helgaas, Alexander Duyck, Bjorn Helgaas, Saeed Mahameed,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller

On Wed, Mar 10, 2021 at 09:21:55PM +0100, Greg Kroah-Hartman wrote:
> On Wed, Mar 10, 2021 at 10:10:41PM +0200, Leon Romanovsky wrote:
> > On Wed, Mar 10, 2021 at 01:09:06PM -0600, Bjorn Helgaas wrote:
> > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > to the series, because you liked v6 more.
> > > > >
> > > > > Thanks
> > > > >
> > > > > ---------------------------------------------------------------------------------
> > > > > Changelog
> > > > > v7:
> > > > >  * Rebase on top v5.12-rc1
> > > > >  * More english fixes
> > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> >
> > <...>
> >
> > >   2) Should a VF sysfs file use the PF to implement this?
> > >
> > >      Can you elaborate on your idea here?  I guess
> > >      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
> > >      VF, and you're thinking we could also make a "virtfnX_msix_count"
> > >      in the PF directory?  That's a really interesting idea.
> >
> > I want to remind that we are talking about mlx5 devices that support
> > upto 255 VFs and they indeed are used to their limits. So seeing 255
> > links of virtfnX_msix_count in the same directory looks too much unpleasant
> > to me.
>
> 255 files are nothing, if that's what the hardware supports, what is the
> problem?  If it's "unpleasant", go complain to the hardware designers :)

It is 255 same files that every SR-IOV user will see in /sys/bus/pci/devices/*/
folder, unless we will do dynamic creation of those files and this is something
that Bjorn didn't like in v7.

So instead of complaining to the hardware designers, I will complain here.
I probably implemented all possible variants already. :)

Thanks

>
> greg k-h

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-10 23:34     ` Alexander Duyck
@ 2021-03-11 18:17       ` Bjorn Helgaas
  2021-03-11 19:16         ` Keith Busch
                           ` (3 more replies)
  0 siblings, 4 replies; 74+ messages in thread
From: Bjorn Helgaas @ 2021-03-11 18:17 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > to the series, because you liked v6 more.
> > > >
> > > > Thanks
> > > >
> > > > ---------------------------------------------------------------------------------
> > > > Changelog
> > > > v7:
> > > >  * Rebase on top v5.12-rc1
> > > >  * More english fixes
> > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> > >
> > > Yeah, so I am not a fan of the series. The problem is there is only
> > > one driver that supports this, all VFs are going to expose this sysfs,
> > > and I don't know how likely it is that any others are going to
> > > implement this functionality. I feel like you threw out all the
> > > progress from v2-v6.
> >
> > pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
> > sysfs files regardless of whether the PF driver was bound.
> >
> > > I really feel like the big issue is that this model is broken as you
> > > have the VFs exposing sysfs interfaces that make use of the PFs to
> > > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > > to find a way to address both issues.
> > >
> > > Maybe the compromise is to reach down into the IOV code and have it
> > > register the sysfs interface at device creation time in something like
> > > pci_iov_sysfs_link if the PF has the functionality present to support
> > > it.
> >
> > IIUC there are two questions on the table:
> >
> >   1) Should the sysfs files be visible only when a PF driver that
> >      supports MSI-X vector assignment is bound?
> >
> >      I think this is a cosmetic issue.  The presence of the file is
> >      not a reliable signal to management software; it must always
> >      tolerate files that don't exist (e.g., on old kernels) or files
> >      that are visible but don't work (e.g., vectors may be exhausted).
> >
> >      If we start with the files always being visible, we should be
> >      able to add smarts later to expose them only when the PF driver
> >      is bound.
> >
> >      My concerns with pci_enable_vf_overlay() are that it uses a
> >      little more sysfs internals than I'd like (although there are
> >      many callers of sysfs_create_files()) and it uses
> >      pci_get_domain_bus_and_slot(), which is generally a hack and
> >      creates refcounting hassles.  Speaking of which, isn't v6 missing
> >      a pci_dev_put() to match the pci_get_domain_bus_and_slot()?
> 
> I'm not so much worried about management software as the fact that
> this is a vendor specific implementation detail that is shaping how
> the kernel interfaces are meant to work. Other than the mlx5 I don't
> know if there are any other vendors really onboard with this sort of
> solution.

I know this is currently vendor-specific, but I thought the value
proposition of dynamic configuration of VFs for different clients
sounded compelling enough that other vendors would do something
similar.  But I'm not an SR-IOV guy and have no vendor insight, so
maybe that's not the case?

> In addition it still feels rather hacky to be modifying read-only PCIe
> configuration space on the fly via a backdoor provided by the PF. It
> almost feels like this should be some sort of quirk rather than a
> standard feature for an SR-IOV VF.

I agree, I'm not 100% comfortable with modifying the read-only Table
Size register.  Maybe there's another approach that would be better?
It *is* nice that the current approach doesn't require changes in the
VF driver.

> >   2) Should a VF sysfs file use the PF to implement this?
> >
> >      Can you elaborate on your idea here?  I guess
> >      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
> >      VF, and you're thinking we could also make a "virtfnX_msix_count"
> >      in the PF directory?  That's a really interesting idea.
> 
> I would honestly be more comfortable if the PF owned these files
> instead of the VFs. One of the things I didn't like about this back
> during the V1/2 days was the fact that it gave the impression that
> MSI-X count was something that is meant to be edited. Since then I
> think at least the naming was changed so that it implies that this is
> only possible due to SR-IOV.
> 
> I also didn't like that it makes the VFs feel like they are port
> representors rather than being actual PCIe devices. Having
> functionality that only works when the VF driver is not loaded just
> feels off. The VF sysfs directory feels like it is being used as a
> subdirectory of the PF rather than being a device on its own.

Moving "virtfnX_msix_count" to the PF seems like it would mitigate
this somewhat.  I don't know how to make this work while a VF driver
is bound without making the VF feel even less like a PCIe device,
i.e., we won't be able to use the standard MSI-X model.

> > > Also we might want to double check that the PF cannot be unbound while
> > > the VF is present. I know for a while there it was possible to remove
> > > the PF driver while the VF was present. The Mellanox drivers may not
> > > allow it but it might not hurt to look at taking a reference against
> > > the PF driver if you are allocating the VF MSI-X configuration sysfs
> > > file.
> >
> > Unbinding the PF driver will either remove the *_msix_count files or
> > make them stop working.  Is that a problem?  I'm not sure we should
> > add a magic link that prevents driver unbinding.  Seems like it would
> > be hard for users to figure out why the driver can't be removed.
> 
> I checked it again, it will make the *_msix_count files stop working.
> In order to guarantee it cannot happen in the middle of things though
> we are sitting on the device locks for both the PF and the VF. I'm not
> a fan of having to hold 2 locks while we perform a firmware operation
> for one device, but I couldn't find anything where we would deadlock
> so it should be fine.

I agree again, it's not ideal to hold two locks.  Is it possible we
could get by without the VF lock?  If we simply check whether a VF
driver is bound (without a lock), a VF driver bind can race with the
PF sriov_set_msix_vec_count().

If the VF driver bind wins, it reads the old Table Size.  If it reads
a too-small size, it won't use all the vectors.  If it reads a
too-large size, it will try to use too many vectors and some won't
work.  But the race would be caused by a bug in the management
software, and the consequence doesn't seem *terrible*.

Bjorn

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 18:17       ` Bjorn Helgaas
@ 2021-03-11 19:16         ` Keith Busch
  2021-03-11 19:21           ` Leon Romanovsky
  2021-03-11 20:22           ` Jason Gunthorpe
  2021-03-11 19:17         ` Leon Romanovsky
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 74+ messages in thread
From: Keith Busch @ 2021-03-11 19:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > 
> > I'm not so much worried about management software as the fact that
> > this is a vendor specific implementation detail that is shaping how
> > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > know if there are any other vendors really onboard with this sort of
> > solution.
> 
> I know this is currently vendor-specific, but I thought the value
> proposition of dynamic configuration of VFs for different clients
> sounded compelling enough that other vendors would do something
> similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> maybe that's not the case?

NVMe has a similar feature defined by the standard where a PF controller can
dynamically assign MSIx vectors to VFs. The whole thing is managed in user
space with an ioctl, though. I guess we could wire up the driver to handle it
through this sysfs interface too, but I think the protocol specific tooling is
more appropriate for nvme.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 18:17       ` Bjorn Helgaas
  2021-03-11 19:16         ` Keith Busch
@ 2021-03-11 19:17         ` Leon Romanovsky
  2021-03-11 19:37         ` Alexander Duyck
  2021-03-11 20:31         ` Jason Gunthorpe
  3 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-11 19:17 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > to the series, because you liked v6 more.
> > > > >
> > > > > Thanks
> > > > >
> > > > > ---------------------------------------------------------------------------------
> > > > > Changelog
> > > > > v7:
> > > > >  * Rebase on top v5.12-rc1
> > > > >  * More english fixes
> > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> > > >
> > > > Yeah, so I am not a fan of the series. The problem is there is only
> > > > one driver that supports this, all VFs are going to expose this sysfs,
> > > > and I don't know how likely it is that any others are going to
> > > > implement this functionality. I feel like you threw out all the
> > > > progress from v2-v6.
> > >
> > > pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
> > > sysfs files regardless of whether the PF driver was bound.
> > >
> > > > I really feel like the big issue is that this model is broken as you
> > > > have the VFs exposing sysfs interfaces that make use of the PFs to
> > > > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > > > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > > > to find a way to address both issues.
> > > >
> > > > Maybe the compromise is to reach down into the IOV code and have it
> > > > register the sysfs interface at device creation time in something like
> > > > pci_iov_sysfs_link if the PF has the functionality present to support
> > > > it.
> > >
> > > IIUC there are two questions on the table:
> > >
> > >   1) Should the sysfs files be visible only when a PF driver that
> > >      supports MSI-X vector assignment is bound?
> > >
> > >      I think this is a cosmetic issue.  The presence of the file is
> > >      not a reliable signal to management software; it must always
> > >      tolerate files that don't exist (e.g., on old kernels) or files
> > >      that are visible but don't work (e.g., vectors may be exhausted).
> > >
> > >      If we start with the files always being visible, we should be
> > >      able to add smarts later to expose them only when the PF driver
> > >      is bound.
> > >
> > >      My concerns with pci_enable_vf_overlay() are that it uses a
> > >      little more sysfs internals than I'd like (although there are
> > >      many callers of sysfs_create_files()) and it uses
> > >      pci_get_domain_bus_and_slot(), which is generally a hack and
> > >      creates refcounting hassles.  Speaking of which, isn't v6 missing
> > >      a pci_dev_put() to match the pci_get_domain_bus_and_slot()?
> >
> > I'm not so much worried about management software as the fact that
> > this is a vendor specific implementation detail that is shaping how
> > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > know if there are any other vendors really onboard with this sort of
> > solution.
>
> I know this is currently vendor-specific, but I thought the value
> proposition of dynamic configuration of VFs for different clients
> sounded compelling enough that other vendors would do something
> similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> maybe that's not the case?

It is not the case, any vendor who wants to compete with Mellanox
devices in large scale clouds, will be asked to implement this feature.

You can't provide reliable VM service without it.

<...>

> > I checked it again, it will make the *_msix_count files stop working.
> > In order to guarantee it cannot happen in the middle of things though
> > we are sitting on the device locks for both the PF and the VF. I'm not
> > a fan of having to hold 2 locks while we perform a firmware operation
> > for one device, but I couldn't find anything where we would deadlock
> > so it should be fine.
>
> I agree again, it's not ideal to hold two locks.  Is it possible we
> could get by without the VF lock?  If we simply check whether a VF
> driver is bound (without a lock), a VF driver bind can race with the
> PF sriov_set_msix_vec_count().

We are holding huge amount of locks simultaneously. I don't understand
why two locks can be so big deal now.

Alex said that my locking scheme is correct, I think you also said that.
Why do we need to add races and make correct solution to be half baked?

>
> If the VF driver bind wins, it reads the old Table Size.  If it reads
> a too-small size, it won't use all the vectors.  If it reads a
> too-large size, it will try to use too many vectors and some won't
> work.  But the race would be caused by a bug in the management
> software, and the consequence doesn't seem *terrible*.

It will present to the user/administrator incorrect picture where lspci
output won't be aligned with the real situation.

Thanks

>
> Bjorn

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 19:16         ` Keith Busch
@ 2021-03-11 19:21           ` Leon Romanovsky
  2021-03-11 20:22           ` Jason Gunthorpe
  1 sibling, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-11 19:21 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Alexander Duyck, Bjorn Helgaas, Saeed Mahameed,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > >
> > > I'm not so much worried about management software as the fact that
> > > this is a vendor specific implementation detail that is shaping how
> > > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > > know if there are any other vendors really onboard with this sort of
> > > solution.
> >
> > I know this is currently vendor-specific, but I thought the value
> > proposition of dynamic configuration of VFs for different clients
> > sounded compelling enough that other vendors would do something
> > similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> > maybe that's not the case?
>
> NVMe has a similar feature defined by the standard where a PF controller can
> dynamically assign MSIx vectors to VFs. The whole thing is managed in user
> space with an ioctl, though. I guess we could wire up the driver to handle it
> through this sysfs interface too, but I think the protocol specific tooling is
> more appropriate for nvme.

We need this MSI-X thing for IB and ethernet devices too. This is why PCI
sysfs is the best place to put it, so all SR-IOV flavours will have sane
way to configure it.

Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 18:17       ` Bjorn Helgaas
  2021-03-11 19:16         ` Keith Busch
  2021-03-11 19:17         ` Leon Romanovsky
@ 2021-03-11 19:37         ` Alexander Duyck
  2021-03-11 19:51           ` Leon Romanovsky
  2021-03-11 20:19           ` Jason Gunthorpe
  2021-03-11 20:31         ` Jason Gunthorpe
  3 siblings, 2 replies; 74+ messages in thread
From: Alexander Duyck @ 2021-03-11 19:37 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 10:17 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > to the series, because you liked v6 more.
> > > > >
> > > > > Thanks
> > > > >
> > > > > ---------------------------------------------------------------------------------
> > > > > Changelog
> > > > > v7:
> > > > >  * Rebase on top v5.12-rc1
> > > > >  * More english fixes
> > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
> > > >
> > > > Yeah, so I am not a fan of the series. The problem is there is only
> > > > one driver that supports this, all VFs are going to expose this sysfs,
> > > > and I don't know how likely it is that any others are going to
> > > > implement this functionality. I feel like you threw out all the
> > > > progress from v2-v6.
> > >
> > > pci_enable_vfs_overlay() turned up in v4, so I think v0-v3 had static
> > > sysfs files regardless of whether the PF driver was bound.
> > >
> > > > I really feel like the big issue is that this model is broken as you
> > > > have the VFs exposing sysfs interfaces that make use of the PFs to
> > > > actually implement. Greg's complaint was the PF pushing sysfs onto the
> > > > VFs. My complaint is VFs sysfs files operating on the PF. The trick is
> > > > to find a way to address both issues.
> > > >
> > > > Maybe the compromise is to reach down into the IOV code and have it
> > > > register the sysfs interface at device creation time in something like
> > > > pci_iov_sysfs_link if the PF has the functionality present to support
> > > > it.
> > >
> > > IIUC there are two questions on the table:
> > >
> > >   1) Should the sysfs files be visible only when a PF driver that
> > >      supports MSI-X vector assignment is bound?
> > >
> > >      I think this is a cosmetic issue.  The presence of the file is
> > >      not a reliable signal to management software; it must always
> > >      tolerate files that don't exist (e.g., on old kernels) or files
> > >      that are visible but don't work (e.g., vectors may be exhausted).
> > >
> > >      If we start with the files always being visible, we should be
> > >      able to add smarts later to expose them only when the PF driver
> > >      is bound.
> > >
> > >      My concerns with pci_enable_vf_overlay() are that it uses a
> > >      little more sysfs internals than I'd like (although there are
> > >      many callers of sysfs_create_files()) and it uses
> > >      pci_get_domain_bus_and_slot(), which is generally a hack and
> > >      creates refcounting hassles.  Speaking of which, isn't v6 missing
> > >      a pci_dev_put() to match the pci_get_domain_bus_and_slot()?
> >
> > I'm not so much worried about management software as the fact that
> > this is a vendor specific implementation detail that is shaping how
> > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > know if there are any other vendors really onboard with this sort of
> > solution.
>
> I know this is currently vendor-specific, but I thought the value
> proposition of dynamic configuration of VFs for different clients
> sounded compelling enough that other vendors would do something
> similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> maybe that's not the case?

The problem is there are multiple ways to deal with this issue. I have
worked on parts in the past that simply advertised a fixed table size
and then only allowed for configuring the number of vectors internally
so some vectors simply couldn't be unmasked. I don't know if that was
any better than this though. It is just yet another way to do this.

> > In addition it still feels rather hacky to be modifying read-only PCIe
> > configuration space on the fly via a backdoor provided by the PF. It
> > almost feels like this should be some sort of quirk rather than a
> > standard feature for an SR-IOV VF.
>
> I agree, I'm not 100% comfortable with modifying the read-only Table
> Size register.  Maybe there's another approach that would be better?
> It *is* nice that the current approach doesn't require changes in the
> VF driver.

Changes in the VF driver would be a non-starter since the VF driver
may be running in the guest. One thing that would have been nice is if
the hardware cthis change could have been delayed until the VF went
through something like an FLR where it would have had to unregister
all the MSI-X vectors and then reconfigure them after the reset. It
would have allowed us to get away from needing all the extra locking.
It would then just be something the PF would configure and apply to
the VF following a reset.

> > >   2) Should a VF sysfs file use the PF to implement this?
> > >
> > >      Can you elaborate on your idea here?  I guess
> > >      pci_iov_sysfs_link() makes a "virtfnX" link from the PF to the
> > >      VF, and you're thinking we could also make a "virtfnX_msix_count"
> > >      in the PF directory?  That's a really interesting idea.
> >
> > I would honestly be more comfortable if the PF owned these files
> > instead of the VFs. One of the things I didn't like about this back
> > during the V1/2 days was the fact that it gave the impression that
> > MSI-X count was something that is meant to be edited. Since then I
> > think at least the naming was changed so that it implies that this is
> > only possible due to SR-IOV.
> >
> > I also didn't like that it makes the VFs feel like they are port
> > representors rather than being actual PCIe devices. Having
> > functionality that only works when the VF driver is not loaded just
> > feels off. The VF sysfs directory feels like it is being used as a
> > subdirectory of the PF rather than being a device on its own.
>
> Moving "virtfnX_msix_count" to the PF seems like it would mitigate
> this somewhat.  I don't know how to make this work while a VF driver
> is bound without making the VF feel even less like a PCIe device,
> i.e., we won't be able to use the standard MSI-X model.

Yeah, I actually do kind of like that idea. In addition it would
address one of the things I pointed out as an issue before as you
could place the virtfn values that the total value in the same folder
so that it is all in one central spot rather than having to walk all
over the sysfs hierarchy to check the setting for each VF when trying
to figure out how the vectors are currently distributed.

> > > > Also we might want to double check that the PF cannot be unbound while
> > > > the VF is present. I know for a while there it was possible to remove
> > > > the PF driver while the VF was present. The Mellanox drivers may not
> > > > allow it but it might not hurt to look at taking a reference against
> > > > the PF driver if you are allocating the VF MSI-X configuration sysfs
> > > > file.
> > >
> > > Unbinding the PF driver will either remove the *_msix_count files or
> > > make them stop working.  Is that a problem?  I'm not sure we should
> > > add a magic link that prevents driver unbinding.  Seems like it would
> > > be hard for users to figure out why the driver can't be removed.
> >
> > I checked it again, it will make the *_msix_count files stop working.
> > In order to guarantee it cannot happen in the middle of things though
> > we are sitting on the device locks for both the PF and the VF. I'm not
> > a fan of having to hold 2 locks while we perform a firmware operation
> > for one device, but I couldn't find anything where we would deadlock
> > so it should be fine.
>
> I agree again, it's not ideal to hold two locks.  Is it possible we
> could get by without the VF lock?  If we simply check whether a VF
> driver is bound (without a lock), a VF driver bind can race with the
> PF sriov_set_msix_vec_count().

I wonder if we couldn't do something like a reversible version of the
kill_device function to make it so that a VF is marked as being
blocked from being probed until we come through and release it.

> If the VF driver bind wins, it reads the old Table Size.  If it reads
> a too-small size, it won't use all the vectors.  If it reads a
> too-large size, it will try to use too many vectors and some won't
> work.  But the race would be caused by a bug in the management
> software, and the consequence doesn't seem *terrible*.

Based on the idea I just mentioned above it might make sense to
possibly have an option to mark a device as "stale" instead of "dead",
or maybe we add an extra bit that can flag that it is only "mostly
dead" and we are expecting a recovery soon.

Then the flow for this could be changed where we take the VF lock and
mark it as "stale" to prevent any driver binding and then we can
release the VF lock. Next we would perform the PF operation telling it
to update the VF.  Then we spin on the VF waiting for the stale data
to be updated and once that happens we can pop the indication that the
device is "stale" freeing it for use. If something comes along that
decides it wants to kill it then it just clears the "stale" bit or
"mostly dead" flag indicating that in fact this is "dead" and needs to
be unloaded resulting in the loop being broken.

I feel like that might be a much better fit for the SR-IOV model as it
would be the PF directly disabling, modifying, and then restoring a VF
rather than requesting things from the VF itself.

Thoughts?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 19:37         ` Alexander Duyck
@ 2021-03-11 19:51           ` Leon Romanovsky
  2021-03-11 20:11             ` Alexander Duyck
  2021-03-11 20:19           ` Jason Gunthorpe
  1 sibling, 1 reply; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-11 19:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 11:37:28AM -0800, Alexander Duyck wrote:
> On Thu, Mar 11, 2021 at 10:17 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > >
> > > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > > to the series, because you liked v6 more.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > ---------------------------------------------------------------------------------
> > > > > > Changelog
> > > > > > v7:
> > > > > >  * Rebase on top v5.12-rc1
> > > > > >  * More english fixes
> > > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.

<...>

> > > representors rather than being actual PCIe devices. Having
> > > functionality that only works when the VF driver is not loaded just
> > > feels off. The VF sysfs directory feels like it is being used as a
> > > subdirectory of the PF rather than being a device on its own.
> >
> > Moving "virtfnX_msix_count" to the PF seems like it would mitigate
> > this somewhat.  I don't know how to make this work while a VF driver
> > is bound without making the VF feel even less like a PCIe device,
> > i.e., we won't be able to use the standard MSI-X model.
>
> Yeah, I actually do kind of like that idea. In addition it would
> address one of the things I pointed out as an issue before as you
> could place the virtfn values that the total value in the same folder
> so that it is all in one central spot rather than having to walk all
> over the sysfs hierarchy to check the setting for each VF when trying
> to figure out how the vectors are currently distributed.

User binds specific VF with specific PCI ID to VM, so instead of
changing MSI-X table for that specific VF, he will need to translate
from virtfn25_msix_count to PCI ID.

I also gave an example of my system where I have many PFs and VFs
function numbers are not distributed nicely. On that system virtfn25_msix_count
won't translate to AA:BB:CC.25 but to something else.

Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 19:51           ` Leon Romanovsky
@ 2021-03-11 20:11             ` Alexander Duyck
  0 siblings, 0 replies; 74+ messages in thread
From: Alexander Duyck @ 2021-03-11 20:11 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 11:51 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Mar 11, 2021 at 11:37:28AM -0800, Alexander Duyck wrote:
> > On Thu, Mar 11, 2021 at 10:17 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > > On Wed, Mar 10, 2021 at 11:09 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Sun, Mar 07, 2021 at 10:55:24AM -0800, Alexander Duyck wrote:
> > > > > > On Sun, Feb 28, 2021 at 11:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > >
> > > > > > > @Alexander Duyck, please update me if I can add your ROB tag again
> > > > > > > to the series, because you liked v6 more.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > ---------------------------------------------------------------------------------
> > > > > > > Changelog
> > > > > > > v7:
> > > > > > >  * Rebase on top v5.12-rc1
> > > > > > >  * More english fixes
> > > > > > >  * Returned to static sysfs creation model as was implemented in v0/v1.
>
> <...>
>
> > > > representors rather than being actual PCIe devices. Having
> > > > functionality that only works when the VF driver is not loaded just
> > > > feels off. The VF sysfs directory feels like it is being used as a
> > > > subdirectory of the PF rather than being a device on its own.
> > >
> > > Moving "virtfnX_msix_count" to the PF seems like it would mitigate
> > > this somewhat.  I don't know how to make this work while a VF driver
> > > is bound without making the VF feel even less like a PCIe device,
> > > i.e., we won't be able to use the standard MSI-X model.
> >
> > Yeah, I actually do kind of like that idea. In addition it would
> > address one of the things I pointed out as an issue before as you
> > could place the virtfn values that the total value in the same folder
> > so that it is all in one central spot rather than having to walk all
> > over the sysfs hierarchy to check the setting for each VF when trying
> > to figure out how the vectors are currently distributed.
>
> User binds specific VF with specific PCI ID to VM, so instead of
> changing MSI-X table for that specific VF, he will need to translate
> from virtfn25_msix_count to PCI ID.

Wouldn't that just be a matter of changing the naming so that the PCI
ID was present in the virtfn name?

> I also gave an example of my system where I have many PFs and VFs
> function numbers are not distributed nicely. On that system virtfn25_msix_count
> won't translate to AA:BB:CC.25 but to something else.

That isn't too surprising since normally we only support 7 functions
per device. I am okay with not using the name virtfnX. If you wanted
to embed the bus, device, func in the naming scheme that would work
for me too.

Really in general as a naming scheme just using a logical number have
probably never provided all that much value. There may be an argument
to be made for renaming the virtfn symlinks to include bus, device,
function instead.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 19:37         ` Alexander Duyck
  2021-03-11 19:51           ` Leon Romanovsky
@ 2021-03-11 20:19           ` Jason Gunthorpe
  2021-03-11 21:49             ` Alexander Duyck
  1 sibling, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-11 20:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Leon Romanovsky, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 11:37:28AM -0800, Alexander Duyck wrote:


> Then the flow for this could be changed where we take the VF lock and
> mark it as "stale" to prevent any driver binding and then we can
> release the VF lock. Next we would perform the PF operation telling it
> to update the VF.  Then we spin on the VF waiting for the stale data
> to be updated and once that happens we can pop the indication that the
> device is "stale" freeing it for use. 

I always get leary when people propose to open code locking constructs
:\

There is already an existing lock to prevent probe() it is the
device_lock() mutex on the VF. With no driver bound there is not much
issue to hold it over the HW activity.

This lock is normally held around the entire probe() and remove()
function which has huge amounts of HW activity already.

We don't need to invent new locks and new complexity for something
that is trivially solved already.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 19:16         ` Keith Busch
  2021-03-11 19:21           ` Leon Romanovsky
@ 2021-03-11 20:22           ` Jason Gunthorpe
  2021-03-11 20:50             ` Keith Busch
  1 sibling, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-11 20:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > 
> > > I'm not so much worried about management software as the fact that
> > > this is a vendor specific implementation detail that is shaping how
> > > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > > know if there are any other vendors really onboard with this sort of
> > > solution.
> > 
> > I know this is currently vendor-specific, but I thought the value
> > proposition of dynamic configuration of VFs for different clients
> > sounded compelling enough that other vendors would do something
> > similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> > maybe that's not the case?
> 
> NVMe has a similar feature defined by the standard where a PF controller can
> dynamically assign MSIx vectors to VFs. The whole thing is managed in user
> space with an ioctl, though. I guess we could wire up the driver to handle it
> through this sysfs interface too, but I think the protocol specific tooling is
> more appropriate for nvme.

Really? Why not share a common uAPI?

Do you have a standards reference for this?

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 18:17       ` Bjorn Helgaas
                           ` (2 preceding siblings ...)
  2021-03-11 19:37         ` Alexander Duyck
@ 2021-03-11 20:31         ` Jason Gunthorpe
  3 siblings, 0 replies; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-11 20:31 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Leon Romanovsky, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:

> > In addition it still feels rather hacky to be modifying read-only PCIe
> > configuration space on the fly via a backdoor provided by the PF. It
> > almost feels like this should be some sort of quirk rather than a
> > standard feature for an SR-IOV VF.
> 
> I agree, I'm not 100% comfortable with modifying the read-only Table
> Size register.  Maybe there's another approach that would be better?
> It *is* nice that the current approach doesn't require changes in the
> VF driver.

Right, that is the whole point here - if we wanted to change every
driver we could do that in some other way. Changing the table size
register is the only way to keep existing drivers working.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 20:22           ` Jason Gunthorpe
@ 2021-03-11 20:50             ` Keith Busch
  2021-03-11 21:44               ` Jason Gunthorpe
  0 siblings, 1 reply; 74+ messages in thread
From: Keith Busch @ 2021-03-11 20:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 04:22:34PM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> > On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > > 
> > > > I'm not so much worried about management software as the fact that
> > > > this is a vendor specific implementation detail that is shaping how
> > > > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > > > know if there are any other vendors really onboard with this sort of
> > > > solution.
> > > 
> > > I know this is currently vendor-specific, but I thought the value
> > > proposition of dynamic configuration of VFs for different clients
> > > sounded compelling enough that other vendors would do something
> > > similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> > > maybe that's not the case?
> > 
> > NVMe has a similar feature defined by the standard where a PF controller can
> > dynamically assign MSIx vectors to VFs. The whole thing is managed in user
> > space with an ioctl, though. I guess we could wire up the driver to handle it
> > through this sysfs interface too, but I think the protocol specific tooling is
> > more appropriate for nvme.
> 
> Really? Why not share a common uAPI?

We associate interrupt vectors with other dynamically assigned nvme
specific resources (IO queues), and these are not always allocated 1:1.
A common uAPI for MSIx only gets us half way to configuring the VFs for
that particular driver.
 
> Do you have a standards reference for this?

Yes, sections 5.22 and 8.5 from this spec:

  https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf

An example of open source tooling implementing this is nvme-cli's
"nvme virt-mgmt" command.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 20:50             ` Keith Busch
@ 2021-03-11 21:44               ` Jason Gunthorpe
  2021-03-25 17:21                 ` Bjorn Helgaas
  0 siblings, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-11 21:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Fri, Mar 12, 2021 at 05:50:34AM +0900, Keith Busch wrote:
> On Thu, Mar 11, 2021 at 04:22:34PM -0400, Jason Gunthorpe wrote:
> > On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> > > On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > > > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > > > 
> > > > > I'm not so much worried about management software as the fact that
> > > > > this is a vendor specific implementation detail that is shaping how
> > > > > the kernel interfaces are meant to work. Other than the mlx5 I don't
> > > > > know if there are any other vendors really onboard with this sort of
> > > > > solution.
> > > > 
> > > > I know this is currently vendor-specific, but I thought the value
> > > > proposition of dynamic configuration of VFs for different clients
> > > > sounded compelling enough that other vendors would do something
> > > > similar.  But I'm not an SR-IOV guy and have no vendor insight, so
> > > > maybe that's not the case?
> > > 
> > > NVMe has a similar feature defined by the standard where a PF controller can
> > > dynamically assign MSIx vectors to VFs. The whole thing is managed in user
> > > space with an ioctl, though. I guess we could wire up the driver to handle it
> > > through this sysfs interface too, but I think the protocol specific tooling is
> > > more appropriate for nvme.
> > 
> > Really? Why not share a common uAPI?
> 
> We associate interrupt vectors with other dynamically assigned nvme
> specific resources (IO queues), and these are not always allocated 1:1.

mlx5 is doing that too, the end driver gets to assign the MSI vector
to a CPU and then dynamically attach queues to it.

I'm not sure I get why nvme would want to link those two things as the
CPU assignment and queue attach could happen in a VM while the MSIX
should be in the host?

> A common uAPI for MSIx only gets us half way to configuring the VFs for
> that particular driver.
>
> > Do you have a standards reference for this?
> 
> Yes, sections 5.22 and 8.5 from this spec:
> 
>   https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf
> 
> An example of open source tooling implementing this is nvme-cli's
> "nvme virt-mgmt" command.

Oh it is fascinating! 8.5.2 looks like exactly the same thing being
implemented here for mlx5, including changing the "Read only" config
space value

Still confused why this shouldn't be the same API??

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 20:19           ` Jason Gunthorpe
@ 2021-03-11 21:49             ` Alexander Duyck
  2021-03-11 23:20               ` Jason Gunthorpe
  0 siblings, 1 reply; 74+ messages in thread
From: Alexander Duyck @ 2021-03-11 21:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Leon Romanovsky, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 12:19 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Mar 11, 2021 at 11:37:28AM -0800, Alexander Duyck wrote:
>
>
> > Then the flow for this could be changed where we take the VF lock and
> > mark it as "stale" to prevent any driver binding and then we can
> > release the VF lock. Next we would perform the PF operation telling it
> > to update the VF.  Then we spin on the VF waiting for the stale data
> > to be updated and once that happens we can pop the indication that the
> > device is "stale" freeing it for use.
>
> I always get leary when people propose to open code locking constructs
> :\

I'm not suggesting we replace the lock. It is more about essentially
revoking the VF. What we are doing is essentially rewriting the PCIe
config of the VF so in my mind it makes sense to take sort of an RCU
approach where the old one is readable, but not something a new driver
can be bound to.

> There is already an existing lock to prevent probe() it is the
> device_lock() mutex on the VF. With no driver bound there is not much
> issue to hold it over the HW activity.

Yes. But sitting on those locks also has side effects such as
preventing us from taking any other actions such as disabling SR-IOV.
One concern I have is that if somebody else tries to implement this in
the future and they don't have a synchronous setup, or worse yet they
do but it takes a long time to process a request because they have a
slow controller it would be preferable to just have us post the
message to the PF and then have the thread spin and wait on the VF to
be updated rather than block on the PF while sitting on two locks.

> This lock is normally held around the entire probe() and remove()
> function which has huge amounts of HW activity already.

Yes, but usually that activity is time bound. You are usually reading
values and processing them in a timely fashion. In the case of probe
we even have cases where we have to defer because we don't want to
hold these locks for too long.

> We don't need to invent new locks and new complexity for something
> that is trivially solved already.

I am not wanting a new lock. What I am wanting is a way to mark the VF
as being stale/offline while we are performing the update. With that
we would be able to apply similar logic to any changes in the future.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 21:49             ` Alexander Duyck
@ 2021-03-11 23:20               ` Jason Gunthorpe
  2021-03-12  2:53                 ` Alexander Duyck
  0 siblings, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-11 23:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Leon Romanovsky, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > We don't need to invent new locks and new complexity for something
> > that is trivially solved already.
> 
> I am not wanting a new lock. What I am wanting is a way to mark the VF
> as being stale/offline while we are performing the update. With that
> we would be able to apply similar logic to any changes in the future.

I think we should hold off doing this until someone comes up with HW
that needs it. The response time here is microseconds, it is not worth
any complexity

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 23:20               ` Jason Gunthorpe
@ 2021-03-12  2:53                 ` Alexander Duyck
  2021-03-12  6:32                   ` Leon Romanovsky
  2021-03-12 13:00                   ` Jason Gunthorpe
  0 siblings, 2 replies; 74+ messages in thread
From: Alexander Duyck @ 2021-03-12  2:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Leon Romanovsky, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > We don't need to invent new locks and new complexity for something
> > > that is trivially solved already.
> >
> > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > as being stale/offline while we are performing the update. With that
> > we would be able to apply similar logic to any changes in the future.
>
> I think we should hold off doing this until someone comes up with HW
> that needs it. The response time here is microseconds, it is not worth
> any complexity

I disagree. Take a look at section 8.5.3 in the NVMe document that was
linked to earlier:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf

This is exactly what they are doing and I think it makes a ton of
sense. Basically the VF has to be taken "offline" before you are
allowed to start changing resources on it. It would basically consist
of one extra sysfs file and has additional uses beyond just the
configuration of MSI-X vectors.

We would just have to add one additional sysfs file, maybe modify the
"dead" device flag to be "offline", and we could make this work with
minimal changes to the patch set you already have. We could probably
toggle to "offline" while holding just the VF lock. To toggle the VF
back to being "online" we might need to take the PF device lock since
it is ultimately responsible for guaranteeing we have the resources.

Another way to think of this is that we are essentially pulling a
device back after we have already allocated the VFs and we are
reconfiguring it before pushing it back out for usage. Having a flag
that we could set on the VF device to say it is "under
construction"/modification/"not ready for use" would be quite useful I
would think.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-12  2:53                 ` Alexander Duyck
@ 2021-03-12  6:32                   ` Leon Romanovsky
  2021-03-12 16:59                     ` Alexander Duyck
  2021-03-12 13:00                   ` Jason Gunthorpe
  1 sibling, 1 reply; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-12  6:32 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jason Gunthorpe, Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > We don't need to invent new locks and new complexity for something
> > > > that is trivially solved already.
> > >
> > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > as being stale/offline while we are performing the update. With that
> > > we would be able to apply similar logic to any changes in the future.
> >
> > I think we should hold off doing this until someone comes up with HW
> > that needs it. The response time here is microseconds, it is not worth
> > any complexity

<...>

> Another way to think of this is that we are essentially pulling a
> device back after we have already allocated the VFs and we are
> reconfiguring it before pushing it back out for usage. Having a flag
> that we could set on the VF device to say it is "under
> construction"/modification/"not ready for use" would be quite useful I
> would think.

It is not simple flag change, but change of PCI state machine, which is
far more complex than holding two locks or call to sysfs_create_file in
the loop that made Bjorn nervous.

I want to remind again that the suggestion here has nothing to do with
the real use case of SR-IOV capable devices in the Linux.

The flow is:
1. Disable SR-IOV driver autoprobe
2. Create as much as possible VFs
3. Wait for request from the user to get VM
4. Change MSI-X table according to requested in item #3
5. Bind ready to go VF to VM
6. Inform user about VM readiness

The destroy flow includes VM destroy and unbind.

Let's focus on solutions for real problems instead of trying to solve theoretical
cases that are not going to be tested and deployed.

Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-12  2:53                 ` Alexander Duyck
  2021-03-12  6:32                   ` Leon Romanovsky
@ 2021-03-12 13:00                   ` Jason Gunthorpe
  2021-03-12 13:36                     ` Keith Busch
  1 sibling, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-12 13:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Leon Romanovsky, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > We don't need to invent new locks and new complexity for something
> > > > that is trivially solved already.
> > >
> > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > as being stale/offline while we are performing the update. With that
> > > we would be able to apply similar logic to any changes in the future.
> >
> > I think we should hold off doing this until someone comes up with HW
> > that needs it. The response time here is microseconds, it is not worth
> > any complexity
> 
> I disagree. Take a look at section 8.5.3 in the NVMe document that was
> linked to earlier:
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf
> 
> This is exactly what they are doing and I think it makes a ton of
> sense. Basically the VF has to be taken "offline" before you are

AFAIK this is internal to the NVMe command protocol, not something we
can expose generically to the OS. mlx5 has no protocol to "offline" an
already running VF, for instance.

The way Leon has it arranged that online/offline scheme has no
relevance because there is no driver or guest attached to the VF to
see the online/offline transition.

I wonder if people actually do offline a NVMe VF from a hypervisor?
Seems pretty weird.

> Another way to think of this is that we are essentially pulling a
> device back after we have already allocated the VFs and we are
> reconfiguring it before pushing it back out for usage. Having a flag
> that we could set on the VF device to say it is "under
> construction"/modification/"not ready for use" would be quite useful I
> would think.

Well, yes, the whole SRIOV VF lifecycle is a pretty bad fit for the
modern world.

I'd rather not see a half-job on a lifecycle model by hacking in
random flags. It needs a proper comprehensive design.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-12 13:00                   ` Jason Gunthorpe
@ 2021-03-12 13:36                     ` Keith Busch
  0 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2021-03-12 13:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Duyck, Bjorn Helgaas, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Fri, Mar 12, 2021 at 09:00:17AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> > On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > > We don't need to invent new locks and new complexity for something
> > > > > that is trivially solved already.
> > > >
> > > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > > as being stale/offline while we are performing the update. With that
> > > > we would be able to apply similar logic to any changes in the future.
> > >
> > > I think we should hold off doing this until someone comes up with HW
> > > that needs it. The response time here is microseconds, it is not worth
> > > any complexity
> > 
> > I disagree. Take a look at section 8.5.3 in the NVMe document that was
> > linked to earlier:
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf
> > 
> > This is exactly what they are doing and I think it makes a ton of
> > sense. Basically the VF has to be taken "offline" before you are
> 
> AFAIK this is internal to the NVMe command protocol, not something we
> can expose generically to the OS. mlx5 has no protocol to "offline" an
> already running VF, for instance.
> 
> The way Leon has it arranged that online/offline scheme has no
> relevance because there is no driver or guest attached to the VF to
> see the online/offline transition.
> 
> I wonder if people actually do offline a NVMe VF from a hypervisor?
> Seems pretty weird.

I agree, that would be weird. I'm pretty sure you can't modify these resources
once you attach the nvme VF to a guest. The resource allocation needs to happen
prior to that.
 
> > Another way to think of this is that we are essentially pulling a
> > device back after we have already allocated the VFs and we are
> > reconfiguring it before pushing it back out for usage. Having a flag
> > that we could set on the VF device to say it is "under
> > construction"/modification/"not ready for use" would be quite useful I
> > would think.
> 
> Well, yes, the whole SRIOV VF lifecycle is a pretty bad fit for the
> modern world.
> 
> I'd rather not see a half-job on a lifecycle model by hacking in
> random flags. It needs a proper comprehensive design.
> 
> Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-12  6:32                   ` Leon Romanovsky
@ 2021-03-12 16:59                     ` Alexander Duyck
  2021-03-12 17:03                       ` Jason Gunthorpe
  2021-03-12 18:41                       ` Leon Romanovsky
  0 siblings, 2 replies; 74+ messages in thread
From: Alexander Duyck @ 2021-03-12 16:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 10:32 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> > On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > > We don't need to invent new locks and new complexity for something
> > > > > that is trivially solved already.
> > > >
> > > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > > as being stale/offline while we are performing the update. With that
> > > > we would be able to apply similar logic to any changes in the future.
> > >
> > > I think we should hold off doing this until someone comes up with HW
> > > that needs it. The response time here is microseconds, it is not worth
> > > any complexity
>
> <...>
>
> > Another way to think of this is that we are essentially pulling a
> > device back after we have already allocated the VFs and we are
> > reconfiguring it before pushing it back out for usage. Having a flag
> > that we could set on the VF device to say it is "under
> > construction"/modification/"not ready for use" would be quite useful I
> > would think.
>
> It is not simple flag change, but change of PCI state machine, which is
> far more complex than holding two locks or call to sysfs_create_file in
> the loop that made Bjorn nervous.
>
> I want to remind again that the suggestion here has nothing to do with
> the real use case of SR-IOV capable devices in the Linux.
>
> The flow is:
> 1. Disable SR-IOV driver autoprobe
> 2. Create as much as possible VFs
> 3. Wait for request from the user to get VM
> 4. Change MSI-X table according to requested in item #3
> 5. Bind ready to go VF to VM
> 6. Inform user about VM readiness
>
> The destroy flow includes VM destroy and unbind.
>
> Let's focus on solutions for real problems instead of trying to solve theoretical
> cases that are not going to be tested and deployed.
>
> Thanks

So part of the problem with this all along has been that you are only
focused on how you are going to use this and don't think about how
somebody else might need to use or implement it. In addition there are
a number of half measures even within your own flow. In reality if we
are thinking we are going to have to reconfigure every device it might
make sense to simply block the driver from being able to load until
you have configured it. Then the SR-IOV autoprobe would be redundant
since you could use something like the "offline" flag to avoid that.

If you are okay with step 1 where you are setting a flag to prevent
driver auto probing why is it so much more overhead to set a bit
blocking drivers from loading entirely while you are changing the
config space? Sitting on two locks and assuming a synchronous
operation is assuming a lot about the hardware and how this is going
to be used.

In addition it seems like the logic is that step 4 will always
succeed. What happens if for example you send the message to the
firmware and you don't get a response? Do you just say the request
failed let the VF be used anyway? This is another reason why I would
be much more comfortable with the option to offline the device and
then tinker with it rather than hope that your operation can somehow
do everything in one shot.

In my mind step 4 really should be 4 steps.

1. Offline VF to reserve it for modification
2. Submit request for modification
3. Verify modification has occurred, reset if needed.
4. Online VF

Doing it in that order allows for handling many more scenarios
including those where perhaps step 2 actually consists of several
changes to support any future extensions that are needed. Splitting
step 2 and 3 allows for an asynchronous event where you can wait if
firmware takes an excessively long time, or if step 2 somehow fails
you can then repeat or revert it to get back to a consistent state.
Lastly by splitting out the onlining step you can avoid potentially
releasing a broken VF to be reserved if there is some sort of
unrecoverable error between steps 2 and 3.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-12 16:59                     ` Alexander Duyck
@ 2021-03-12 17:03                       ` Jason Gunthorpe
  2021-03-12 18:34                         ` Leon Romanovsky
  2021-03-12 18:41                       ` Leon Romanovsky
  1 sibling, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-12 17:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Fri, Mar 12, 2021 at 08:59:38AM -0800, Alexander Duyck wrote:

> Lastly by splitting out the onlining step you can avoid potentially
> releasing a broken VF to be reserved if there is some sort of
> unrecoverable error between steps 2 and 3.

If the PF FW gets in a state that it can't respond to a trivial
command like this then *every* VF is broken. It not a useful scenario
to focus on.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-12 17:03                       ` Jason Gunthorpe
@ 2021-03-12 18:34                         ` Leon Romanovsky
  0 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-12 18:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Duyck, Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Fri, Mar 12, 2021 at 01:03:19PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 12, 2021 at 08:59:38AM -0800, Alexander Duyck wrote:
>
> > Lastly by splitting out the onlining step you can avoid potentially
> > releasing a broken VF to be reserved if there is some sort of
> > unrecoverable error between steps 2 and 3.
>
> If the PF FW gets in a state that it can't respond to a trivial
> command like this then *every* VF is broken. It not a useful scenario
> to focus on.

Right, many VF initial configurations are done through PF. It MUST work.

Thanks

>
> Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-12 16:59                     ` Alexander Duyck
  2021-03-12 17:03                       ` Jason Gunthorpe
@ 2021-03-12 18:41                       ` Leon Romanovsky
  1 sibling, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-12 18:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jason Gunthorpe, Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Fri, Mar 12, 2021 at 08:59:38AM -0800, Alexander Duyck wrote:
> On Thu, Mar 11, 2021 at 10:32 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Thu, Mar 11, 2021 at 06:53:16PM -0800, Alexander Duyck wrote:
> > > On Thu, Mar 11, 2021 at 3:21 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Thu, Mar 11, 2021 at 01:49:24PM -0800, Alexander Duyck wrote:
> > > > > > We don't need to invent new locks and new complexity for something
> > > > > > that is trivially solved already.
> > > > >
> > > > > I am not wanting a new lock. What I am wanting is a way to mark the VF
> > > > > as being stale/offline while we are performing the update. With that
> > > > > we would be able to apply similar logic to any changes in the future.
> > > >
> > > > I think we should hold off doing this until someone comes up with HW
> > > > that needs it. The response time here is microseconds, it is not worth
> > > > any complexity
> >
> > <...>
> >
> > > Another way to think of this is that we are essentially pulling a
> > > device back after we have already allocated the VFs and we are
> > > reconfiguring it before pushing it back out for usage. Having a flag
> > > that we could set on the VF device to say it is "under
> > > construction"/modification/"not ready for use" would be quite useful I
> > > would think.
> >
> > It is not simple flag change, but change of PCI state machine, which is
> > far more complex than holding two locks or call to sysfs_create_file in
> > the loop that made Bjorn nervous.
> >
> > I want to remind again that the suggestion here has nothing to do with
> > the real use case of SR-IOV capable devices in the Linux.
> >
> > The flow is:
> > 1. Disable SR-IOV driver autoprobe
> > 2. Create as much as possible VFs
> > 3. Wait for request from the user to get VM
> > 4. Change MSI-X table according to requested in item #3
> > 5. Bind ready to go VF to VM
> > 6. Inform user about VM readiness
> >
> > The destroy flow includes VM destroy and unbind.
> >
> > Let's focus on solutions for real problems instead of trying to solve theoretical
> > cases that are not going to be tested and deployed.
> >
> > Thanks
>
> So part of the problem with this all along has been that you are only
> focused on how you are going to use this and don't think about how
> somebody else might need to use or implement it. In addition there are
> a number of half measures even within your own flow. In reality if we
> are thinking we are going to have to reconfigure every device it might
> make sense to simply block the driver from being able to load until
> you have configured it. Then the SR-IOV autoprobe would be redundant
> since you could use something like the "offline" flag to avoid that.
>
> If you are okay with step 1 where you are setting a flag to prevent
> driver auto probing why is it so much more overhead to set a bit
> blocking drivers from loading entirely while you are changing the
> config space? Sitting on two locks and assuming a synchronous
> operation is assuming a lot about the hardware and how this is going
> to be used.
>
> In addition it seems like the logic is that step 4 will always
> succeed. What happens if for example you send the message to the
> firmware and you don't get a response? Do you just say the request
> failed let the VF be used anyway? This is another reason why I would
> be much more comfortable with the option to offline the device and
> then tinker with it rather than hope that your operation can somehow
> do everything in one shot.
>
> In my mind step 4 really should be 4 steps.
>
> 1. Offline VF to reserve it for modification
> 2. Submit request for modification
> 3. Verify modification has occurred, reset if needed.
> 4. Online VF
>
> Doing it in that order allows for handling many more scenarios
> including those where perhaps step 2 actually consists of several
> changes to support any future extensions that are needed. Splitting
> step 2 and 3 allows for an asynchronous event where you can wait if
> firmware takes an excessively long time, or if step 2 somehow fails
> you can then repeat or revert it to get back to a consistent state.
> Lastly by splitting out the onlining step you can avoid potentially
> releasing a broken VF to be reserved if there is some sort of
> unrecoverable error between steps 2 and 3.

In all scenarios users need to disable autoprobe and unbind drivers.
This is actually the "offline" mode. Any SR-IOV capable HW that will
need this asynchronous probe will be able to extend current mechanism.

However I don't expect to see in Foreseen future any new SR-IOV player
that will be able to provide brand new high-speed, high-performance
and customizable SR-IOV card that will need asynchronous probe.

BTW, even NVMe with their "offline" mode in their spec implemented
the flow exactly like I'm proposing here.

Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-11 21:44               ` Jason Gunthorpe
@ 2021-03-25 17:21                 ` Bjorn Helgaas
  2021-03-25 17:36                   ` Jason Gunthorpe
  0 siblings, 1 reply; 74+ messages in thread
From: Bjorn Helgaas @ 2021-03-25 17:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Keith Busch, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Thu, Mar 11, 2021 at 05:44:24PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 12, 2021 at 05:50:34AM +0900, Keith Busch wrote:
> > On Thu, Mar 11, 2021 at 04:22:34PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Mar 11, 2021 at 12:16:02PM -0700, Keith Busch wrote:
> > > > On Thu, Mar 11, 2021 at 12:17:29PM -0600, Bjorn Helgaas wrote:
> > > > > On Wed, Mar 10, 2021 at 03:34:01PM -0800, Alexander Duyck wrote:
> > > > > > 
> > > > > > I'm not so much worried about management software as the
> > > > > > fact that this is a vendor specific implementation detail
> > > > > > that is shaping how the kernel interfaces are meant to
> > > > > > work. Other than the mlx5 I don't know if there are any
> > > > > > other vendors really onboard with this sort of solution.
> > > > > 
> > > > > I know this is currently vendor-specific, but I thought the
> > > > > value proposition of dynamic configuration of VFs for
> > > > > different clients sounded compelling enough that other
> > > > > vendors would do something similar.  But I'm not an SR-IOV
> > > > > guy and have no vendor insight, so maybe that's not the
> > > > > case?
> > > > 
> > > > NVMe has a similar feature defined by the standard where a PF
> > > > controller can dynamically assign MSIx vectors to VFs. The
> > > > whole thing is managed in user space with an ioctl, though. I
> > > > guess we could wire up the driver to handle it through this
> > > > sysfs interface too, but I think the protocol specific tooling
> > > > is more appropriate for nvme.
> > > 
> > > Really? Why not share a common uAPI?
> > 
> > We associate interrupt vectors with other dynamically assigned
> > nvme specific resources (IO queues), and these are not always
> > allocated 1:1.
> 
> mlx5 is doing that too, the end driver gets to assign the MSI vector
> to a CPU and then dynamically attach queues to it.
> 
> I'm not sure I get why nvme would want to link those two things as
> the CPU assignment and queue attach could happen in a VM while the
> MSIX should be in the host?
> 
> > A common uAPI for MSIx only gets us half way to configuring the
> > VFs for that particular driver.
> >
> > > Do you have a standards reference for this?
> > 
> > Yes, sections 5.22 and 8.5 from this spec:
> > 
> >   https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4a-2020.03.09-Ratified.pdf
> > 
> > An example of open source tooling implementing this is nvme-cli's
> > "nvme virt-mgmt" command.
> 
> Oh it is fascinating! 8.5.2 looks like exactly the same thing being
> implemented here for mlx5, including changing the "Read only" config
> space value
> 
> Still confused why this shouldn't be the same API??

NVMe and mlx5 have basically identical functionality in this respect.
Other devices and vendors will likely implement similar functionality.
It would be ideal if we had an interface generic enough to support
them all.

Is the mlx5 interface proposed here sufficient to support the NVMe
model?  I think it's close, but not quite, because the the NVMe
"offline" state isn't explicitly visible in the mlx5 model.

I'd like to see an argument that nvme-cli *could* be implemented on
top of this.  nvme-cli uses an ioctl and we may not want to
reimplement it with a new interface, but if Leon's interface is
*capable* of supporting the NVMe model, it's a good clue that it may
also work for future devices.

If this isn't quite enough to support the NVMe model, what would it
take to get there?

Bjorn

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-25 17:21                 ` Bjorn Helgaas
@ 2021-03-25 17:36                   ` Jason Gunthorpe
  2021-03-25 18:20                     ` Bjorn Helgaas
  2021-03-25 18:31                     ` Keith Busch
  0 siblings, 2 replies; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 17:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:

> NVMe and mlx5 have basically identical functionality in this respect.
> Other devices and vendors will likely implement similar functionality.
> It would be ideal if we had an interface generic enough to support
> them all.
> 
> Is the mlx5 interface proposed here sufficient to support the NVMe
> model?  I think it's close, but not quite, because the the NVMe
> "offline" state isn't explicitly visible in the mlx5 model.

I thought Keith basically said "offline" wasn't really useful as a
distinct idea. It is an artifact of nvme being a standards body
divorced from the operating system.

In linux offline and no driver attached are the same thing, you'd
never want an API to make a nvme device with a driver attached offline
because it would break the driver.

So I think it is good as is (well one of the 8 versions anyhow).

Keith didn't go into detail why the queue allocations in nvme were any
different than the queue allocations in mlx5. I expect they can
probably work the same where the # of interrupts is an upper bound on
the # of CPUs that can get queues and the device, once instantiated,
could be configured for the number of queues to actually operate, if
it wants.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-25 17:36                   ` Jason Gunthorpe
@ 2021-03-25 18:20                     ` Bjorn Helgaas
  2021-03-25 18:28                       ` Jason Gunthorpe
  2021-03-25 18:31                     ` Keith Busch
  1 sibling, 1 reply; 74+ messages in thread
From: Bjorn Helgaas @ 2021-03-25 18:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Keith Busch, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> 
> > NVMe and mlx5 have basically identical functionality in this respect.
> > Other devices and vendors will likely implement similar functionality.
> > It would be ideal if we had an interface generic enough to support
> > them all.
> > 
> > Is the mlx5 interface proposed here sufficient to support the NVMe
> > model?  I think it's close, but not quite, because the the NVMe
> > "offline" state isn't explicitly visible in the mlx5 model.
> 
> I thought Keith basically said "offline" wasn't really useful as a
> distinct idea. It is an artifact of nvme being a standards body
> divorced from the operating system.
> 
> In linux offline and no driver attached are the same thing, you'd
> never want an API to make a nvme device with a driver attached offline
> because it would break the driver.

I think the sticky part is that Linux driver attach is not visible to
the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
can only assign resources to a VF when the VF is offline, and the VF
is only usable when it is online.

For NVMe, software must ask the PF to make those online/offline
transitions via Secondary Controller Offline and Secondary Controller
Online commands [1].  How would this be integrated into this sysfs
interface?

> So I think it is good as is (well one of the 8 versions anyhow).
> 
> Keith didn't go into detail why the queue allocations in nvme were any
> different than the queue allocations in mlx5. I expect they can
> probably work the same where the # of interrupts is an upper bound on
> the # of CPUs that can get queues and the device, once instantiated,
> could be configured for the number of queues to actually operate, if
> it wants.

I don't really care about the queue allocations.  I don't think we
need to solve those here; we just need to make sure that what we do
here doesn't preclude NVMe queue allocations.

Bjorn

[1] NVMe 1.4a, sec 5.22

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-25 18:20                     ` Bjorn Helgaas
@ 2021-03-25 18:28                       ` Jason Gunthorpe
  2021-03-26  6:44                         ` Leon Romanovsky
  0 siblings, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 18:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > 
> > > NVMe and mlx5 have basically identical functionality in this respect.
> > > Other devices and vendors will likely implement similar functionality.
> > > It would be ideal if we had an interface generic enough to support
> > > them all.
> > > 
> > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > model?  I think it's close, but not quite, because the the NVMe
> > > "offline" state isn't explicitly visible in the mlx5 model.
> > 
> > I thought Keith basically said "offline" wasn't really useful as a
> > distinct idea. It is an artifact of nvme being a standards body
> > divorced from the operating system.
> > 
> > In linux offline and no driver attached are the same thing, you'd
> > never want an API to make a nvme device with a driver attached offline
> > because it would break the driver.
> 
> I think the sticky part is that Linux driver attach is not visible to
> the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> can only assign resources to a VF when the VF is offline, and the VF
> is only usable when it is online.
> 
> For NVMe, software must ask the PF to make those online/offline
> transitions via Secondary Controller Offline and Secondary Controller
> Online commands [1].  How would this be integrated into this sysfs
> interface?

Either the NVMe PF driver tracks the driver attach state using a bus
notifier and mirrors it to the offline state, or it simply
offline/onlines as part of the sequence to program the MSI change.

I don't see why we need any additional modeling of this behavior. 

What would be the point of onlining a device without a driver?

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-25 17:36                   ` Jason Gunthorpe
  2021-03-25 18:20                     ` Bjorn Helgaas
@ 2021-03-25 18:31                     ` Keith Busch
  2021-03-25 18:36                       ` Jason Gunthorpe
  1 sibling, 1 reply; 74+ messages in thread
From: Keith Busch @ 2021-03-25 18:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> 
> > NVMe and mlx5 have basically identical functionality in this respect.
> > Other devices and vendors will likely implement similar functionality.
> > It would be ideal if we had an interface generic enough to support
> > them all.
> > 
> > Is the mlx5 interface proposed here sufficient to support the NVMe
> > model?  I think it's close, but not quite, because the the NVMe
> > "offline" state isn't explicitly visible in the mlx5 model.
> 
> I thought Keith basically said "offline" wasn't really useful as a
> distinct idea. It is an artifact of nvme being a standards body
> divorced from the operating system.

I think that was someone else who said that.

FWIW, the nvme "offline" state just means a driver can't use the nvme
capabilities of the device. You can bind a driver to it if you want, but
no IO will be possible, so it's fine if you bind your VF to something
like vfio prior to starting a VM, or just not have a driver bound to
anything during the intial resource assignment.
 
> In linux offline and no driver attached are the same thing, you'd
> never want an API to make a nvme device with a driver attached offline
> because it would break the driver.
> 
> So I think it is good as is (well one of the 8 versions anyhow).
> 
> Keith didn't go into detail why the queue allocations in nvme were any
> different than the queue allocations in mlx5. 

The NVMe IO queue resources are assignable just like the MSIx vectors.
But they're not always assigned 1:1. For example:

  NVMe has an admin queue that always requires an interrupt vector. Does
  the VM driver want this vector to share with the IO queues, or do we
  want a +1 vector for that queue? 

  Maybe the VM is going to use a user space polling driver, so now you
  don't even need MSIx vectors on the function assigned to that VM. You
  just need to assign the IO queue resouces, and reserve the MSIx
  resources for another function.

  The Linux nvme driver allows a mix of poll + interrupt queues, so the
  user may want to allocate more IO queues than interrupts.

A kernel interface for assigning interrupt vectors gets us only halfway
to configuring the assignable resources.

> I expect they can probably work the same where the # of interrupts is
> an upper bound on the # of CPUs that can get queues and the device,
> once instantiated, could be configured for the number of queues to
> actually operate, if it wants.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-25 18:31                     ` Keith Busch
@ 2021-03-25 18:36                       ` Jason Gunthorpe
  0 siblings, 0 replies; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-25 18:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Leon Romanovsky, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 03:31:57AM +0900, Keith Busch wrote:

> The NVMe IO queue resources are assignable just like the MSIx vectors.
> But they're not always assigned 1:1. For example:

But this is all driver configuration, the driver might be running in
some VM.

It seems saner to have two kernel interfaces, one used in the
hypervisor on the PF to setup the MSI vector count

And a second interface that becomes available after the driver is
bound to configure the driver, where-ever/how-ever that driver may be
running.

I don't know how you could combine them in one step except for the
simple case of the driver running in the hypervisor?

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-25 18:28                       ` Jason Gunthorpe
@ 2021-03-26  6:44                         ` Leon Romanovsky
  2021-03-26 16:00                           ` Alexander Duyck
  0 siblings, 1 reply; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-26  6:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Bjorn Helgaas, Keith Busch
  Cc: Alexander Duyck, Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski,
	linux-pci, linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller, Greg Kroah-Hartman

On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > 
> > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > Other devices and vendors will likely implement similar functionality.
> > > > It would be ideal if we had an interface generic enough to support
> > > > them all.
> > > > 
> > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > model?  I think it's close, but not quite, because the the NVMe
> > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > 
> > > I thought Keith basically said "offline" wasn't really useful as a
> > > distinct idea. It is an artifact of nvme being a standards body
> > > divorced from the operating system.
> > > 
> > > In linux offline and no driver attached are the same thing, you'd
> > > never want an API to make a nvme device with a driver attached offline
> > > because it would break the driver.
> > 
> > I think the sticky part is that Linux driver attach is not visible to
> > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > can only assign resources to a VF when the VF is offline, and the VF
> > is only usable when it is online.
> > 
> > For NVMe, software must ask the PF to make those online/offline
> > transitions via Secondary Controller Offline and Secondary Controller
> > Online commands [1].  How would this be integrated into this sysfs
> > interface?
> 
> Either the NVMe PF driver tracks the driver attach state using a bus
> notifier and mirrors it to the offline state, or it simply
> offline/onlines as part of the sequence to program the MSI change.
> 
> I don't see why we need any additional modeling of this behavior. 
> 
> What would be the point of onlining a device without a driver?

Agree, we should remember that we are talking about Linux kernel model
and implementation, where _no_driver_ means _offline_.

Thanks

> 
> Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26  6:44                         ` Leon Romanovsky
@ 2021-03-26 16:00                           ` Alexander Duyck
  2021-03-26 16:56                             ` Jason Gunthorpe
  2021-03-26 17:08                             ` Bjorn Helgaas
  0 siblings, 2 replies; 74+ messages in thread
From: Alexander Duyck @ 2021-03-26 16:00 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Bjorn Helgaas, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > >
> > > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > > Other devices and vendors will likely implement similar functionality.
> > > > > It would be ideal if we had an interface generic enough to support
> > > > > them all.
> > > > >
> > > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > > model?  I think it's close, but not quite, because the the NVMe
> > > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > >
> > > > I thought Keith basically said "offline" wasn't really useful as a
> > > > distinct idea. It is an artifact of nvme being a standards body
> > > > divorced from the operating system.
> > > >
> > > > In linux offline and no driver attached are the same thing, you'd
> > > > never want an API to make a nvme device with a driver attached offline
> > > > because it would break the driver.
> > >
> > > I think the sticky part is that Linux driver attach is not visible to
> > > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > > can only assign resources to a VF when the VF is offline, and the VF
> > > is only usable when it is online.
> > >
> > > For NVMe, software must ask the PF to make those online/offline
> > > transitions via Secondary Controller Offline and Secondary Controller
> > > Online commands [1].  How would this be integrated into this sysfs
> > > interface?
> >
> > Either the NVMe PF driver tracks the driver attach state using a bus
> > notifier and mirrors it to the offline state, or it simply
> > offline/onlines as part of the sequence to program the MSI change.
> >
> > I don't see why we need any additional modeling of this behavior.
> >
> > What would be the point of onlining a device without a driver?
>
> Agree, we should remember that we are talking about Linux kernel model
> and implementation, where _no_driver_ means _offline_.

The only means you have of guaranteeing the driver is "offline" is by
holding on the device lock and checking it. So it is only really
useful for one operation and then you have to release the lock. The
idea behind having an "offline" state would be to allow you to
aggregate multiple potential operations into a single change.

For example you would place the device offline, then change
interrupts, and then queues, and then you could online it again. The
kernel code could have something in place to prevent driver load on
"offline" devices. What it gives you is more of a transactional model
versus what you have right now which is more of a concurrent model.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 16:00                           ` Alexander Duyck
@ 2021-03-26 16:56                             ` Jason Gunthorpe
  2021-03-26 17:08                             ` Bjorn Helgaas
  1 sibling, 0 replies; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-26 16:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Bjorn Helgaas, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 09:00:50AM -0700, Alexander Duyck wrote:
> On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > > >
> > > > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > > > Other devices and vendors will likely implement similar functionality.
> > > > > > It would be ideal if we had an interface generic enough to support
> > > > > > them all.
> > > > > >
> > > > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > > > model?  I think it's close, but not quite, because the the NVMe
> > > > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > > >
> > > > > I thought Keith basically said "offline" wasn't really useful as a
> > > > > distinct idea. It is an artifact of nvme being a standards body
> > > > > divorced from the operating system.
> > > > >
> > > > > In linux offline and no driver attached are the same thing, you'd
> > > > > never want an API to make a nvme device with a driver attached offline
> > > > > because it would break the driver.
> > > >
> > > > I think the sticky part is that Linux driver attach is not visible to
> > > > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > > > can only assign resources to a VF when the VF is offline, and the VF
> > > > is only usable when it is online.
> > > >
> > > > For NVMe, software must ask the PF to make those online/offline
> > > > transitions via Secondary Controller Offline and Secondary Controller
> > > > Online commands [1].  How would this be integrated into this sysfs
> > > > interface?
> > >
> > > Either the NVMe PF driver tracks the driver attach state using a bus
> > > notifier and mirrors it to the offline state, or it simply
> > > offline/onlines as part of the sequence to program the MSI change.
> > >
> > > I don't see why we need any additional modeling of this behavior.
> > >
> > > What would be the point of onlining a device without a driver?
> >
> > Agree, we should remember that we are talking about Linux kernel model
> > and implementation, where _no_driver_ means _offline_.
> 
> The only means you have of guaranteeing the driver is "offline" is by
> holding on the device lock and checking it. So it is only really
> useful for one operation and then you have to release the lock. The
> idea behind having an "offline" state would be to allow you to
> aggregate multiple potential operations into a single change.

What we really want is a solution where the SRIOV device exist for the
HW but isn't registered yet as a pci_device. We have endless problems
with needing to configure SRIOV instances at the PF before they get
plugged into the kernel and the no driver autoprobe buisness is such a
hack.

But that is a huge problem and not this series.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 16:00                           ` Alexander Duyck
  2021-03-26 16:56                             ` Jason Gunthorpe
@ 2021-03-26 17:08                             ` Bjorn Helgaas
  2021-03-26 17:12                               ` Jason Gunthorpe
                                                 ` (2 more replies)
  1 sibling, 3 replies; 74+ messages in thread
From: Bjorn Helgaas @ 2021-03-26 17:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Jason Gunthorpe, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 09:00:50AM -0700, Alexander Duyck wrote:
> On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky <leon@kernel.org> wrote:
> > On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > > >
> > > > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > > > Other devices and vendors will likely implement similar functionality.
> > > > > > It would be ideal if we had an interface generic enough to support
> > > > > > them all.
> > > > > >
> > > > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > > > model?  I think it's close, but not quite, because the the NVMe
> > > > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > > >
> > > > > I thought Keith basically said "offline" wasn't really useful as a
> > > > > distinct idea. It is an artifact of nvme being a standards body
> > > > > divorced from the operating system.
> > > > >
> > > > > In linux offline and no driver attached are the same thing, you'd
> > > > > never want an API to make a nvme device with a driver attached offline
> > > > > because it would break the driver.
> > > >
> > > > I think the sticky part is that Linux driver attach is not visible to
> > > > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > > > can only assign resources to a VF when the VF is offline, and the VF
> > > > is only usable when it is online.
> > > >
> > > > For NVMe, software must ask the PF to make those online/offline
> > > > transitions via Secondary Controller Offline and Secondary Controller
> > > > Online commands [1].  How would this be integrated into this sysfs
> > > > interface?
> > >
> > > Either the NVMe PF driver tracks the driver attach state using a bus
> > > notifier and mirrors it to the offline state, or it simply
> > > offline/onlines as part of the sequence to program the MSI change.
> > >
> > > I don't see why we need any additional modeling of this behavior.
> > >
> > > What would be the point of onlining a device without a driver?
> >
> > Agree, we should remember that we are talking about Linux kernel model
> > and implementation, where _no_driver_ means _offline_.
> 
> The only means you have of guaranteeing the driver is "offline" is by
> holding on the device lock and checking it. So it is only really
> useful for one operation and then you have to release the lock. The
> idea behind having an "offline" state would be to allow you to
> aggregate multiple potential operations into a single change.
> 
> For example you would place the device offline, then change
> interrupts, and then queues, and then you could online it again. The
> kernel code could have something in place to prevent driver load on
> "offline" devices. What it gives you is more of a transactional model
> versus what you have right now which is more of a concurrent model.

Thanks, Alex.  Leon currently does enforce the "offline" situation by
holding the VF device lock while checking that it has no driver and
asking the PF to do the assignment.  I agree this is only useful for a
single operation.  Would the current series *prevent* a transactional
model from being added later if it turns out to be useful?  I think I
can imagine keeping the same sysfs files but changing the
implementation to check for the VF being offline, while adding
something new to control online/offline.

I also want to resurrect your idea of associating
"sriov_vf_msix_count" with the PF instead of the VF.  I really like
that idea, and it better reflects the way both mlx5 and NVMe work.  I
don't think there was a major objection to it, but the discussion
seems to have petered out after your suggestion of putting the PCI
bus/device/funcion in the filename, which I also like [1].

Leon has implemented a ton of variations, but I don't think having all
the files in the PF directory was one of them.

Bjorn

[1] https://lore.kernel.org/r/CAKgT0Ue363fZEwqGUa1UAAYotUYH8QpEADW1U5yfNS7XkOLx0Q@mail.gmail.com

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 17:08                             ` Bjorn Helgaas
@ 2021-03-26 17:12                               ` Jason Gunthorpe
  2021-03-27  6:00                                 ` Leon Romanovsky
  2021-03-26 17:29                               ` Keith Busch
  2021-03-26 18:50                               ` Alexander Duyck
  2 siblings, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-26 17:12 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Leon Romanovsky, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:

> Leon has implemented a ton of variations, but I don't think having all
> the files in the PF directory was one of them.

If you promise this is the last of this bike painting adventure then
let's do it.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 17:08                             ` Bjorn Helgaas
  2021-03-26 17:12                               ` Jason Gunthorpe
@ 2021-03-26 17:29                               ` Keith Busch
  2021-03-26 17:31                                 ` Jason Gunthorpe
  2021-03-26 18:50                               ` Alexander Duyck
  2 siblings, 1 reply; 74+ messages in thread
From: Keith Busch @ 2021-03-26 17:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Leon Romanovsky, Jason Gunthorpe, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:
> I also want to resurrect your idea of associating
> "sriov_vf_msix_count" with the PF instead of the VF.  I really like
> that idea, and it better reflects the way both mlx5 and NVMe work.

That is a better match for nvme: we can assign resources to VFs with
the PF's "VF Enable" set to '0', so configuring VFs without requiring
them be enumerated in sysfs is a plus. The VFs would also implicitly be
in the "offline" state for that condition and able to accept these
requests.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 17:29                               ` Keith Busch
@ 2021-03-26 17:31                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-26 17:31 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Alexander Duyck, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Sat, Mar 27, 2021 at 02:29:00AM +0900, Keith Busch wrote:
> On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:
> > I also want to resurrect your idea of associating
> > "sriov_vf_msix_count" with the PF instead of the VF.  I really like
> > that idea, and it better reflects the way both mlx5 and NVMe work.
> 
> That is a better match for nvme: we can assign resources to VFs with
> the PF's "VF Enable" set to '0', so configuring VFs without requiring
> them be enumerated in sysfs is a plus. 

If the VF is not in sysfs already in the normal place, why would it be
in the special configuration directory? Do you want the driver to
somehow provide the configuration directory content?

I'm confused what you mean

As I said to Alex configuring things before they even get plugged in
sounds like the right direction..

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 17:08                             ` Bjorn Helgaas
  2021-03-26 17:12                               ` Jason Gunthorpe
  2021-03-26 17:29                               ` Keith Busch
@ 2021-03-26 18:50                               ` Alexander Duyck
  2021-03-26 19:01                                 ` Jason Gunthorpe
  2021-03-26 19:36                                 ` Bjorn Helgaas
  2 siblings, 2 replies; 74+ messages in thread
From: Alexander Duyck @ 2021-03-26 18:50 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Leon Romanovsky, Jason Gunthorpe, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 10:08 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Mar 26, 2021 at 09:00:50AM -0700, Alexander Duyck wrote:
> > On Thu, Mar 25, 2021 at 11:44 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > On Thu, Mar 25, 2021 at 03:28:36PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Mar 25, 2021 at 01:20:21PM -0500, Bjorn Helgaas wrote:
> > > > > On Thu, Mar 25, 2021 at 02:36:46PM -0300, Jason Gunthorpe wrote:
> > > > > > On Thu, Mar 25, 2021 at 12:21:44PM -0500, Bjorn Helgaas wrote:
> > > > > >
> > > > > > > NVMe and mlx5 have basically identical functionality in this respect.
> > > > > > > Other devices and vendors will likely implement similar functionality.
> > > > > > > It would be ideal if we had an interface generic enough to support
> > > > > > > them all.
> > > > > > >
> > > > > > > Is the mlx5 interface proposed here sufficient to support the NVMe
> > > > > > > model?  I think it's close, but not quite, because the the NVMe
> > > > > > > "offline" state isn't explicitly visible in the mlx5 model.
> > > > > >
> > > > > > I thought Keith basically said "offline" wasn't really useful as a
> > > > > > distinct idea. It is an artifact of nvme being a standards body
> > > > > > divorced from the operating system.
> > > > > >
> > > > > > In linux offline and no driver attached are the same thing, you'd
> > > > > > never want an API to make a nvme device with a driver attached offline
> > > > > > because it would break the driver.
> > > > >
> > > > > I think the sticky part is that Linux driver attach is not visible to
> > > > > the hardware device, while the NVMe "offline" state *is*.  An NVMe PF
> > > > > can only assign resources to a VF when the VF is offline, and the VF
> > > > > is only usable when it is online.
> > > > >
> > > > > For NVMe, software must ask the PF to make those online/offline
> > > > > transitions via Secondary Controller Offline and Secondary Controller
> > > > > Online commands [1].  How would this be integrated into this sysfs
> > > > > interface?
> > > >
> > > > Either the NVMe PF driver tracks the driver attach state using a bus
> > > > notifier and mirrors it to the offline state, or it simply
> > > > offline/onlines as part of the sequence to program the MSI change.
> > > >
> > > > I don't see why we need any additional modeling of this behavior.
> > > >
> > > > What would be the point of onlining a device without a driver?
> > >
> > > Agree, we should remember that we are talking about Linux kernel model
> > > and implementation, where _no_driver_ means _offline_.
> >
> > The only means you have of guaranteeing the driver is "offline" is by
> > holding on the device lock and checking it. So it is only really
> > useful for one operation and then you have to release the lock. The
> > idea behind having an "offline" state would be to allow you to
> > aggregate multiple potential operations into a single change.
> >
> > For example you would place the device offline, then change
> > interrupts, and then queues, and then you could online it again. The
> > kernel code could have something in place to prevent driver load on
> > "offline" devices. What it gives you is more of a transactional model
> > versus what you have right now which is more of a concurrent model.
>
> Thanks, Alex.  Leon currently does enforce the "offline" situation by
> holding the VF device lock while checking that it has no driver and
> asking the PF to do the assignment.  I agree this is only useful for a
> single operation.  Would the current series *prevent* a transactional
> model from being added later if it turns out to be useful?  I think I
> can imagine keeping the same sysfs files but changing the
> implementation to check for the VF being offline, while adding
> something new to control online/offline.

My concern would be that we are defining the user space interface.
Once we have this working as a single operation I could see us having
to support it that way going forward as somebody will script something
not expecting an "offline" sysfs file, and the complaint would be that
we are breaking userspace if we require the use of an "offline" file.
So my preference would be to just do it that way now rather than wait
as the behavior will be grandfathered in once we allow the operation
without it.

> I also want to resurrect your idea of associating
> "sriov_vf_msix_count" with the PF instead of the VF.  I really like
> that idea, and it better reflects the way both mlx5 and NVMe work.  I
> don't think there was a major objection to it, but the discussion
> seems to have petered out after your suggestion of putting the PCI
> bus/device/funcion in the filename, which I also like [1].
>
> Leon has implemented a ton of variations, but I don't think having all
> the files in the PF directory was one of them.
>
> Bjorn
>
> [1] https://lore.kernel.org/r/CAKgT0Ue363fZEwqGUa1UAAYotUYH8QpEADW1U5yfNS7XkOLx0Q@mail.gmail.com

I almost wonder if it wouldn't make sense to just partition this up to
handle flexible resources in the future. Maybe something like having
the directory setup such that you have "sriov_resources/msix/" and
then you could have individual files with one for the total and the
rest with the VF BDF naming scheme. Then if we have to, we could add
other subdirectories in the future to handle things like queues in the
future.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 18:50                               ` Alexander Duyck
@ 2021-03-26 19:01                                 ` Jason Gunthorpe
  2021-03-30  1:29                                   ` Bjorn Helgaas
  2021-03-26 19:36                                 ` Bjorn Helgaas
  1 sibling, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-26 19:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Leon Romanovsky, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:

> My concern would be that we are defining the user space interface.
> Once we have this working as a single operation I could see us having
> to support it that way going forward as somebody will script something
> not expecting an "offline" sysfs file, and the complaint would be that
> we are breaking userspace if we require the use of an "offline"
> file.

Well, we wouldn't do that. The semantic we define here is that the
msix_count interface 'auto-offlines' if that is what is required. If
we add some formal offline someday then 'auto-offline' would be a NOP
when the device is offline and do the same online/offline sequence as
today if it isn't.

> I almost wonder if it wouldn't make sense to just partition this up to
> handle flexible resources in the future. Maybe something like having
> the directory setup such that you have "sriov_resources/msix/" and

This is supposed to be about PCI properties, that is why we are doing
it in the PCI layer.

If you want to see something that handles non-PCI properties too then
Leon needs to make the whole thing general so the device driver can
give a list of properties it wants to configure and the core manages
the thing.

But at that point, why involve the PCI core in the first place? Just
put the complex configuration in the driver, use configfs or devlink
or nvmecli or whatever is appropriate.

And we are doing that too, there will also be pre-driver configuration
in devlink for *non PCI* properties. *shrug*

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 18:50                               ` Alexander Duyck
  2021-03-26 19:01                                 ` Jason Gunthorpe
@ 2021-03-26 19:36                                 ` Bjorn Helgaas
  2021-03-27 12:38                                   ` Greg Kroah-Hartman
  1 sibling, 1 reply; 74+ messages in thread
From: Bjorn Helgaas @ 2021-03-26 19:36 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Leon Romanovsky, Jason Gunthorpe, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:

> I almost wonder if it wouldn't make sense to just partition this up to
> handle flexible resources in the future. Maybe something like having
> the directory setup such that you have "sriov_resources/msix/" and
> then you could have individual files with one for the total and the
> rest with the VF BDF naming scheme. Then if we have to, we could add
> other subdirectories in the future to handle things like queues in the
> future.

Subdirectories would be nice, but Greg KH said earlier in a different
context that there's an issue with them [1].  He went on to say tools
like udev would miss uevents for the subdirs [2].

I don't know whether that's really a problem in this case -- it
doesn't seem like we would care about uevents for files that do MSI-X
vector assignment.

[1] https://lore.kernel.org/linux-pci/20191121211017.GA854512@kroah.com/
[2] https://lore.kernel.org/linux-pci/20191124170207.GA2267252@kroah.com/

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 17:12                               ` Jason Gunthorpe
@ 2021-03-27  6:00                                 ` Leon Romanovsky
  0 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-27  6:00 UTC (permalink / raw)
  To: Jason Gunthorpe, Bjorn Helgaas
  Cc: Alexander Duyck, Keith Busch, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 02:12:09PM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 12:08:31PM -0500, Bjorn Helgaas wrote:
> 
> > Leon has implemented a ton of variations, but I don't think having all
> > the files in the PF directory was one of them.
> 
> If you promise this is the last of this bike painting adventure then
> let's do it.

So Bjorn, can you promise?

Thanks

> 
> Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 19:36                                 ` Bjorn Helgaas
@ 2021-03-27 12:38                                   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 74+ messages in thread
From: Greg Kroah-Hartman @ 2021-03-27 12:38 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Leon Romanovsky, Jason Gunthorpe, Keith Busch,
	Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller

On Fri, Mar 26, 2021 at 02:36:31PM -0500, Bjorn Helgaas wrote:
> On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:
> 
> > I almost wonder if it wouldn't make sense to just partition this up to
> > handle flexible resources in the future. Maybe something like having
> > the directory setup such that you have "sriov_resources/msix/" and
> > then you could have individual files with one for the total and the
> > rest with the VF BDF naming scheme. Then if we have to, we could add
> > other subdirectories in the future to handle things like queues in the
> > future.
> 
> Subdirectories would be nice, but Greg KH said earlier in a different
> context that there's an issue with them [1].  He went on to say tools
> like udev would miss uevents for the subdirs [2].
> 
> I don't know whether that's really a problem in this case -- it
> doesn't seem like we would care about uevents for files that do MSI-X
> vector assignment.
> 
> [1] https://lore.kernel.org/linux-pci/20191121211017.GA854512@kroah.com/
> [2] https://lore.kernel.org/linux-pci/20191124170207.GA2267252@kroah.com/

You can only go "one level deep" on subdirectories tied to a 'struct
device' and have userspace tools know they are still there.  You can do
that by giving an attribute group a "name" for the directory.

Anything more than that just gets very very messy very quickly and I do
not recommend doing that at all.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-26 19:01                                 ` Jason Gunthorpe
@ 2021-03-30  1:29                                   ` Bjorn Helgaas
  2021-03-30 13:57                                     ` Jason Gunthorpe
  2021-03-30 18:10                                     ` Keith Busch
  0 siblings, 2 replies; 74+ messages in thread
From: Bjorn Helgaas @ 2021-03-30  1:29 UTC (permalink / raw)
  To: Alexander Duyck, Keith Busch
  Cc: Jason Gunthorpe, Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed,
	Jakub Kicinski, linux-pci, linux-rdma, Netdev, Don Dutile,
	Alex Williamson, David S . Miller, Greg Kroah-Hartman

On Fri, Mar 26, 2021 at 04:01:48PM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:
> 
> > My concern would be that we are defining the user space interface.
> > Once we have this working as a single operation I could see us having
> > to support it that way going forward as somebody will script something
> > not expecting an "offline" sysfs file, and the complaint would be that
> > we are breaking userspace if we require the use of an "offline"
> > file.
> 
> Well, we wouldn't do that. The semantic we define here is that the
> msix_count interface 'auto-offlines' if that is what is required. If
> we add some formal offline someday then 'auto-offline' would be a NOP
> when the device is offline and do the same online/offline sequence as
> today if it isn't.

Alexander, Keith, any more thoughts on this?

I think I misunderstood Greg's subdirectory comment.  We already have
directories like this:

  /sys/bus/pci/devices/0000:01:00.0/link/
  /sys/bus/pci/devices/0000:01:00.0/msi_irqs/
  /sys/bus/pci/devices/0000:01:00.0/power/

and aspm_ctrl_attr_group (for "link") is nicely done with static
attributes.  So I think we could do something like this:

  /sys/bus/pci/devices/0000:01:00.0/   # PF directory
    sriov/                             # SR-IOV related stuff
      vf_total_msix
      vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
      ...
      vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF

And I think this could support the mlx5 model as well as the NVMe
model.

For NVMe, a write to vf_msix_count_* would have to auto-offline the VF
before asking the PF to assign the vectors, as Jason suggests above.
Before VF Enable is set, the vf_msix_count_* files wouldn't exist and
we wouldn't be able to assign vectors to VFs; IIUC that's a difference
from the NVMe interface, but maybe not a terrible one?

I'm not proposing changing nvme-cli to use this, but if the interface
is general enough to support both, that would be a good clue that it
might be able to support future devices with similar functionality.

Bjorn

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-30  1:29                                   ` Bjorn Helgaas
@ 2021-03-30 13:57                                     ` Jason Gunthorpe
  2021-03-30 15:00                                       ` Bjorn Helgaas
  2021-03-30 18:10                                     ` Keith Busch
  1 sibling, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-30 13:57 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Keith Busch, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:

> I think I misunderstood Greg's subdirectory comment.  We already have
> directories like this:

Yes, IIRC, Greg's remark applies if you have to start creating
directories with manual kobjects.

> and aspm_ctrl_attr_group (for "link") is nicely done with static
> attributes.  So I think we could do something like this:
> 
>   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
>     sriov/                             # SR-IOV related stuff
>       vf_total_msix
>       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
>       ...
>       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF

It looks a bit odd that it isn't a subdirectory, but this seems
reasonable.

> For NVMe, a write to vf_msix_count_* would have to auto-offline the VF
> before asking the PF to assign the vectors, as Jason suggests above.

It is also not awful if it returns EBUSY if the admin hasn't done
some device-specific offline sequence.

I'm just worried adding the idea of offline here is going to open a
huge can of worms in terms of defining what it means, and the very
next ask will be to start all VFs in offline mode. This would be some
weird overlap with the no-driver-autoprobing sysfs. We've been
thinking about this alot here and there are not easy answers.

mlx5 sort of has an offline concept too, but we have been modeling it
in devlink, which is kind of like nvme-cli for networking.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-30 13:57                                     ` Jason Gunthorpe
@ 2021-03-30 15:00                                       ` Bjorn Helgaas
  2021-03-30 19:47                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 74+ messages in thread
From: Bjorn Helgaas @ 2021-03-30 15:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Duyck, Keith Busch, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> 
> > I think I misunderstood Greg's subdirectory comment.  We already have
> > directories like this:
> 
> Yes, IIRC, Greg's remark applies if you have to start creating
> directories with manual kobjects.
> 
> > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > attributes.  So I think we could do something like this:
> > 
> >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> >     sriov/                             # SR-IOV related stuff
> >       vf_total_msix
> >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> >       ...
> >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> 
> It looks a bit odd that it isn't a subdirectory, but this seems
> reasonable.

Sorry, I missed your point; you'll have to lay it out more explicitly.
I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
directory.  The full paths would be:

  /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
  /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
  ...

> > For NVMe, a write to vf_msix_count_* would have to auto-offline the VF
> > before asking the PF to assign the vectors, as Jason suggests above.
> 
> It is also not awful if it returns EBUSY if the admin hasn't done
> some device-specific offline sequence.

Agreed.  The concept of "offline" is not visible in this interface.

> I'm just worried adding the idea of offline here is going to open a
> huge can of worms in terms of defining what it means, and the very
> next ask will be to start all VFs in offline mode. This would be some
> weird overlap with the no-driver-autoprobing sysfs. We've been
> thinking about this alot here and there are not easy answers.

We haven't added any idea of offline in the sysfs interface.  I'm
only trying to figure out whether it would be possible to use this
interface on top of devices with an offline concept, e.g., NVMe.

> mlx5 sort of has an offline concept too, but we have been modeling it
> in devlink, which is kind of like nvme-cli for networking.
> 
> Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-30  1:29                                   ` Bjorn Helgaas
  2021-03-30 13:57                                     ` Jason Gunthorpe
@ 2021-03-30 18:10                                     ` Keith Busch
  1 sibling, 0 replies; 74+ messages in thread
From: Keith Busch @ 2021-03-30 18:10 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Jason Gunthorpe, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> On Fri, Mar 26, 2021 at 04:01:48PM -0300, Jason Gunthorpe wrote:
> > On Fri, Mar 26, 2021 at 11:50:44AM -0700, Alexander Duyck wrote:
> > 
> > > My concern would be that we are defining the user space interface.
> > > Once we have this working as a single operation I could see us having
> > > to support it that way going forward as somebody will script something
> > > not expecting an "offline" sysfs file, and the complaint would be that
> > > we are breaking userspace if we require the use of an "offline"
> > > file.
> > 
> > Well, we wouldn't do that. The semantic we define here is that the
> > msix_count interface 'auto-offlines' if that is what is required. If
> > we add some formal offline someday then 'auto-offline' would be a NOP
> > when the device is offline and do the same online/offline sequence as
> > today if it isn't.
> 
> Alexander, Keith, any more thoughts on this?
> 
> I think I misunderstood Greg's subdirectory comment.  We already have
> directories like this:
> 
>   /sys/bus/pci/devices/0000:01:00.0/link/
>   /sys/bus/pci/devices/0000:01:00.0/msi_irqs/
>   /sys/bus/pci/devices/0000:01:00.0/power/
> 
> and aspm_ctrl_attr_group (for "link") is nicely done with static
> attributes.  So I think we could do something like this:
> 
>   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
>     sriov/                             # SR-IOV related stuff
>       vf_total_msix
>       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
>       ...
>       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> 
> And I think this could support the mlx5 model as well as the NVMe
> model.
> 
> For NVMe, a write to vf_msix_count_* would have to auto-offline the VF
> before asking the PF to assign the vectors, as Jason suggests above.
> Before VF Enable is set, the vf_msix_count_* files wouldn't exist and
> we wouldn't be able to assign vectors to VFs; IIUC that's a difference
> from the NVMe interface, but maybe not a terrible one?

Yes, that's fine, nvme can handle this flow. It is a little easier to
avoid nvme user error if we could mainpulate the counts prior to VF Enable,
but it's really not a problem this way either.

I think it's reasonable for nvme to subscribe to this interface, but I
will have to defer to someone with capable nvme devices to implement it.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-30 15:00                                       ` Bjorn Helgaas
@ 2021-03-30 19:47                                         ` Jason Gunthorpe
  2021-03-30 20:41                                           ` Bjorn Helgaas
  0 siblings, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-30 19:47 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Keith Busch, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > 
> > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > directories like this:
> > 
> > Yes, IIRC, Greg's remark applies if you have to start creating
> > directories with manual kobjects.
> > 
> > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > attributes.  So I think we could do something like this:
> > > 
> > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > >     sriov/                             # SR-IOV related stuff
> > >       vf_total_msix
> > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > >       ...
> > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > 
> > It looks a bit odd that it isn't a subdirectory, but this seems
> > reasonable.
> 
> Sorry, I missed your point; you'll have to lay it out more explicitly.
> I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> directory.  The full paths would be:
>
>   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
>   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
>   ...

Sorry, I was meaning what you first proposed:

   /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count

Which has the extra sub directory to organize the child VFs.

Keep in mind there is going to be alot of VFs here, > 1k - so this
will be a huge directory.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-30 19:47                                         ` Jason Gunthorpe
@ 2021-03-30 20:41                                           ` Bjorn Helgaas
  2021-03-30 22:43                                             ` Jason Gunthorpe
  2021-03-31  4:08                                             ` Leon Romanovsky
  0 siblings, 2 replies; 74+ messages in thread
From: Bjorn Helgaas @ 2021-03-30 20:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexander Duyck, Keith Busch, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > 
> > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > directories like this:
> > > 
> > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > directories with manual kobjects.
> > > 
> > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > attributes.  So I think we could do something like this:
> > > > 
> > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > >     sriov/                             # SR-IOV related stuff
> > > >       vf_total_msix
> > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > >       ...
> > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > 
> > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > reasonable.
> > 
> > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > directory.  The full paths would be:
> >
> >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> >   ...
> 
> Sorry, I was meaning what you first proposed:
> 
>    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> 
> Which has the extra sub directory to organize the child VFs.
> 
> Keep in mind there is going to be alot of VFs here, > 1k - so this
> will be a huge directory.

With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
1 + 1K files ("vf_total_msix" + 1 per VF).

With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
1 file and 1K subdirectories.

No real difference now, but if we add more files per VF, a BB:DD.F/
subdirectory would certainly be nicer.

I'm dense and don't fully understand Greg's subdirectory comment.

The VF will have its own "pci/devices/DDDD:BB:DD.F/" directory, so
adding sriov/BB:DD.F/ under the PF shouldn't affect any udev events or
rules for the VF.

I see "ATTR{power/control}" in lots of udev rules, so I guess udev
could manage a single subdirectory like "ATTR{sriov/vf_total_msix}".
I doubt it could do "ATTR{sriov/adm/vf_total_msix}" (with another
level) or "ATTR{sriov/BBB:DD.F/vf_msix_count}" (with variable VF text
in the path).

But it doesn't seem like that level of control would be in a udev rule
anyway.  A PF udev rule might *start* a program to manage MSI-X
vectors, but such a program should be able to deal with whatever
directory structure we want.

If my uninformed udev speculation makes sense *and* we think there
will be more per-VF files later, I think I'm OK either way.

Bjorn

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-30 20:41                                           ` Bjorn Helgaas
@ 2021-03-30 22:43                                             ` Jason Gunthorpe
  2021-03-31  6:38                                               ` Greg Kroah-Hartman
  2021-03-31  4:08                                             ` Leon Romanovsky
  1 sibling, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-30 22:43 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Keith Busch, Leon Romanovsky, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:
> On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > > 
> > > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > > directories like this:
> > > > 
> > > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > > directories with manual kobjects.
> > > > 
> > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > > attributes.  So I think we could do something like this:
> > > > > 
> > > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > > >     sriov/                             # SR-IOV related stuff
> > > > >       vf_total_msix
> > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > > >       ...
> > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > > 
> > > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > > reasonable.
> > > 
> > > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > > directory.  The full paths would be:
> > >
> > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> > >   ...
> > 
> > Sorry, I was meaning what you first proposed:
> > 
> >    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > 
> > Which has the extra sub directory to organize the child VFs.
> > 
> > Keep in mind there is going to be alot of VFs here, > 1k - so this
> > will be a huge directory.
> 
> With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
> 1 + 1K files ("vf_total_msix" + 1 per VF).
> 
> With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> 1 file and 1K subdirectories.

The smallest directory sizes is with the current patch since it
re-uses the existing VF directory. Do we care about directory size at
the sysfs level?
 
> No real difference now, but if we add more files per VF, a BB:DD.F/
> subdirectory would certainly be nicer.

Hard to know if that will happen, there is a lot of 'pre-driver'
configuration but it seems to mostly be living in other places. 

If this is restricted to be only the generic PCI attributes (and I
think it should be) I'm having a hard time coming up with a future
extension.

> I'm dense and don't fully understand Greg's subdirectory comment.

I also don't know udev well enough. I've certainly seen drivers
creating extra subdirectories using kobjects.

> But it doesn't seem like that level of control would be in a udev rule
> anyway.  A PF udev rule might *start* a program to manage MSI-X
> vectors, but such a program should be able to deal with whatever
> directory structure we want.

Yes, I can't really see this being used from udev either. 

I assume there is also the usual race about triggering the uevent
before the subdirectories are created, but we have the
dev_set_uevent_suppress() thing now for that..

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-30 20:41                                           ` Bjorn Helgaas
  2021-03-30 22:43                                             ` Jason Gunthorpe
@ 2021-03-31  4:08                                             ` Leon Romanovsky
  2021-04-01  1:23                                               ` Bjorn Helgaas
  1 sibling, 1 reply; 74+ messages in thread
From: Leon Romanovsky @ 2021-03-31  4:08 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jason Gunthorpe, Alexander Duyck, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman

On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:
> On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > > 
> > > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > > directories like this:
> > > > 
> > > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > > directories with manual kobjects.
> > > > 
> > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > > attributes.  So I think we could do something like this:
> > > > > 
> > > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > > >     sriov/                             # SR-IOV related stuff
> > > > >       vf_total_msix
> > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > > >       ...
> > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > > 
> > > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > > reasonable.
> > > 
> > > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > > directory.  The full paths would be:
> > >
> > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> > >   ...
> > 
> > Sorry, I was meaning what you first proposed:
> > 
> >    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > 
> > Which has the extra sub directory to organize the child VFs.
> > 
> > Keep in mind there is going to be alot of VFs here, > 1k - so this
> > will be a huge directory.
> 
> With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
> 1 + 1K files ("vf_total_msix" + 1 per VF).
> 
> With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> 1 file and 1K subdirectories.

This is racy by design, in order to add new file and create BB:DD.F
directory, the VF will need to do it after or during it's creation.
During PF creation it is unknown to PF those BB:DD.F values.

The race here is due to the events of PF,VF directory already sent but
new directory structure is not ready yet.

From code perspective, we will need to add something similar to pci_iov_sysfs_link()
with the code that you didn't like in previous variants (the one that messes with
sysfs_create_file API).

It looks not good for large SR-IOV systems with >1K VFs with gazillion
subdirectories inside PF, while the more natural is to see them in VF.

So I'm completely puzzled why you want to do these files on PF and not
on VF as v0, v7 and v8 proposed.

Thanks

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-30 22:43                                             ` Jason Gunthorpe
@ 2021-03-31  6:38                                               ` Greg Kroah-Hartman
  2021-03-31 12:19                                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 74+ messages in thread
From: Greg Kroah-Hartman @ 2021-03-31  6:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, Alexander Duyck, Keith Busch, Leon Romanovsky,
	Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller

On Tue, Mar 30, 2021 at 07:43:41PM -0300, Jason Gunthorpe wrote:
> > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > 1 file and 1K subdirectories.
> 
> The smallest directory sizes is with the current patch since it
> re-uses the existing VF directory. Do we care about directory size at
> the sysfs level?

No, that should not matter.

The "issue" here is that you "broke" the device chain here by adding a
random kobject to the directory tree: "BB:DD.F"

Again, devices are allowed to have attributes associated with it to be
_ONE_ subdirectory level deep.

So, to use your path above, this is allowed:
	0000:01:00.0/sriov/vf_msix_count

as these are sriov attributes for the 0000:01:00.0 device, but this is
not:
	0000:01:00.0/sriov/BB:DD.F/vf_msix_count
as you "threw" a random kobject called BB:DD.F into the middle.

If you want to have "BB:DD.F" in there, then it needs to be a real
struct device and _THEN_ it needs to point its parent to "0000:01:00.0",
another struct device, as "sriov" is NOT ANYTHING in the heirachy here
at all.

Does that help?  The rules are:
	- Once you use a 'struct device', all subdirs below that device
	  are either an attribute group for that device or a child
	  device.
	- A struct device can NOT have an attribute group as a parent,
	  it can ONLY have another struct device as a parent.

If you break those rules, the kernel has the ability to get really
confused unless you are very careful, and userspace will be totally lost
as you can not do anything special there.

> > I'm dense and don't fully understand Greg's subdirectory comment.
> 
> I also don't know udev well enough. I've certainly seen drivers
> creating extra subdirectories using kobjects.

And those drivers are broken.  Please point them out to me and I will be
glad to go fix them.  Or tell their authors why they are broken :)

> > But it doesn't seem like that level of control would be in a udev rule
> > anyway.  A PF udev rule might *start* a program to manage MSI-X
> > vectors, but such a program should be able to deal with whatever
> > directory structure we want.
> 
> Yes, I can't really see this being used from udev either. 

It doesn't matter if you think it could be used, it _will_ be used as
you are exposing this stuff to userspace.

> I assume there is also the usual race about triggering the uevent
> before the subdirectories are created, but we have the
> dev_set_uevent_suppress() thing now for that..

Unless you are "pci bus code" you shouldn't be using that :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-31  6:38                                               ` Greg Kroah-Hartman
@ 2021-03-31 12:19                                                 ` Jason Gunthorpe
  2021-03-31 15:03                                                   ` Greg Kroah-Hartman
  0 siblings, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-31 12:19 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Bjorn Helgaas, Alexander Duyck, Keith Busch, Leon Romanovsky,
	Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller

On Wed, Mar 31, 2021 at 08:38:39AM +0200, Greg Kroah-Hartman wrote:
> On Tue, Mar 30, 2021 at 07:43:41PM -0300, Jason Gunthorpe wrote:
> > > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > > 1 file and 1K subdirectories.
> > 
> > The smallest directory sizes is with the current patch since it
> > re-uses the existing VF directory. Do we care about directory size at
> > the sysfs level?
> 
> No, that should not matter.
> 
> The "issue" here is that you "broke" the device chain here by adding a
> random kobject to the directory tree: "BB:DD.F"
> 
> Again, devices are allowed to have attributes associated with it to be
> _ONE_ subdirectory level deep.
> 
> So, to use your path above, this is allowed:
> 	0000:01:00.0/sriov/vf_msix_count
> 
> as these are sriov attributes for the 0000:01:00.0 device, but this is
> not:
> 	0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> as you "threw" a random kobject called BB:DD.F into the middle.
>
> If you want to have "BB:DD.F" in there, then it needs to be a real
> struct device and _THEN_ it needs to point its parent to "0000:01:00.0",
> another struct device, as "sriov" is NOT ANYTHING in the heirachy here
> at all.

It isn't a struct device object at all though, it just organizing
attributes.

> Does that help?  The rules are:
> 	- Once you use a 'struct device', all subdirs below that device
> 	  are either an attribute group for that device or a child
> 	  device.
> 	- A struct device can NOT have an attribute group as a parent,
> 	  it can ONLY have another struct device as a parent.
> 
> If you break those rules, the kernel has the ability to get really
> confused unless you are very careful, and userspace will be totally lost
> as you can not do anything special there.

The kernel gets confused?

I'm not sure I understand why userspace gets confused. I can guess
udev has some issue, but everything else seems OK, it is just a path.

> > > I'm dense and don't fully understand Greg's subdirectory comment.
> > 
> > I also don't know udev well enough. I've certainly seen drivers
> > creating extra subdirectories using kobjects.
> 
> And those drivers are broken.  Please point them out to me and I will be
> glad to go fix them.  Or tell their authors why they are broken :)

How do you fix them? It is uAPI at this point so we can't change the
directory names. Can't make them struct devices (userspace would get
confused if we add *more* sysfs files)

Grep for kobject_init_and_add() under drivers/ and I think you get a
pretty good overview of the places.

Since it seems like kind of a big problem can we make this allowed
somehow?

> > > But it doesn't seem like that level of control would be in a udev rule
> > > anyway.  A PF udev rule might *start* a program to manage MSI-X
> > > vectors, but such a program should be able to deal with whatever
> > > directory structure we want.
> >
> > Yes, I can't really see this being used from udev either. 
> 
> It doesn't matter if you think it could be used, it _will_ be used as
> you are exposing this stuff to userspace.

Well, from what I understand, it wont be used because udev can't do
three level deep attributes, and if that hasn't been a problem in that
last 10 years for the existing places, it might not ever be needed in
udev at all.

> > I assume there is also the usual race about triggering the uevent
> > before the subdirectories are created, but we have the
> > dev_set_uevent_suppress() thing now for that..
> 
> Unless you are "pci bus code" you shouldn't be using that :)

There are over 40 users now.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-31 12:19                                                 ` Jason Gunthorpe
@ 2021-03-31 15:03                                                   ` Greg Kroah-Hartman
  2021-03-31 17:07                                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 74+ messages in thread
From: Greg Kroah-Hartman @ 2021-03-31 15:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, Alexander Duyck, Keith Busch, Leon Romanovsky,
	Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller

On Wed, Mar 31, 2021 at 09:19:29AM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 31, 2021 at 08:38:39AM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Mar 30, 2021 at 07:43:41PM -0300, Jason Gunthorpe wrote:
> > > > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > > > 1 file and 1K subdirectories.
> > > 
> > > The smallest directory sizes is with the current patch since it
> > > re-uses the existing VF directory. Do we care about directory size at
> > > the sysfs level?
> > 
> > No, that should not matter.
> > 
> > The "issue" here is that you "broke" the device chain here by adding a
> > random kobject to the directory tree: "BB:DD.F"
> > 
> > Again, devices are allowed to have attributes associated with it to be
> > _ONE_ subdirectory level deep.
> > 
> > So, to use your path above, this is allowed:
> > 	0000:01:00.0/sriov/vf_msix_count
> > 
> > as these are sriov attributes for the 0000:01:00.0 device, but this is
> > not:
> > 	0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > as you "threw" a random kobject called BB:DD.F into the middle.
> >
> > If you want to have "BB:DD.F" in there, then it needs to be a real
> > struct device and _THEN_ it needs to point its parent to "0000:01:00.0",
> > another struct device, as "sriov" is NOT ANYTHING in the heirachy here
> > at all.
> 
> It isn't a struct device object at all though, it just organizing
> attributes.

That's the point, it really is not.  You are forced to create a real
object for that subdirectory, and by doing so, you are "breaking" the
driver/device model.  As is evident by userspace not knowing what is
going on here.

> > Does that help?  The rules are:
> > 	- Once you use a 'struct device', all subdirs below that device
> > 	  are either an attribute group for that device or a child
> > 	  device.
> > 	- A struct device can NOT have an attribute group as a parent,
> > 	  it can ONLY have another struct device as a parent.
> > 
> > If you break those rules, the kernel has the ability to get really
> > confused unless you are very careful, and userspace will be totally lost
> > as you can not do anything special there.
> 
> The kernel gets confused?

Putting a kobject as a child of a struct device can easily cause
confusion as that is NOT what you should be doing.  Especially if you
then try to add a device to be a child of that kobject.  Now the kernel
core can not walk all devices in the correct order and lots of other
problems that you are lucky you do not hit.

> I'm not sure I understand why userspace gets confused. I can guess
> udev has some issue, but everything else seems OK, it is just a path.

No, it is not a "path".

Again, here are the driver/device model rules:
	- once you have a "struct device", only "struct device" can be
	  children.
	- You are allowed to place attributes in subdirectories if you
	  want to make things cleaner.  Don't make me doubt giving that
	  type of permission to people by trying to abuse it by going
	  "lower" than one level.
	- If you have to represent something dynamic below a struct
	  device that is not an attribute, make it a real struct device.

And userspace "gets confused" because it thinks it can properly walk the
tree and get a record of all devices in the system.  When you put a
kobject in there just to make a new subdirectory, you break the
notification model and everything else.

Again, do not do that.

> > > > I'm dense and don't fully understand Greg's subdirectory comment.
> > > 
> > > I also don't know udev well enough. I've certainly seen drivers
> > > creating extra subdirectories using kobjects.
> > 
> > And those drivers are broken.  Please point them out to me and I will be
> > glad to go fix them.  Or tell their authors why they are broken :)
> 
> How do you fix them? It is uAPI at this point so we can't change the
> directory names. Can't make them struct devices (userspace would get
> confused if we add *more* sysfs files)

How would userspace get confused?  If anything it would suddenly "wake
up" and see these attributes properly.

> Grep for kobject_init_and_add() under drivers/ and I think you get a
> pretty good overview of the places.

Yes, lots of places where people are abusing things and it should not be
done.  Do not add to the mess by knowingly adding broken code please.

> Since it seems like kind of a big problem can we make this allowed
> somehow?

No, not at all.  Please do not do that.  I will look into the existing
users and try to see if I can fix them up.  Maybe start annoying people
by throwing warnings if you try to register a kobject as a child of a
device...

> > > > But it doesn't seem like that level of control would be in a udev rule
> > > > anyway.  A PF udev rule might *start* a program to manage MSI-X
> > > > vectors, but such a program should be able to deal with whatever
> > > > directory structure we want.
> > >
> > > Yes, I can't really see this being used from udev either. 
> > 
> > It doesn't matter if you think it could be used, it _will_ be used as
> > you are exposing this stuff to userspace.
> 
> Well, from what I understand, it wont be used because udev can't do
> three level deep attributes, and if that hasn't been a problem in that
> last 10 years for the existing places, it might not ever be needed in
> udev at all.

If userspace is not seeing these attributes then WHY CREATE THEM AT
ALL???

Seriously, what is needing to see these in sysfs if not the tools that
we have today to use sysfs?  Are you wanting to create new tools instead
to handle these new attributes?  Maybe just do not create them in the
first place?

> > > I assume there is also the usual race about triggering the uevent
> > > before the subdirectories are created, but we have the
> > > dev_set_uevent_suppress() thing now for that..
> > 
> > Unless you are "pci bus code" you shouldn't be using that :)
> 
> There are over 40 users now.

Let's not add more if you do not have to.  PCI has not needed it so far,
no need to rush to add it here if at all possible please.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-31 15:03                                                   ` Greg Kroah-Hartman
@ 2021-03-31 17:07                                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 74+ messages in thread
From: Jason Gunthorpe @ 2021-03-31 17:07 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Bjorn Helgaas, Alexander Duyck, Keith Busch, Leon Romanovsky,
	Bjorn Helgaas, Saeed Mahameed, Jakub Kicinski, linux-pci,
	linux-rdma, Netdev, Don Dutile, Alex Williamson,
	David S . Miller

On Wed, Mar 31, 2021 at 05:03:45PM +0200, Greg Kroah-Hartman wrote:
> > It isn't a struct device object at all though, it just organizing
> > attributes.
> 
> That's the point, it really is not.  You are forced to create a real
> object for that subdirectory, and by doing so, you are "breaking" the
> driver/device model.  As is evident by userspace not knowing what is
> going on here.

I'm still not really sure about what this means in practice..

I found an nested attribute in RDMA land so lets see how it behaves.
 
   /sys/class/infiniband/ibp0s9/ <-- This is a struct device/ib_device

Then we have 261 'attribute' files under a ports subdirectory, for
instance:

 /sys/class/infiniband/ibp0s9/ports/1/cm_tx_retries/dreq

Open/read works fine, and the specialty userspace that people built on
this has been working for a long time.

Does udev see the deeply nested attributes? Apparently yes:

$ udevadm info -a /sys/class/infiniband/ibp0s9
    ATTR{ports/1/cm_rx_duplicates/dreq}=="0"
    [..]

Given your remarks, I'm surprised, but it seems to work - I assume if
udevadm shows it then all the rules will work too.

Has udev become confused about what is a struct device? Looks like no:

$ udevadm info -a /sys/class/infiniband/ibp0s9/port
Unknown device "/sys/class/infiniband/ibp0s9/port": No such device

Can you give an example where things go wrong?

(and I inherited this RDMA stuff. In the last two years we moved it
 all to netlink and modern userspace largely no longer touches sysfs,
 but I can't break in-use uAPI)

> > > Does that help?  The rules are:
> > > 	- Once you use a 'struct device', all subdirs below that device
> > > 	  are either an attribute group for that device or a child
> > > 	  device.
> > > 	- A struct device can NOT have an attribute group as a parent,
> > > 	  it can ONLY have another struct device as a parent.
> > > 
> > > If you break those rules, the kernel has the ability to get really
> > > confused unless you are very careful, and userspace will be totally lost
> > > as you can not do anything special there.
> > 
> > The kernel gets confused?
> 
> Putting a kobject as a child of a struct device can easily cause
> confusion as that is NOT what you should be doing.  Especially if you
> then try to add a device to be a child of that kobject. 

That I've never seen. I've only seen people making extra levels of
directories for organizing attributes.

> > How do you fix them? It is uAPI at this point so we can't change the
> > directory names. Can't make them struct devices (userspace would get
> > confused if we add *more* sysfs files)
> 
> How would userspace get confused?  If anything it would suddenly "wake
> up" and see these attributes properly.

We are talking about specialty userspace that is designed to work with
the sysfs layout as-is. Not udev. In some of these subdirs the
userspace does readdir() on - if you start adding random stuff it will
break it.

> > Since it seems like kind of a big problem can we make this allowed
> > somehow?
> 
> No, not at all.  Please do not do that.  I will look into the existing
> users and try to see if I can fix them up.  Maybe start annoying people
> by throwing warnings if you try to register a kobject as a child of a
> device...

How does that mesh with our don't break userspace ideal?? :(

> > Well, from what I understand, it wont be used because udev can't do
> > three level deep attributes, and if that hasn't been a problem in that
> > last 10 years for the existing places, it might not ever be needed in
> > udev at all.
> 
> If userspace is not seeing these attributes then WHY CREATE THEM AT
> ALL???

*udev* is not the userspace! People expose sysfs attributes and then
make specialty userspace to consume them! I've seen it many times now.

> Seriously, what is needing to see these in sysfs if not the tools that
> we have today to use sysfs?  Are you wanting to create new tools instead
> to handle these new attributes?  Maybe just do not create them in the
> first place?

This advice is about 10 years too late :(

Regardless, lets not do deeply nested attributes here in PCI. They are
PITA anyhow.

Jason

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-03-31  4:08                                             ` Leon Romanovsky
@ 2021-04-01  1:23                                               ` Bjorn Helgaas
  2021-04-01 11:49                                                 ` Leon Romanovsky
  0 siblings, 1 reply; 74+ messages in thread
From: Bjorn Helgaas @ 2021-04-01  1:23 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Alexander Duyck, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman, Rafael J. Wysocki

[+cc Rafael, in case you're interested in the driver core issue here]

On Wed, Mar 31, 2021 at 07:08:07AM +0300, Leon Romanovsky wrote:
> On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:
> > On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > > > 
> > > > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > > > directories like this:
> > > > > 
> > > > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > > > directories with manual kobjects.
> > > > > 
> > > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > > > attributes.  So I think we could do something like this:
> > > > > > 
> > > > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > > > >     sriov/                             # SR-IOV related stuff
> > > > > >       vf_total_msix
> > > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > > > >       ...
> > > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > > > 
> > > > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > > > reasonable.
> > > > 
> > > > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > > > directory.  The full paths would be:
> > > >
> > > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> > > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> > > >   ...
> > > 
> > > Sorry, I was meaning what you first proposed:
> > > 
> > >    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > > 
> > > Which has the extra sub directory to organize the child VFs.
> > > 
> > > Keep in mind there is going to be alot of VFs here, > 1k - so this
> > > will be a huge directory.
> > 
> > With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
> > 1 + 1K files ("vf_total_msix" + 1 per VF).
> > 
> > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > 1 file and 1K subdirectories.
> 
> This is racy by design, in order to add new file and create BB:DD.F
> directory, the VF will need to do it after or during it's creation.
> During PF creation it is unknown to PF those BB:DD.F values.
> 
> The race here is due to the events of PF,VF directory already sent
> but new directory structure is not ready yet.
>
> From code perspective, we will need to add something similar to
> pci_iov_sysfs_link() with the code that you didn't like in previous
> variants (the one that messes with sysfs_create_file API).
> 
> It looks not good for large SR-IOV systems with >1K VFs with
> gazillion subdirectories inside PF, while the more natural is to see
> them in VF.
> 
> So I'm completely puzzled why you want to do these files on PF and
> not on VF as v0, v7 and v8 proposed.

On both mlx5 and NVMe, the "assign vectors to VF" functionality is
implemented by the PF, so I think it's reasonable to explore the idea
of "associate the vector assignment sysfs file with the PF."

Assume 1K VFs.  Either way we have >1K subdirectories of
/sys/devices/pci0000:00/.  I think we should avoid an extra
subdirectory level, so I think the choices on the table are:

Associate "vf_msix_count" with the PF:

  - /sys/.../<PF>/sriov/vf_total_msix    # all on PF

  - /sys/.../<PF>/sriov/vf_msix_count_BB:DD.F (1K of these).  Greg
    says the number of these is not a problem.

  - The "vf_total_msix" and "vf_msix_count_*" files are all related
    and are grouped together in PF/sriov/.

  - The "vf_msix_count_*" files operate directly on the PF.  Lock the
    PF for serialization, lookup and lock the VF to ensure no VF
    driver, call PF driver callback to assign vectors.

  - Requires special sysfs code to create/remove "vf_msix_count_*"
    files when setting/clearing VF Enable.  This code could create
    them only when the PF driver actually supports vector assignment.
    Unavoidable sysfs/uevent race, see below.

Associate "vf_msix_count" with the VF:

  - /sys/.../<PF>/sriov_vf_total_msix    # on PF

  - /sys/.../<VF>/sriov_vf_msix_count    # on each VF

  - The "sriov_vf_msix_count" files enter via the VF.  Lock the VF to
    ensure no VF driver, lookup and lock the PF for serialization,
    call PF driver callback to assign vectors.

  - Can be done with static sysfs attributes.  This means creating
    "sriov_vf_msix_count" *always*, even if PF driver doesn't support
    vector assignment.

IIUC, putting "vf_msix_count_*" under the PF involves a race.  When we
call device_add() for each new VF, it creates the VF sysfs directory
and emits the KOBJ_ADD uevent, but the "vf_msix_count_*" file doesn't
exist yet.  It can't be created before device_add() because the sysfs
directory doesn't exist.  If we create it after device_add(), the "add
VF" uevent has already been emitted, so userspace may consume it
before "vf_msix_count_*" is created.

  sriov_enable
    <set VF Enable>                     <-- VFs created on PCI
    sriov_add_vfs
      for (i = 0; i < num_vfs; i++) {
        pci_iov_add_virtfn
          pci_device_add
            device_initialize
            device_add
              device_add_attrs          <-- add VF sysfs attrs
              kobject_uevent(KOBJ_ADD)  <-- emit uevent
                                        <-- add "vf_msix_count_*" sysfs attr
          pci_iov_sysfs_link
          pci_bus_add_device
            pci_create_sysfs_dev_files
            device_attach
      }

Conceptually, I like having the "vf_total_msix" and "vf_msix_count_*"
files associated directly with the PF.  I think that's more natural
because they both operate directly on the PF.

But I don't like the race, and using static attributes seems much
cleaner implementation-wise.

Bjorn

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count
  2021-04-01  1:23                                               ` Bjorn Helgaas
@ 2021-04-01 11:49                                                 ` Leon Romanovsky
  0 siblings, 0 replies; 74+ messages in thread
From: Leon Romanovsky @ 2021-04-01 11:49 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jason Gunthorpe, Alexander Duyck, Keith Busch, Bjorn Helgaas,
	Saeed Mahameed, Jakub Kicinski, linux-pci, linux-rdma, Netdev,
	Don Dutile, Alex Williamson, David S . Miller,
	Greg Kroah-Hartman, Rafael J. Wysocki

On Wed, Mar 31, 2021 at 08:23:40PM -0500, Bjorn Helgaas wrote:
> [+cc Rafael, in case you're interested in the driver core issue here]
> 
> On Wed, Mar 31, 2021 at 07:08:07AM +0300, Leon Romanovsky wrote:
> > On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote:
> > > On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote:
> > > > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote:
> > > > > > 
> > > > > > > I think I misunderstood Greg's subdirectory comment.  We already have
> > > > > > > directories like this:
> > > > > > 
> > > > > > Yes, IIRC, Greg's remark applies if you have to start creating
> > > > > > directories with manual kobjects.
> > > > > > 
> > > > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static
> > > > > > > attributes.  So I think we could do something like this:
> > > > > > > 
> > > > > > >   /sys/bus/pci/devices/0000:01:00.0/   # PF directory
> > > > > > >     sriov/                             # SR-IOV related stuff
> > > > > > >       vf_total_msix
> > > > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of first VF
> > > > > > >       ...
> > > > > > >       vf_msix_count_BB:DD.F        # includes bus/dev/fn of last VF
> > > > > > 
> > > > > > It looks a bit odd that it isn't a subdirectory, but this seems
> > > > > > reasonable.
> > > > > 
> > > > > Sorry, I missed your point; you'll have to lay it out more explicitly.
> > > > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0
> > > > > directory.  The full paths would be:
> > > > >
> > > > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix
> > > > >   /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F
> > > > >   ...
> > > > 
> > > > Sorry, I was meaning what you first proposed:
> > > > 
> > > >    /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count
> > > > 
> > > > Which has the extra sub directory to organize the child VFs.
> > > > 
> > > > Keep in mind there is going to be alot of VFs here, > 1k - so this
> > > > will be a huge directory.
> > > 
> > > With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain
> > > 1 + 1K files ("vf_total_msix" + 1 per VF).
> > > 
> > > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain
> > > 1 file and 1K subdirectories.
> > 
> > This is racy by design, in order to add new file and create BB:DD.F
> > directory, the VF will need to do it after or during it's creation.
> > During PF creation it is unknown to PF those BB:DD.F values.
> > 
> > The race here is due to the events of PF,VF directory already sent
> > but new directory structure is not ready yet.
> >
> > From code perspective, we will need to add something similar to
> > pci_iov_sysfs_link() with the code that you didn't like in previous
> > variants (the one that messes with sysfs_create_file API).
> > 
> > It looks not good for large SR-IOV systems with >1K VFs with
> > gazillion subdirectories inside PF, while the more natural is to see
> > them in VF.
> > 
> > So I'm completely puzzled why you want to do these files on PF and
> > not on VF as v0, v7 and v8 proposed.
> 
> On both mlx5 and NVMe, the "assign vectors to VF" functionality is
> implemented by the PF, so I think it's reasonable to explore the idea
> of "associate the vector assignment sysfs file with the PF."

As you said, it is our (Linux kernel) implementation, but from user space
perspective it is seen different. The user "configures" specific VF and
doesn't need to know that we are routing his requests through PF.

> 
> Assume 1K VFs.  Either way we have >1K subdirectories of
> /sys/devices/pci0000:00/.  I think we should avoid an extra
> subdirectory level, so I think the choices on the table are:

Right, we already have 1k subdirectories and PF will have 1k virtfnX
links anyway. So for me it doesn't look appealing to see extra 1K files.

In case someone else will need to add another parameter to this sriov
overlay directory, we will find an extra 1K files and repeat again for
any new parameter.

At the end, the discussion here is to make this sysfs feature to be very
general and 1k files in one folder for every parameter is not nice (IMHO).

In theory, SR-IOV spec allows up-to 64K VFs per-PF.

> 
> Associate "vf_msix_count" with the PF:
> 
>   - /sys/.../<PF>/sriov/vf_total_msix    # all on PF
> 
>   - /sys/.../<PF>/sriov/vf_msix_count_BB:DD.F (1K of these).  Greg
>     says the number of these is not a problem.
> 
>   - The "vf_total_msix" and "vf_msix_count_*" files are all related
>     and are grouped together in PF/sriov/.
> 
>   - The "vf_msix_count_*" files operate directly on the PF.  Lock the
>     PF for serialization, lookup and lock the VF to ensure no VF
>     driver, call PF driver callback to assign vectors.
> 
>   - Requires special sysfs code to create/remove "vf_msix_count_*"
>     files when setting/clearing VF Enable.  This code could create
>     them only when the PF driver actually supports vector assignment.
>     Unavoidable sysfs/uevent race, see below.
> 
> Associate "vf_msix_count" with the VF:
> 
>   - /sys/.../<PF>/sriov_vf_total_msix    # on PF
> 
>   - /sys/.../<VF>/sriov_vf_msix_count    # on each VF
> 
>   - The "sriov_vf_msix_count" files enter via the VF.  Lock the VF to
>     ensure no VF driver, lookup and lock the PF for serialization,
>     call PF driver callback to assign vectors.
> 
>   - Can be done with static sysfs attributes.  This means creating
>     "sriov_vf_msix_count" *always*, even if PF driver doesn't support
>     vector assignment.

The same goes for the next parameter that someone will add.

> 
> IIUC, putting "vf_msix_count_*" under the PF involves a race.  When we
> call device_add() for each new VF, it creates the VF sysfs directory
> and emits the KOBJ_ADD uevent, but the "vf_msix_count_*" file doesn't
> exist yet.  It can't be created before device_add() because the sysfs
> directory doesn't exist.  If we create it after device_add(), the "add
> VF" uevent has already been emitted, so userspace may consume it
> before "vf_msix_count_*" is created.
> 
>   sriov_enable
>     <set VF Enable>                     <-- VFs created on PCI
>     sriov_add_vfs
>       for (i = 0; i < num_vfs; i++) {
>         pci_iov_add_virtfn
>           pci_device_add
>             device_initialize
>             device_add
>               device_add_attrs          <-- add VF sysfs attrs
>               kobject_uevent(KOBJ_ADD)  <-- emit uevent
>                                         <-- add "vf_msix_count_*" sysfs attr
>           pci_iov_sysfs_link
>           pci_bus_add_device
>             pci_create_sysfs_dev_files
>             device_attach
>       }
> 
> Conceptually, I like having the "vf_total_msix" and "vf_msix_count_*"
> files associated directly with the PF.  I think that's more natural
> because they both operate directly on the PF.

And this is where we disagree :)

> 
> But I don't like the race, and using static attributes seems much
> cleaner implementation-wise.

Right, so what should be my next steps in order to do not miss this merge
window?

Thanks

> 
> Bjorn

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2021-04-01 17:42 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-01  7:55 [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
2021-03-01  7:55 ` [PATCH mlx5-next v7 1/4] PCI: Add a sysfs file to change the MSI-X table size of SR-IOV VFs Leon Romanovsky
2021-03-01  8:14   ` Greg Kroah-Hartman
2021-03-01  8:32     ` Leon Romanovsky
2021-03-01  8:37       ` Greg Kroah-Hartman
2021-03-01  8:53         ` Leon Romanovsky
2021-03-01  7:55 ` [PATCH mlx5-next v7 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
2021-03-01  7:55 ` [PATCH mlx5-next v7 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
2021-03-01  7:55 ` [PATCH mlx5-next v7 4/4] net/mlx5: Implement sriov_get_vf_total_msix/count() callbacks Leon Romanovsky
2021-03-07  8:11 ` [PATCH mlx5-next v7 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
2021-03-07 18:55 ` Alexander Duyck
2021-03-07 19:19   ` Leon Romanovsky
2021-03-08 16:33     ` Alexander Duyck
2021-03-08 19:20       ` Leon Romanovsky
2021-03-10 19:09   ` Bjorn Helgaas
2021-03-10 20:10     ` Leon Romanovsky
2021-03-10 20:21       ` Greg Kroah-Hartman
2021-03-11  8:37         ` Leon Romanovsky
2021-03-10 23:34     ` Alexander Duyck
2021-03-11 18:17       ` Bjorn Helgaas
2021-03-11 19:16         ` Keith Busch
2021-03-11 19:21           ` Leon Romanovsky
2021-03-11 20:22           ` Jason Gunthorpe
2021-03-11 20:50             ` Keith Busch
2021-03-11 21:44               ` Jason Gunthorpe
2021-03-25 17:21                 ` Bjorn Helgaas
2021-03-25 17:36                   ` Jason Gunthorpe
2021-03-25 18:20                     ` Bjorn Helgaas
2021-03-25 18:28                       ` Jason Gunthorpe
2021-03-26  6:44                         ` Leon Romanovsky
2021-03-26 16:00                           ` Alexander Duyck
2021-03-26 16:56                             ` Jason Gunthorpe
2021-03-26 17:08                             ` Bjorn Helgaas
2021-03-26 17:12                               ` Jason Gunthorpe
2021-03-27  6:00                                 ` Leon Romanovsky
2021-03-26 17:29                               ` Keith Busch
2021-03-26 17:31                                 ` Jason Gunthorpe
2021-03-26 18:50                               ` Alexander Duyck
2021-03-26 19:01                                 ` Jason Gunthorpe
2021-03-30  1:29                                   ` Bjorn Helgaas
2021-03-30 13:57                                     ` Jason Gunthorpe
2021-03-30 15:00                                       ` Bjorn Helgaas
2021-03-30 19:47                                         ` Jason Gunthorpe
2021-03-30 20:41                                           ` Bjorn Helgaas
2021-03-30 22:43                                             ` Jason Gunthorpe
2021-03-31  6:38                                               ` Greg Kroah-Hartman
2021-03-31 12:19                                                 ` Jason Gunthorpe
2021-03-31 15:03                                                   ` Greg Kroah-Hartman
2021-03-31 17:07                                                     ` Jason Gunthorpe
2021-03-31  4:08                                             ` Leon Romanovsky
2021-04-01  1:23                                               ` Bjorn Helgaas
2021-04-01 11:49                                                 ` Leon Romanovsky
2021-03-30 18:10                                     ` Keith Busch
2021-03-26 19:36                                 ` Bjorn Helgaas
2021-03-27 12:38                                   ` Greg Kroah-Hartman
2021-03-25 18:31                     ` Keith Busch
2021-03-25 18:36                       ` Jason Gunthorpe
2021-03-11 19:17         ` Leon Romanovsky
2021-03-11 19:37         ` Alexander Duyck
2021-03-11 19:51           ` Leon Romanovsky
2021-03-11 20:11             ` Alexander Duyck
2021-03-11 20:19           ` Jason Gunthorpe
2021-03-11 21:49             ` Alexander Duyck
2021-03-11 23:20               ` Jason Gunthorpe
2021-03-12  2:53                 ` Alexander Duyck
2021-03-12  6:32                   ` Leon Romanovsky
2021-03-12 16:59                     ` Alexander Duyck
2021-03-12 17:03                       ` Jason Gunthorpe
2021-03-12 18:34                         ` Leon Romanovsky
2021-03-12 18:41                       ` Leon Romanovsky
2021-03-12 13:00                   ` Jason Gunthorpe
2021-03-12 13:36                     ` Keith Busch
2021-03-11 20:31         ` Jason Gunthorpe
2021-03-10  5:58 ` Leon Romanovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).