* [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count
@ 2021-01-03  8:24 Leon Romanovsky
  2021-01-03  8:24 ` [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs Leon Romanovsky
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-03  8:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev

From: Leon Romanovsky <leonro@nvidia.com>

Hi,

The number of MSI-X vectors is a PCI property visible through lspci; the
field is read-only and configured by the device.

The static assignment of MSI-X vectors doesn't allow the newly created VF
to be fully utilized, because the device doesn't know in advance the load
and configuration of the environment where that VF will be used.

The VFs are created on the hypervisor and forwarded to the VMs that have
different properties (for example, number of CPUs).

To overcome the inefficiency in the spread of such MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors, before the VF is initialized and bound to its driver.

Before this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

Configuration script:
1. Start fresh
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
modprobe -q -r mlx5_ib mlx5_core
2. Ensure that driver doesn't run and it is safe to change MSI-X
echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_drivers_autoprobe
3. Load driver for the PF
modprobe mlx5_core
4. Configure one of the VFs with new number
echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
echo 21 > /sys/bus/pci/devices/0000\:08\:00.2/vf_msix_vec

After this series:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=21 Masked-
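
The new file can also be read back, and writing 0 returns the VF to the
device default value (per the sysfs ABI added in patch 1). A minimal
sketch, assuming the same VF as above and that no driver is bound to it:
cat /sys/bus/pci/devices/0000\:08\:00.2/vf_msix_vec      # reports 21 here
echo 0 > /sys/bus/pci/devices/0000\:08\:00.2/vf_msix_vec # back to default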


Thanks

Leon Romanovsky (4):
  PCI: Configure number of MSI-X vectors for SR-IOV VFs
  net/mlx5: Add dynamic MSI-X capabilities bits
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Allow to the users to configure number of MSI-X vectors

 Documentation/ABI/testing/sysfs-bus-pci       | 16 +++++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  5 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  6 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 62 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 49 ++++++++++++++-
 drivers/pci/iov.c                             | 57 +++++++++++++++++
 drivers/pci/msi.c                             | 30 +++++++++
 drivers/pci/pci-sysfs.c                       |  1 +
 drivers/pci/pci.h                             |  1 +
 include/linux/mlx5/mlx5_ifc.h                 | 11 +++-
 include/linux/pci.h                           |  8 +++
 11 files changed, 243 insertions(+), 3 deletions(-)

--
2.29.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-03  8:24 [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-01-03  8:24 ` Leon Romanovsky
  2021-01-08  0:57   ` Bjorn Helgaas
  2021-01-03  8:24 ` [PATCH mlx5-next 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-03  8:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev

From: Leon Romanovsky <leonro@nvidia.com>

This function is applicable for SR-IOV VFs because such devices allocate
their MSI-X table before they will run on the targeted hardware and they
can't guess the right amount of vectors.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/ABI/testing/sysfs-bus-pci | 16 +++++++
 drivers/pci/iov.c                       | 57 +++++++++++++++++++++++++
 drivers/pci/msi.c                       | 30 +++++++++++++
 drivers/pci/pci-sysfs.c                 |  1 +
 drivers/pci/pci.h                       |  1 +
 include/linux/pci.h                     |  8 ++++
 6 files changed, 113 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 25c9c39770c6..30720a9e1386 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -375,3 +375,19 @@ Description:
 		The value comes from the PCI kernel device state and can be one
 		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
 		The file is read only.
+
+What:		/sys/bus/pci/devices/.../vf_msix_vec
+Date:		December 2020
+Contact:	Leon Romanovsky <leonro@nvidia.com>
+Description:
+		This file is associated with the SR-IOV VFs. It allows overwrite
+		the amount of MSI-X vectors for that VF. This is needed to optimize
+		performance of newly bounded devices by allocating the number of
+		vectors based on the internal knowledge of targeted VM.
+
+		The values accepted are:
+		 * > 0 - this will be number reported by the PCI VF's PCIe MSI-X capability.
+		 * < 0 - not valid
+		 * = 0 - will reset to the device default value
+
+		The file is writable if no driver is bounded.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4afd4ee4f7f0..0f8c570361fc 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * vf_id) & 0xff;
 }
+EXPORT_SYMBOL(pci_iov_virtfn_devfn);

 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
@@ -426,6 +427,62 @@ const struct attribute_group sriov_dev_attr_group = {
 	.is_visible = sriov_attrs_are_visible,
 };

+#ifdef CONFIG_PCI_MSI
+static ssize_t vf_msix_vec_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int numb = pci_msix_vec_count(pdev);
+
+	if (numb < 0)
+		return numb;
+
+	return sprintf(buf, "%d\n", numb);
+}
+
+static ssize_t vf_msix_vec_store(struct device *dev,
+				 struct device_attribute *attr, const char *buf,
+				 size_t count)
+{
+	struct pci_dev *vf_dev = to_pci_dev(dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	ret = pci_set_msix_vec_count(vf_dev, val);
+	if (ret)
+		return ret;
+
+	return count;
+}
+static DEVICE_ATTR_RW(vf_msix_vec);
+#endif
+
+static struct attribute *sriov_vf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_vf_msix_vec.attr,
+#endif
+	NULL,
+};
+
+static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+
+	if (dev_is_pf(dev))
+		return 0;
+
+	return a->mode;
+}
+
+const struct attribute_group sriov_vf_dev_attr_group = {
+	.attrs = sriov_vf_dev_attrs,
+	.is_visible = sriov_vf_attrs_are_visible,
+};
+
 int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
 {
 	return 0;
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 3162f88fe940..0bcd705487d9 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -991,6 +991,36 @@ int pci_msix_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msix_vec_count);

+/**
+ * pci_set_msix_vec_count - change the reported number of MSI-X vectors.
+ * This function is applicable for SR-IOV VFs because such devices allocate
+ * their MSI-X table before they will run on the targeted hardware and they
+ * can't guess the right amount of vectors.
+ * @dev: VF device that is going to be changed.
+ * @numb: amount of MSI-X vectors.
+ **/
+int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
+{
+	struct pci_dev *pdev = pci_physfn(dev);
+
+	if (!dev->msix_cap || !pdev->msix_cap)
+		return -EINVAL;
+
+	if (dev->driver || !pdev->driver ||
+	    !pdev->driver->sriov_set_msix_vec_count)
+		return -EOPNOTSUPP;
+
+	if (numb < 0)
+		/*
+		 * We don't support negative numbers for now,
+		 * but maybe in the future it will make sense.
+		 */
+		return -EINVAL;
+
+	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
+}
+EXPORT_SYMBOL(pci_set_msix_vec_count);
+
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
 			     int nvec, struct irq_affinity *affd, int flags)
 {
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index fb072f4b3176..0af2222643c2 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
 	&pci_dev_hp_attr_group,
 #ifdef CONFIG_PCI_IOV
 	&sriov_dev_attr_group,
+	&sriov_vf_dev_attr_group,
 #endif
 	&pci_bridge_attr_group,
 	&pcie_dev_attr_group,
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5c59365092fa..46396a5da2d9 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -502,6 +502,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
 void pci_restore_iov_state(struct pci_dev *dev);
 int pci_iov_bus_range(struct pci_bus *bus);
 extern const struct attribute_group sriov_dev_attr_group;
+extern const struct attribute_group sriov_vf_dev_attr_group;
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b32126d26997..1acba40a1b1b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -856,6 +856,8 @@ struct module;
  *		e.g. drivers/net/e100.c.
  * @sriov_configure: Optional driver callback to allow configuration of
  *		number of VFs to enable via sysfs "sriov_numvfs" file.
+ * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
+ *              exposed by the sysfs "vf_msix_vec" entry.
  * @err_handler: See Documentation/PCI/pci-error-recovery.rst
  * @groups:	Sysfs attribute groups.
  * @driver:	Driver model structure.
@@ -871,6 +873,7 @@ struct pci_driver {
 	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
 	void (*shutdown)(struct pci_dev *dev);
 	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
+	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
 	const struct pci_error_handlers *err_handler;
 	const struct attribute_group **groups;
 	struct device_driver	driver;
@@ -1464,6 +1467,7 @@ struct msix_entry {
 int pci_msi_vec_count(struct pci_dev *dev);
 void pci_disable_msi(struct pci_dev *dev);
 int pci_msix_vec_count(struct pci_dev *dev);
+int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
 void pci_disable_msix(struct pci_dev *dev);
 void pci_restore_msi_state(struct pci_dev *dev);
 int pci_msi_enabled(void);
@@ -2402,6 +2406,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
 void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
 #endif

+#ifdef CONFIG_PCI_IOV
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
+#endif
+
 /* Provide the legacy pci_dma_* API */
 #include <linux/pci-dma-compat.h>

--
2.29.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH mlx5-next 2/4] net/mlx5: Add dynamic MSI-X capabilities bits
  2021-01-03  8:24 [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-01-03  8:24 ` [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs Leon Romanovsky
@ 2021-01-03  8:24 ` Leon Romanovsky
  2021-01-03  8:24 ` [PATCH mlx5-next 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-03  8:24 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, linux-rdma, netdev, Saeed Mahameed

From: Leon Romanovsky <leonro@nvidia.com>

These new fields declare the number of MSI-X vectors that can be
allocated to a VF through PF configuration.

The value must be in the range defined by min_dynamic_vf_msix_table_size
and max_dynamic_vf_msix_table_size.

The driver should continue to query its MSI-X table size through the
PCI configuration header.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 8c5d5fe58051..7ac614bc592a 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1656,7 +1656,16 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8	   reserved_at_6e0[0x10];
 	u8	   sf_base_id[0x10];
 
-	u8	   reserved_at_700[0x80];
+	u8	   reserved_at_700[0x8];
+	u8	   num_total_dynamic_vf_msix[0x18];
+	u8	   reserved_at_720[0x14];
+	u8	   dynamic_msix_table_size[0xc];
+	u8	   reserved_at_740[0xc];
+	u8	   min_dynamic_vf_msix_table_size[0x4];
+	u8	   reserved_at_750[0x4];
+	u8	   max_dynamic_vf_msix_table_size[0xc];
+
+	u8	   reserved_at_760[0x20];
 	u8	   vhca_tunnel_commands[0x40];
 	u8	   reserved_at_7c0[0x40];
 };
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH mlx5-next 3/4] net/mlx5: Dynamically assign MSI-X vectors count
  2021-01-03  8:24 [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  2021-01-03  8:24 ` [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs Leon Romanovsky
  2021-01-03  8:24 ` [PATCH mlx5-next 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
@ 2021-01-03  8:24 ` Leon Romanovsky
  2021-01-03  8:24 ` [PATCH mlx5-next 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky
  2021-01-06  5:50 ` [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  4 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-03  8:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev

From: Leon Romanovsky <leonro@nvidia.com>

The number of MSI-X vectors is a PCI property visible through lspci; the
field is read-only and configured by the device. The static assignment of
MSI-X vectors doesn't allow the newly created VF to be fully utilized,
because the device doesn't know in advance the load and configuration of
the environment where that VF will be used.

To overcome the inefficiency in the spread of such MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors.

This change immediately increases the number of MSI-X vectors for a
system with 2 VFs from 12 vectors per VF to 32 vectors per VF.
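
The 32-per-VF default comes from splitting the device's dynamic MSI-X
pool evenly between the VFs and clamping the result to the
firmware-reported limits, i.e. max(min(num_total_dynamic_vf_msix /
num_vfs, max_msix), min_msix), as implemented below. A rough shell
sketch of that arithmetic, using hypothetical capability values (the
real pool size and limits are firmware specific):

num_vf_msix=2048; min_msix=12; max_msix=32; num_vfs=2   # assumed values
def=$((num_vf_msix / num_vfs))
[ "$def" -gt "$max_msix" ] && def=$max_msix
[ "$def" -lt "$min_msix" ] && def=$min_msix
echo "$def"   # -> 32 vectors per VF with these assumed values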

Before this patch:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

After this patch:
[root@server ~]# lspci -vs 0000:08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
....
	Capabilities: [9c] MSI-X: Enable- Count=32 Masked-

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    |  4 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  5 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 62 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 14 ++++-
 4 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index c08315b51fd3..8269cfbfc69d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -567,6 +567,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);

+	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
+		MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix,
+			 MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix));
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 0a0302ce7144..5babb4434a87 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -172,6 +172,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
 int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
+
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
+			    int msix_vec_count);
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
+
 struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 6fd974920394..135078e8dd55 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -55,6 +55,68 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev *dev, int vecidx)
 	return &irq_table->irq[vecidx];
 }

+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
+{
+	int num_vf_msix, min_msix, max_msix;
+
+	num_vf_msix = MLX5_CAP_GEN(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	return max(min(num_vf_msix / num_vfs, max_msix), min_msix);
+}
+
+/**
+ * mlx5_set_msix_vec_count() - Set dynamically allocated MSI-X to the VF
+ * @dev - PF to work on
+ * @function_id - internal PCI VF function id
+ * @msix_vec_count - Number of MSI-X to set
+ **/
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
+			    int msix_vec_count)
+{
+	int sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
+	int num_vf_msix, min_msix, max_msix;
+	void *hca_cap, *cap;
+	int ret;
+
+	num_vf_msix = MLX5_CAP_GEN(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	if (!MLX5_CAP_GEN(dev, vport_group_manager) || !mlx5_core_is_pf(dev))
+		return -EOPNOTSUPP;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	if (msix_vec_count < min_msix)
+		return -EINVAL;
+
+	if (msix_vec_count > max_msix)
+		return -EOVERFLOW;
+
+	hca_cap = kzalloc(sz, GFP_KERNEL);
+	if (!hca_cap)
+		return -ENOMEM;
+
+	cap = MLX5_ADDR_OF(set_hca_cap_in, hca_cap, capability);
+	MLX5_SET(cmd_hca_cap, cap, dynamic_msix_table_size, msix_vec_count);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, opcode, MLX5_CMD_OP_SET_HCA_CAP);
+	MLX5_SET(set_hca_cap_in, hca_cap, other_function, 1);
+	MLX5_SET(set_hca_cap_in, hca_cap, function_id, function_id);
+
+	MLX5_SET(set_hca_cap_in, hca_cap, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1);
+	ret = mlx5_cmd_exec_in(dev, set_hca_cap, hca_cap);
+	kfree(hca_cap);
+	return ret;
+}
+
 int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 3094d20297a9..c59efb1e7a26 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -71,8 +71,7 @@ static int sriov_restore_guids(struct mlx5_core_dev *dev, int vf)
 static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 {
 	struct mlx5_core_sriov *sriov = &dev->priv.sriov;
-	int err;
-	int vf;
+	int err, vf, num_msix_count;

 	if (!MLX5_ESWITCH_MANAGER(dev))
 		goto enable_vfs_hca;
@@ -85,12 +84,23 @@ static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 	}

 enable_vfs_hca:
+	num_msix_count = mlx5_get_default_msix_vec_count(dev, num_vfs);
 	for (vf = 0; vf < num_vfs; vf++) {
 		err = mlx5_core_enable_hca(dev, vf + 1);
 		if (err) {
 			mlx5_core_warn(dev, "failed to enable VF %d (%d)\n", vf, err);
 			continue;
 		}
+
+		err = mlx5_set_msix_vec_count(dev, vf + 1, num_msix_count);
+		if (err) {
+			mlx5_core_warn(
+				dev,
+				"failed to set MSI-X vector counts VF %d, err %d\n",
+				vf, err);
+			continue;
+		}
+
 		sriov->vfs_ctx[vf].enabled = 1;
 		if (MLX5_CAP_GEN(dev, port_type) == MLX5_CAP_PORT_TYPE_IB) {
 			err = sriov_restore_guids(dev, vf);
--
2.29.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH mlx5-next 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors
  2021-01-03  8:24 [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (2 preceding siblings ...)
  2021-01-03  8:24 ` [PATCH mlx5-next 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
@ 2021-01-03  8:24 ` Leon Romanovsky
  2021-01-06  5:50 ` [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
  4 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-03  8:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Saeed Mahameed
  Cc: Leon Romanovsky, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev

From: Leon Romanovsky <leonro@nvidia.com>

Implement the ability to configure the number of MSI-X vectors for SR-IOV VFs.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/main.c    |  1 +
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  1 +
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  4 +--
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 35 +++++++++++++++++++
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8269cfbfc69d..334b3b5077c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1647,6 +1647,7 @@ static struct pci_driver mlx5_core_driver = {
 	.shutdown	= shutdown,
 	.err_handler	= &mlx5_err_handler,
 	.sriov_configure   = mlx5_core_sriov_configure,
+	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
 };

 static void mlx5_core_verify_params(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 5babb4434a87..8a2523d2d43a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -138,6 +138,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev);
 int mlx5_sriov_attach(struct mlx5_core_dev *dev);
 void mlx5_sriov_detach(struct mlx5_core_dev *dev);
 int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs);
+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count);
 int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_create_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 135078e8dd55..65a761346385 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -59,7 +59,7 @@ int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
 {
 	int num_vf_msix, min_msix, max_msix;

-	num_vf_msix = MLX5_CAP_GEN(dev, num_total_dynamic_vf_msix);
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
 	if (!num_vf_msix)
 		return 0;

@@ -83,7 +83,7 @@ int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
 	void *hca_cap, *cap;
 	int ret;

-	num_vf_msix = MLX5_CAP_GEN(dev, num_total_dynamic_vf_msix);
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
 	if (!num_vf_msix)
 		return 0;

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index c59efb1e7a26..e63861498ef3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -188,6 +188,41 @@ int mlx5_core_sriov_configure(struct pci_dev *pdev, int num_vfs)
 	return err ? err : num_vfs;
 }

+int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count)
+{
+	struct pci_dev *pf = pci_physfn(vf);
+	struct mlx5_core_sriov *sriov;
+	struct mlx5_core_dev *dev;
+	int num_vf_msix, id;
+
+	dev = pci_get_drvdata(pf);
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return -EOPNOTSUPP;
+
+	if (!msix_vec_count)
+		msix_vec_count =
+			mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf));
+
+	sriov = &dev->priv.sriov;
+
+	/* Reversed translation of PCI VF function number to the internal
+	 * function_id, which exists in the name of virtfn symlink.
+	 */
+	for (id = 0; id < pci_num_vf(pf); id++) {
+		if (!sriov->vfs_ctx[id].enabled)
+			continue;
+
+		if (vf->devfn == pci_iov_virtfn_devfn(pf, id))
+			break;
+	}
+
+	if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled)
+		return -EINVAL;
+
+	return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count);
+}
+
 int mlx5_sriov_attach(struct mlx5_core_dev *dev)
 {
 	if (!mlx5_core_is_pf(dev) || !pci_num_vf(dev->pdev))
--
2.29.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count
  2021-01-03  8:24 [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
                   ` (3 preceding siblings ...)
  2021-01-03  8:24 ` [PATCH mlx5-next 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky
@ 2021-01-06  5:50 ` Leon Romanovsky
  4 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-06  5:50 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev

On Sun, Jan 03, 2021 at 10:24:36AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Hi,
>
> The number of MSI-X vectors is a PCI property visible through lspci; the
> field is read-only and configured by the device.
>
> The static assignment of MSI-X vectors doesn't allow the newly created VF
> to be fully utilized, because the device doesn't know in advance the load
> and configuration of the environment where that VF will be used.
>
> The VFs are created on the hypervisor and forwarded to the VMs that have
> different properties (for example, number of CPUs).
>
> To overcome the inefficiency in the spread of such MSI-X vectors, we
> allow the kernel to instruct the device with the needed number of such
> vectors, before the VF is initialized and bound to its driver.
>
> Before this series:
> [root@server ~]# lspci -vs 0000:08:00.2
> 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
> ....
> 	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-
>
> Configuration script:
> 1. Start fresh
> echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
> modprobe -q -r mlx5_ib mlx5_core
> 2. Ensure that driver doesn't run and it is safe to change MSI-X
> echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_drivers_autoprobe
> 3. Load driver for the PF
> modprobe mlx5_core
> 4. Configure one of the VFs with new number
> echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
> echo 21 > /sys/bus/pci/devices/0000\:08\:00.2/vf_msix_vec
>
> After this series:
> [root@server ~]# lspci -vs 0000:08:00.2
> 08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
> ....
> 	Capabilities: [9c] MSI-X: Enable- Count=21 Masked-
>
>
> Thanks
>
> Leon Romanovsky (4):
>   PCI: Configure number of MSI-X vectors for SR-IOV VFs
>   net/mlx5: Add dynamic MSI-X capabilities bits
>   net/mlx5: Dynamically assign MSI-X vectors count
>   net/mlx5: Allow to the users to configure number of MSI-X vectors

Hi Bjorn,

I would like to route the PCI patch through the mlx5-next tree, which
will be merged into the netdev and rdma trees.

This is needed to avoid any possible merge conflicts between the three
subsystems: PCI, netdev and RDMA.

Is it acceptable by you?

Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-03  8:24 ` [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs Leon Romanovsky
@ 2021-01-08  0:57   ` Bjorn Helgaas
  2021-01-08  3:54     ` Don Dutile
  2021-01-10  8:22     ` Leon Romanovsky
  0 siblings, 2 replies; 17+ messages in thread
From: Bjorn Helgaas @ 2021-01-08  0:57 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Alex Williamson,
	Donald Dutile

[+cc Alex, Don]

This patch does not actually *configure* the number of vectors, so the
subject is not quite accurate.  IIUC, this patch adds a sysfs file
that can be used to configure the number of vectors.  The subject
should mention the sysfs connection.

On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> This function is applicable for SR-IOV VFs because such devices allocate
> their MSI-X table before they will run on the targeted hardware and they
> can't guess the right amount of vectors.

This sentence doesn't quite have enough context to make sense to me.
Per PCIe r5.0, sec 9.5.1.2, I think PFs and VFs have independent MSI-X
Capabilities.  What is the connection between the PF MSI-X and the VF
MSI-X?

The MSI-X table sizes should be determined by the Table Size in the
Message Control register.  Apparently we write a VF's Table Size
before a driver is bound to the VF?  Where does that happen?

"Before they run on the targeted hardware" -- do you mean before the
VF is passed through to a guest virtual machine?  You mention "target
VM" below, which makes more sense to me.  VFs don't "run"; they're not
software.  I apologize for not being an expert in the use of VFs.

Please mention the sysfs path in the commit log.

> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci | 16 +++++++
>  drivers/pci/iov.c                       | 57 +++++++++++++++++++++++++
>  drivers/pci/msi.c                       | 30 +++++++++++++
>  drivers/pci/pci-sysfs.c                 |  1 +
>  drivers/pci/pci.h                       |  1 +
>  include/linux/pci.h                     |  8 ++++
>  6 files changed, 113 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 25c9c39770c6..30720a9e1386 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -375,3 +375,19 @@ Description:
>  		The value comes from the PCI kernel device state and can be one
>  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>  		The file is read only.
> +
> +What:		/sys/bus/pci/devices/.../vf_msix_vec
> +Date:		December 2020
> +Contact:	Leon Romanovsky <leonro@nvidia.com>
> +Description:
> +		This file is associated with the SR-IOV VFs. It allows overwrite
> +		the amount of MSI-X vectors for that VF. This is needed to optimize
> +		performance of newly bounded devices by allocating the number of
> +		vectors based on the internal knowledge of targeted VM.

s/allows overwrite/allows configuration of/
s/for that/for the/
s/amount of/number of/
s/bounded/bound/

What "internal knowledge" is this?  AFAICT this would have to be some
user-space administration knowledge, not anything internal to the
kernel.

> +		The values accepted are:
> +		 * > 0 - this will be number reported by the PCI VF's PCIe MSI-X capability.

s/PCI// (it's obvious we're talking about PCI here)
s/PCIe// (MSI-X is not PCIe-specific, and there's no need to mention
it at all)

> +		 * < 0 - not valid
> +		 * = 0 - will reset to the device default value
> +
> +		The file is writable if no driver is bounded.

From the code, it looks more like this:

  The file is writable if the PF is bound to a driver that supports
  the ->sriov_set_msix_vec_count() callback and there is no driver
  bound to the VF.

Please wrap all of this to fit in 80 columns like the rest of the file.

> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4afd4ee4f7f0..0f8c570361fc 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>  	return (dev->devfn + dev->sriov->offset +
>  		dev->sriov->stride * vf_id) & 0xff;
>  }
> +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
> 
>  /*
>   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> @@ -426,6 +427,62 @@ const struct attribute_group sriov_dev_attr_group = {
>  	.is_visible = sriov_attrs_are_visible,
>  };
> 
> +#ifdef CONFIG_PCI_MSI
> +static ssize_t vf_msix_vec_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	int numb = pci_msix_vec_count(pdev);
> +
> +	if (numb < 0)
> +		return numb;
> +
> +	return sprintf(buf, "%d\n", numb);
> +}
> +
> +static ssize_t vf_msix_vec_store(struct device *dev,
> +				 struct device_attribute *attr, const char *buf,
> +				 size_t count)
> +{
> +	struct pci_dev *vf_dev = to_pci_dev(dev);
> +	int val, ret;
> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	ret = pci_set_msix_vec_count(vf_dev, val);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +static DEVICE_ATTR_RW(vf_msix_vec);
> +#endif
> +
> +static struct attribute *sriov_vf_dev_attrs[] = {
> +#ifdef CONFIG_PCI_MSI
> +	&dev_attr_vf_msix_vec.attr,
> +#endif
> +	NULL,
> +};
> +
> +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> +					  struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +
> +	if (dev_is_pf(dev))
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +const struct attribute_group sriov_vf_dev_attr_group = {
> +	.attrs = sriov_vf_dev_attrs,
> +	.is_visible = sriov_vf_attrs_are_visible,
> +};
> +
>  int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>  {
>  	return 0;
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 3162f88fe940..0bcd705487d9 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -991,6 +991,36 @@ int pci_msix_vec_count(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_msix_vec_count);
> 
> +/**
> + * pci_set_msix_vec_count - change the reported number of MSI-X vectors.

Drop period at end, as other kernel doc in this file does.

> + * This function is applicable for SR-IOV VFs because such devices allocate
> + * their MSI-X table before they will run on the targeted hardware and they
> + * can't guess the right amount of vectors.
> + * @dev: VF device that is going to be changed.
> + * @numb: amount of MSI-X vectors.

Rewrite the "such devices allocate..." part based on the questions in
the commit log.  Same with "targeted hardware."

s/amount of/number of/
Drop periods at end of parameter descriptions.

> + **/
> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> +{
> +	struct pci_dev *pdev = pci_physfn(dev);
> +
> +	if (!dev->msix_cap || !pdev->msix_cap)
> +		return -EINVAL;
> +
> +	if (dev->driver || !pdev->driver ||
> +	    !pdev->driver->sriov_set_msix_vec_count)
> +		return -EOPNOTSUPP;
> +
> +	if (numb < 0)
> +		/*
> +		 * We don't support negative numbers for now,
> +		 * but maybe in the future it will make sense.
> +		 */
> +		return -EINVAL;
> +
> +	return pdev->driver->sriov_set_msix_vec_count(dev, numb);

So we write to a VF sysfs file, get here and look up the PF, call a PF
driver callback with the VF as an argument, the callback (at least for
mlx5) looks up the PF from the VF, then does some mlx5-specific magic
to the PF that influences the VF somehow?

Help me connect the dots here.  Is this required because of something
peculiar to mlx5, or is something like this required for all SR-IOV
devices because of the way the PCIe spec is written?

> +}
> +EXPORT_SYMBOL(pci_set_msix_vec_count);
> +
>  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>  			     int nvec, struct irq_affinity *affd, int flags)
>  {
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index fb072f4b3176..0af2222643c2 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
>  	&pci_dev_hp_attr_group,
>  #ifdef CONFIG_PCI_IOV
>  	&sriov_dev_attr_group,
> +	&sriov_vf_dev_attr_group,
>  #endif
>  	&pci_bridge_attr_group,
>  	&pcie_dev_attr_group,
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 5c59365092fa..46396a5da2d9 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -502,6 +502,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
>  void pci_restore_iov_state(struct pci_dev *dev);
>  int pci_iov_bus_range(struct pci_bus *bus);
>  extern const struct attribute_group sriov_dev_attr_group;
> +extern const struct attribute_group sriov_vf_dev_attr_group;
>  #else
>  static inline int pci_iov_init(struct pci_dev *dev)
>  {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index b32126d26997..1acba40a1b1b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -856,6 +856,8 @@ struct module;
>   *		e.g. drivers/net/e100.c.
>   * @sriov_configure: Optional driver callback to allow configuration of
>   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> + *              exposed by the sysfs "vf_msix_vec" entry.
>   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
>   * @groups:	Sysfs attribute groups.
>   * @driver:	Driver model structure.
> @@ -871,6 +873,7 @@ struct pci_driver {
>  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>  	void (*shutdown)(struct pci_dev *dev);
>  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>  	const struct pci_error_handlers *err_handler;
>  	const struct attribute_group **groups;
>  	struct device_driver	driver;
> @@ -1464,6 +1467,7 @@ struct msix_entry {
>  int pci_msi_vec_count(struct pci_dev *dev);
>  void pci_disable_msi(struct pci_dev *dev);
>  int pci_msix_vec_count(struct pci_dev *dev);
> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);

This patch adds the pci_set_msix_vec_count() definition in pci/msi.c
and a call in pci/iov.c.  It doesn't need to be declared in
include/linux/pci.h or exported.  It can be declared in
drivers/pci/pci.h.

>  void pci_disable_msix(struct pci_dev *dev);
>  void pci_restore_msi_state(struct pci_dev *dev);
>  int pci_msi_enabled(void);
> @@ -2402,6 +2406,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
>  void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
>  #endif
> 
> +#ifdef CONFIG_PCI_IOV
> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
> +#endif

pci_iov_virtfn_devfn() is already declared in this file.

>  /* Provide the legacy pci_dma_* API */
>  #include <linux/pci-dma-compat.h>
> 
> --
> 2.29.2
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-08  0:57   ` Bjorn Helgaas
@ 2021-01-08  3:54     ` Don Dutile
  2021-01-08  7:25       ` Leon Romanovsky
                         ` (2 more replies)
  2021-01-10  8:22     ` Leon Romanovsky
  1 sibling, 3 replies; 17+ messages in thread
From: Don Dutile @ 2021-01-08  3:54 UTC (permalink / raw)
  To: Bjorn Helgaas, Leon Romanovsky
  Cc: Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Alex Williamson

On 1/7/21 7:57 PM, Bjorn Helgaas wrote:
> [+cc Alex, Don]
>
> This patch does not actually *configure* the number of vectors, so the
> subject is not quite accurate.  IIUC, this patch adds a sysfs file
> that can be used to configure the number of vectors.  The subject
> should mention the sysfs connection.
>
> On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote:
>> From: Leon Romanovsky <leonro@nvidia.com>
>>
>> This function is applicable for SR-IOV VFs because such devices allocate
>> their MSI-X table before they will run on the targeted hardware and they
>> can't guess the right amount of vectors.
> This sentence doesn't quite have enough context to make sense to me.
> Per PCIe r5.0, sec 9.5.1.2, I think PFs and VFs have independent MSI-X
> Capabilities.  What is the connection between the PF MSI-X and the VF
> MSI-X?
+1... strip this commit log section and write it with correct, technical content.
PFs & VF's have indep MSIX caps.

Q: is this an issue where (some) mlx5's have a large MSI-X capability (per VF) that may overwhelm a system's (pci-(sub)-tree) MSI/intr capability,
and this is a sysfs-based tuning knob to reduce the max number on such 'challenged' systems?
-- ah; reading further below, it's based on some information gleaned from the VM's capability for intr support.
     -- or maybe IOMMU (intr) support on the host system, and the VF can't exceed it or config fails in the VM... whatever... it's some VM cap that's being accommodated.
> The MSI-X table sizes should be determined by the Table Size in the
> Message Control register.  Apparently we write a VF's Table Size
> before a driver is bound to the VF?  Where does that happen?
>
> "Before they run on the targeted hardware" -- do you mean before the
> VF is passed through to a guest virtual machine?  You mention "target
> VM" below, which makes more sense to me.  VFs don't "run"; they're not
> software.  I apologize for not being an expert in the use of VFs.
>
> Please mention the sysfs path in the commit log.
>
>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>> ---
>>   Documentation/ABI/testing/sysfs-bus-pci | 16 +++++++
>>   drivers/pci/iov.c                       | 57 +++++++++++++++++++++++++
>>   drivers/pci/msi.c                       | 30 +++++++++++++
>>   drivers/pci/pci-sysfs.c                 |  1 +
>>   drivers/pci/pci.h                       |  1 +
>>   include/linux/pci.h                     |  8 ++++
>>   6 files changed, 113 insertions(+)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
>> index 25c9c39770c6..30720a9e1386 100644
>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>> @@ -375,3 +375,19 @@ Description:
>>   		The value comes from the PCI kernel device state and can be one
>>   		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
>>   		The file is read only.
>> +
>> +What:		/sys/bus/pci/devices/.../vf_msix_vec
>> +Date:		December 2020
>> +Contact:	Leon Romanovsky <leonro@nvidia.com>
>> +Description:
>> +		This file is associated with the SR-IOV VFs. It allows overwrite
>> +		the amount of MSI-X vectors for that VF. This is needed to optimize
>> +		performance of newly bounded devices by allocating the number of
>> +		vectors based on the internal knowledge of targeted VM.
> s/allows overwrite/allows configuration of/
> s/for that/for the/
> s/amount of/number of/
> s/bounded/bound/
>
> What "internal knowledge" is this?  AFAICT this would have to be some
> user-space administration knowledge, not anything internal to the
> kernel.
Correct; likely (a section of) a libvirt VM description;

>
>> +		The values accepted are:
>> +		 * > 0 - this will be number reported by the PCI VF's PCIe MSI-X capability.
> s/PCI// (it's obvious we're talking about PCI here)
> s/PCIe// (MSI-X is not PCIe-specific, and there's no need to mention
> it at all)
>
>> +		 * < 0 - not valid
>> +		 * = 0 - will reset to the device default value
>> +
>> +		The file is writable if no driver is bounded.
>  From the code, it looks more like this:
>
>    The file is writable if the PF is bound to a driver that supports
>    the ->sriov_set_msix_vec_count() callback and there is no driver
>    bound to the VF.
>
> Please wrap all of this to fit in 80 columns like the rest of the file.
>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index 4afd4ee4f7f0..0f8c570361fc 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
>>   	return (dev->devfn + dev->sriov->offset +
>>   		dev->sriov->stride * vf_id) & 0xff;
>>   }
>> +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
>>
>>   /*
>>    * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
>> @@ -426,6 +427,62 @@ const struct attribute_group sriov_dev_attr_group = {
>>   	.is_visible = sriov_attrs_are_visible,
>>   };
>>
>> +#ifdef CONFIG_PCI_MSI
>> +static ssize_t vf_msix_vec_show(struct device *dev,
>> +				struct device_attribute *attr, char *buf)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(dev);
>> +	int numb = pci_msix_vec_count(pdev);
>> +
>> +	if (numb < 0)
>> +		return numb;
>> +
>> +	return sprintf(buf, "%d\n", numb);
>> +}
>> +
>> +static ssize_t vf_msix_vec_store(struct device *dev,
>> +				 struct device_attribute *attr, const char *buf,
>> +				 size_t count)
>> +{
>> +	struct pci_dev *vf_dev = to_pci_dev(dev);
>> +	int val, ret;
>> +
>> +	ret = kstrtoint(buf, 0, &val);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = pci_set_msix_vec_count(vf_dev, val);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return count;
>> +}
>> +static DEVICE_ATTR_RW(vf_msix_vec);
>> +#endif
>> +
>> +static struct attribute *sriov_vf_dev_attrs[] = {
>> +#ifdef CONFIG_PCI_MSI
>> +	&dev_attr_vf_msix_vec.attr,
>> +#endif
>> +	NULL,
>> +};
>> +
>> +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
>> +					  struct attribute *a, int n)
>> +{
>> +	struct device *dev = kobj_to_dev(kobj);
>> +
>> +	if (dev_is_pf(dev))
>> +		return 0;
>> +
>> +	return a->mode;
>> +}
>> +
>> +const struct attribute_group sriov_vf_dev_attr_group = {
>> +	.attrs = sriov_vf_dev_attrs,
>> +	.is_visible = sriov_vf_attrs_are_visible,
>> +};
>> +
>>   int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>>   {
>>   	return 0;
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index 3162f88fe940..0bcd705487d9 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -991,6 +991,36 @@ int pci_msix_vec_count(struct pci_dev *dev)
>>   }
>>   EXPORT_SYMBOL(pci_msix_vec_count);
>>
>> +/**
>> + * pci_set_msix_vec_count - change the reported number of MSI-X vectors.
> Drop period at end, as other kernel doc in this file does.
>
>> + * This function is applicable for SR-IOV VFs because such devices allocate
>> + * their MSI-X table before they will run on the targeted hardware and they
>> + * can't guess the right amount of vectors.
>> + * @dev: VF device that is going to be changed.
>> + * @numb: amount of MSI-X vectors.
> Rewrite the "such devices allocate..." part based on the questions in
> the commit log.  Same with "targeted hardware."
>
> s/amount of/number of/
> Drop periods at end of parameter descriptions.
>
>> + **/
>> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
>> +{
>> +	struct pci_dev *pdev = pci_physfn(dev);
>> +
>> +	if (!dev->msix_cap || !pdev->msix_cap)
>> +		return -EINVAL;
>> +
>> +	if (dev->driver || !pdev->driver ||
>> +	    !pdev->driver->sriov_set_msix_vec_count)
>> +		return -EOPNOTSUPP;
>> +
>> +	if (numb < 0)
>> +		/*
>> +		 * We don't support negative numbers for now,
>> +		 * but maybe in the future it will make sense.
>> +		 */
>> +		return -EINVAL;
>> +
>> +	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> So we write to a VF sysfs file, get here and look up the PF, call a PF
> driver callback with the VF as an argument, the callback (at least for
> mlx5) looks up the PF from the VF, then does some mlx5-specific magic
> to the PF that influences the VF somehow?
There's no PF lookup above.... it's just checking if a pdev has a driver with the desired msix-cap setting(reduction) feature.

> Help me connect the dots here.  Is this required because of something
> peculiar to mlx5, or is something like this required for all SR-IOV
> devices because of the way the PCIe spec is written?
So, overall, I'm guessing the mlx5 device can have 1000's of MSIX -- say, one per send/receive/completion queue.
This device capability may exceed the max number MSIX a VM can have/support (depending on guestos).
So, a sysfs tunable is used to set the max MSIX available, and thus, the device puts >1 send/rcv/completion queue intr on a given MSIX.

ok, time for Leon to better state what this patch does,
and why it's needed on mlx5 (and may be applicable to other/future high-MSIX devices assigned to VMs (NVME?)).
Hmmm, now that I said it, why is it SRIOV-centric and not pci-device centric (can pass a PF w/high number of MSIX to a VM).

-Don
>> +}
>> +EXPORT_SYMBOL(pci_set_msix_vec_count);
>> +
>>   static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>>   			     int nvec, struct irq_affinity *affd, int flags)
>>   {
>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> index fb072f4b3176..0af2222643c2 100644
>> --- a/drivers/pci/pci-sysfs.c
>> +++ b/drivers/pci/pci-sysfs.c
>> @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
>>   	&pci_dev_hp_attr_group,
>>   #ifdef CONFIG_PCI_IOV
>>   	&sriov_dev_attr_group,
>> +	&sriov_vf_dev_attr_group,
>>   #endif
>>   	&pci_bridge_attr_group,
>>   	&pcie_dev_attr_group,
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 5c59365092fa..46396a5da2d9 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -502,6 +502,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
>>   void pci_restore_iov_state(struct pci_dev *dev);
>>   int pci_iov_bus_range(struct pci_bus *bus);
>>   extern const struct attribute_group sriov_dev_attr_group;
>> +extern const struct attribute_group sriov_vf_dev_attr_group;
>>   #else
>>   static inline int pci_iov_init(struct pci_dev *dev)
>>   {
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index b32126d26997..1acba40a1b1b 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -856,6 +856,8 @@ struct module;
>>    *		e.g. drivers/net/e100.c.
>>    * @sriov_configure: Optional driver callback to allow configuration of
>>    *		number of VFs to enable via sysfs "sriov_numvfs" file.
>> + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
>> + *              exposed by the sysfs "vf_msix_vec" entry.
>>    * @err_handler: See Documentation/PCI/pci-error-recovery.rst
>>    * @groups:	Sysfs attribute groups.
>>    * @driver:	Driver model structure.
>> @@ -871,6 +873,7 @@ struct pci_driver {
>>   	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>>   	void (*shutdown)(struct pci_dev *dev);
>>   	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
>> +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>>   	const struct pci_error_handlers *err_handler;
>>   	const struct attribute_group **groups;
>>   	struct device_driver	driver;
>> @@ -1464,6 +1467,7 @@ struct msix_entry {
>>   int pci_msi_vec_count(struct pci_dev *dev);
>>   void pci_disable_msi(struct pci_dev *dev);
>>   int pci_msix_vec_count(struct pci_dev *dev);
>> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
> This patch adds the pci_set_msix_vec_count() definition in pci/msi.c
> and a call in pci/iov.c.  It doesn't need to be declared in
> include/linux/pci.h or exported.  It can be declared in
> drivers/pci/pci.h.
>
>>   void pci_disable_msix(struct pci_dev *dev);
>>   void pci_restore_msi_state(struct pci_dev *dev);
>>   int pci_msi_enabled(void);
>> @@ -2402,6 +2406,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
>>   void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
>>   #endif
>>
>> +#ifdef CONFIG_PCI_IOV
>> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
>> +#endif
> pci_iov_virtfn_devfn() is already declared in this file.
>
>>   /* Provide the legacy pci_dma_* API */
>>   #include <linux/pci-dma-compat.h>
>>
>> --
>> 2.29.2
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-08  3:54     ` Don Dutile
@ 2021-01-08  7:25       ` Leon Romanovsky
  2021-01-08 16:21         ` Alex Williamson
  2021-01-08 21:09       ` Bjorn Helgaas
  2021-01-10  8:29       ` Leon Romanovsky
  2 siblings, 1 reply; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-08  7:25 UTC (permalink / raw)
  To: Don Dutile, Bjorn Helgaas, Bjorn Helgaas
  Cc: Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski, linux-pci,
	linux-rdma, netdev, Alex Williamson

On Thu, Jan 07, 2021 at 10:54:38PM -0500, Don Dutile wrote:
> On 1/7/21 7:57 PM, Bjorn Helgaas wrote:
> > [+cc Alex, Don]

<...>

> > Help me connect the dots here.  Is this required because of something
> > peculiar to mlx5, or is something like this required for all SR-IOV
> > devices because of the way the PCIe spec is written?
> So, overall, I'm guessing the mlx5 device can have 1000's of MSIX -- say, one per send/receive/completion queue.
> This device capability may exceed the max number MSIX a VM can have/support (depending on guestos).
> So, a sysfs tunable is used to set the max MSIX available, and thus, the device puts >1 send/rcv/completion queue intr on a given MSIX.
>
> ok, time for Leon to better state what this patch does,
> and why it's needed on mlx5 (and may be applicable to other/future high-MSIX devices assigned to VMs (NVME?)).
> Hmmm, now that I said it, why is it SRIOV-centric and not pci-device centric (can pass a PF w/high number of MSIX to a VM).

Thanks Don and Bjorn,

I will answer on all comments a little bit later when I will return
to the office (Sunday).

However it is important for me to present the use case.

Our mlx5 SR-IOV devices were always capable of driving many MSI-X vectors
(up to 2K, don't catch me on the exact number), however when a user creates
VFs, the FW has no knowledge of how those VFs will be used. So the FW had
no choice but to statically and equally assign the same number of MSI-X
vectors to all VFs.

After SR-IOV VF creation, the user will bind those new VFs to VMs, but the
VMs have different numbers of CPUs, and despite the HW being able to deliver
all the needed vectors (in the mlx5 netdev world, number of channels ==
number of CPUs == number of vectors), we will be limited by the already-set
low number of vectors.

So it is not for vector reduction, but more for vector re-partition.

As an example, imagine mlx5 with two VFs. One VF is bound to a VM with 200
CPUs and another is bound to a VM with 1 CPU. They need different numbers
of MSI-X vectors.
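
To make it concrete, with sriov_drivers_autoprobe disabled as in the
cover letter script, the re-partition for those two VMs could look
roughly like this (hypothetical BDFs and vector counts, assuming the
device's dynamic MSI-X range allows them):

echo 2 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs
echo 200 > /sys/bus/pci/devices/0000\:08\:00.2/vf_msix_vec   # VF for the 200-CPU VM
echo 2 > /sys/bus/pci/devices/0000\:08\:00.3/vf_msix_vec     # VF for the 1-CPU VM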

Hope that I succeeded to explain :).

Regarding why it is SR-IOV-specific and not generic PCI: the number of
MSI-X vectors is a read-only field in PCI config space, so we can't write
it from pci/core toward the PF device and expect the HW to update it. If
we really needed that, we would have to do it after the driver is already
loaded on that PF, so the driver could forward it to the HW and lspci
would report it correctly. This would require re-running the whole PCI
device initialization sequence, because the MSI-X table size is
pre-calculated very early in the init flow.

Thanks

>
> -Don

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-08  7:25       ` Leon Romanovsky
@ 2021-01-08 16:21         ` Alex Williamson
  2021-01-10  8:47           ` Leon Romanovsky
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Williamson @ 2021-01-08 16:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Don Dutile, Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, netdev

On Fri, 8 Jan 2021 09:25:25 +0200
Leon Romanovsky <leon@kernel.org> wrote:

> On Thu, Jan 07, 2021 at 10:54:38PM -0500, Don Dutile wrote:
> > On 1/7/21 7:57 PM, Bjorn Helgaas wrote:  
> > > [+cc Alex, Don]  
> 
> <...>
> 
> > > Help me connect the dots here.  Is this required because of something
> > > peculiar to mlx5, or is something like this required for all SR-IOV
> > > devices because of the way the PCIe spec is written?  
> > So, overall, I'm guessing the mlx5 device can have 1000's of MSIX -- say, one per send/receive/completion queue.
> > This device capability may exceed the max number MSIX a VM can have/support (depending on guestos).
> > So, a sysfs tunable is used to set the max MSIX available, and thus, the device puts >1 send/rcv/completion queue intr on a given MSIX.
> >
> > ok, time for Leon to better state what this patch does,
> > and why it's needed on mlx5 (and may be applicable to other/future high-MSIX devices assigned to VMs (NVME?)).
> > Hmmm, now that I said it, why is it SRIOV-centric and not pci-device centric (can pass a PF w/high number of MSIX to a VM).  
> 
> Thanks Don and Bjorn,
> 
> I will answer all the comments a little later, when I return to the
> office (Sunday).
> 
> However, it is important for me to present the use case first.
> 
> Our mlx5 SR-IOV devices have always been capable of driving many MSI-X vectors
> (up to 2K, don't hold me to the exact number), but when the user creates VFs,
> the FW has no knowledge of how those VFs will be used. So the FW has no choice
> but to statically and equally assign the same number of MSI-X vectors to all VFs.
> 
> After SR-IOV VF creation, the user binds those new VFs to VMs, but the VMs have
> different numbers of CPUs, and even though the HW is able to deliver all the
> needed vectors (in the mlx5 netdev world, number of channels == number of CPUs
> == number of vectors), we are limited by the low number of vectors that was
> already set.
> 
> So it is not about vector reduction, but more about vector re-partition.
> 
> As an example, imagine an mlx5 device with two VFs. One VF is bound to a VM with
> 200 CPUs and the other is bound to a VM with 1 CPU. They need different numbers
> of MSI-X vectors.
> 
> Hope that explains it :).

The idea is not unreasonable imo, but without knowing the size of the
vector pool, range available per vf, or ultimately whether the vf
supports this feature before we try to configure it, I don't see how
userspace is expected to make use of this in the general case.  If the
configuration requires such specific vf vector usage and pf driver
specific knowledge, I'm not sure it's fit as a generic pci-sysfs
interface.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-08  3:54     ` Don Dutile
  2021-01-08  7:25       ` Leon Romanovsky
@ 2021-01-08 21:09       ` Bjorn Helgaas
  2021-01-09  2:54         ` Don Dutile
  2021-01-10  8:25         ` Leon Romanovsky
  2021-01-10  8:29       ` Leon Romanovsky
  2 siblings, 2 replies; 17+ messages in thread
From: Bjorn Helgaas @ 2021-01-08 21:09 UTC (permalink / raw)
  To: Don Dutile
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Alex Williamson

On Thu, Jan 07, 2021 at 10:54:38PM -0500, Don Dutile wrote:
> On 1/7/21 7:57 PM, Bjorn Helgaas wrote:
> > On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote:

> > > + **/
> > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > > +{
> > > +	struct pci_dev *pdev = pci_physfn(dev);
> > > +
> > > +	if (!dev->msix_cap || !pdev->msix_cap)
> > > +		return -EINVAL;
> > > +
> > > +	if (dev->driver || !pdev->driver ||
> > > +	    !pdev->driver->sriov_set_msix_vec_count)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (numb < 0)
> > > +		/*
> > > +		 * We don't support negative numbers for now,
> > > +		 * but maybe in the future it will make sense.
> > > +		 */
> > > +		return -EINVAL;
> > > +
> > > +	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> >
> > So we write to a VF sysfs file, get here and look up the PF, call a PF
> > driver callback with the VF as an argument, the callback (at least for
> > mlx5) looks up the PF from the VF, then does some mlx5-specific magic
> > to the PF that influences the VF somehow?
>
> There's no PF lookup above.... it's just checking if a pdev has a
> driver with the desired msix-cap setting(reduction) feature.

We started with the VF (the sysfs file is attached to the VF).  "pdev"
is the corresponding PF; that's what I meant by "looking up the PF".
Then we call the PF driver sriov_set_msix_vec_count() method.

I asked because this raises questions of whether we need mutual
exclusion or some other coordination between setting this for multiple
VFs.

Obviously it's great to answer all these in email, but at the end of
the day, the rationale needs to be in the commit, either in code
comments or the commit log.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-08 21:09       ` Bjorn Helgaas
@ 2021-01-09  2:54         ` Don Dutile
  2021-01-10  8:33           ` Leon Romanovsky
  2021-01-10  8:25         ` Leon Romanovsky
  1 sibling, 1 reply; 17+ messages in thread
From: Don Dutile @ 2021-01-09  2:54 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Leon Romanovsky, Bjorn Helgaas, Saeed Mahameed, Leon Romanovsky,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, netdev,
	Alex Williamson

On 1/8/21 4:09 PM, Bjorn Helgaas wrote:
> On Thu, Jan 07, 2021 at 10:54:38PM -0500, Don Dutile wrote:
>> On 1/7/21 7:57 PM, Bjorn Helgaas wrote:
>>> On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote:
>>>> + **/
>>>> +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
>>>> +{
>>>> +	struct pci_dev *pdev = pci_physfn(dev);
>>>> +
>>>> +	if (!dev->msix_cap || !pdev->msix_cap)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (dev->driver || !pdev->driver ||
>>>> +	    !pdev->driver->sriov_set_msix_vec_count)
>>>> +		return -EOPNOTSUPP;
>>>> +
>>>> +	if (numb < 0)
>>>> +		/*
>>>> +		 * We don't support negative numbers for now,
>>>> +		 * but maybe in the future it will make sense.
>>>> +		 */
>>>> +		return -EINVAL;
>>>> +
>>>> +	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
>>> So we write to a VF sysfs file, get here and look up the PF, call a PF
>>> driver callback with the VF as an argument, the callback (at least for
>>> mlx5) looks up the PF from the VF, then does some mlx5-specific magic
>>> to the PF that influences the VF somehow?
>> There's no PF lookup above.... it's just checking if a pdev has a
>> driver with the desired msix-cap setting(reduction) feature.
> We started with the VF (the sysfs file is attached to the VF).  "pdev"
> is the corresponding PF; that's what I meant by "looking up the PF".
> Then we call the PF driver sriov_set_msix_vec_count() method.
ah, got it; now I see how your statement relates to the files &/or pdev.

> I asked because this raises questions of whether we need mutual
> exclusion or some other coordination between setting this for multiple
> VFs.
>
> Obviously it's great to answer all these in email, but at the end of
> the day, the rationale needs to be in the commit, either in code
> comments or the commit log.
>
I'm still not getting why this is not per-(vf)pdev -- just b/c a device has an N-entry MSI-X capability doesn't mean it all has to be used/configured.
Setting the max MSI-X for VFs in the PF's pdev means it is the same number for all VFs ... and I'm not sure that's the right solution either.
It should still be (v)pdev-based, IMO.
--dd


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-08  0:57   ` Bjorn Helgaas
  2021-01-08  3:54     ` Don Dutile
@ 2021-01-10  8:22     ` Leon Romanovsky
  1 sibling, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-10  8:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe, Jakub Kicinski,
	linux-pci, linux-rdma, netdev, Alex Williamson, Donald Dutile

On Thu, Jan 07, 2021 at 06:57:21PM -0600, Bjorn Helgaas wrote:
> [+cc Alex, Don]
>
> This patch does not actually *configure* the number of vectors, so the
> subject is not quite accurate.  IIUC, this patch adds a sysfs file
> that can be used to configure the number of vectors.  The subject
> should mention the sysfs connection.

I'll change it to:
"PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs"

>
> On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > This function is applicable for SR-IOV VFs because such devices allocate
> > their MSI-X table before they will run on the targeted hardware and they
> > can't guess the right amount of vectors.
>
> This sentence doesn't quite have enough context to make sense to me.
> Per PCIe r5.0, sec 9.5.1.2, I think PFs and VFs have independent MSI-X
> Capabilities.  What is the connection between the PF MSI-X and the VF
> MSI-X?

Right, the PF and VF have independent capabilities, but MSI-X vectors are a
limited HW resource and the device has a pool of such vectors to distribute
to the VFs.

The connection between the PF and VF is a logical one. The PF exists and is
bound to a driver, so it has the ability to actually write to the HW and
change the VF configuration before a driver is bound to the VF.

>
> The MSI-X table sizes should be determined by the Table Size in the
> Message Control register.  Apparently we write a VF's Table Size
> before a driver is bound to the VF?  Where does that happen?

The table size is set by the HW when SR-IOV is enabled and the VFs are created:
echo num_sriov > /sys/bus/pci/devices/.../sriov_numvfs
At this point the VFs have their table size set, but it is not used yet.

The VF driver will read this table size when it enables MSI-X:
 pci_enable_msix_range
  __pci_enable_msix_range
   __pci_enable_msix
    pci_msix_vec_count

>
> "Before they run on the targeted hardware" -- do you mean before the
> VF is passed through to a guest virtual machine?  You mention "target
> VM" below, which makes more sense to me.  VFs don't "run"; they're not
> software.  I apologize for not being an expert in the use of VFs.

Yes, "target" == "VM".

>
> Please mention the sysfs path in the commit log.

Will do.

>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/ABI/testing/sysfs-bus-pci | 16 +++++++
> >  drivers/pci/iov.c                       | 57 +++++++++++++++++++++++++
> >  drivers/pci/msi.c                       | 30 +++++++++++++
> >  drivers/pci/pci-sysfs.c                 |  1 +
> >  drivers/pci/pci.h                       |  1 +
> >  include/linux/pci.h                     |  8 ++++
> >  6 files changed, 113 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > index 25c9c39770c6..30720a9e1386 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > @@ -375,3 +375,19 @@ Description:
> >  		The value comes from the PCI kernel device state and can be one
> >  		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> >  		The file is read only.
> > +
> > +What:		/sys/bus/pci/devices/.../vf_msix_vec
> > +Date:		December 2020
> > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > +Description:
> > +		This file is associated with the SR-IOV VFs. It allows overwrite
> > +		the amount of MSI-X vectors for that VF. This is needed to optimize
> > +		performance of newly bounded devices by allocating the number of
> > +		vectors based on the internal knowledge of targeted VM.
>
> s/allows overwrite/allows configuration of/
> s/for that/for the/
> s/amount of/number of/
> s/bounded/bound/

I changed it, thanks

>
> What "internal knowledge" is this?  AFAICT this would have to be some
> user-space administration knowledge, not anything internal to the
> kernel.

Yes, it is not internal to the kernel; it is the administrator's knowledge.
In our case, it is the orchestration software that allocates such VFs to the
users.

>
> > +		The values accepted are:
> > +		 * > 0 - this will be number reported by the PCI VF's PCIe MSI-X capability.
>
> s/PCI// (it's obvious we're talking about PCI here)
> s/PCIe// (MSI-X is not PCIe-specific, and there's no need to mention
> it at all)

Done

>
> > +		 * < 0 - not valid
> > +		 * = 0 - will reset to the device default value
> > +
> > +		The file is writable if no driver is bounded.
>
> From the code, it looks more like this:
>
>   The file is writable if the PF is bound to a driver that supports
>   the ->sriov_set_msix_vec_count() callback and there is no driver
>   bound to the VF.

I added it to the description.

>
> Please wrap all of this to fit in 80 columns like the rest of the file.

Fixed

>
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 4afd4ee4f7f0..0f8c570361fc 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> >  	return (dev->devfn + dev->sriov->offset +
> >  		dev->sriov->stride * vf_id) & 0xff;
> >  }
> > +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
> >
> >  /*
> >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
> > @@ -426,6 +427,62 @@ const struct attribute_group sriov_dev_attr_group = {
> >  	.is_visible = sriov_attrs_are_visible,
> >  };
> >
> > +#ifdef CONFIG_PCI_MSI
> > +static ssize_t vf_msix_vec_show(struct device *dev,
> > +				struct device_attribute *attr, char *buf)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	int numb = pci_msix_vec_count(pdev);
> > +
> > +	if (numb < 0)
> > +		return numb;
> > +
> > +	return sprintf(buf, "%d\n", numb);
> > +}
> > +
> > +static ssize_t vf_msix_vec_store(struct device *dev,
> > +				 struct device_attribute *attr, const char *buf,
> > +				 size_t count)
> > +{
> > +	struct pci_dev *vf_dev = to_pci_dev(dev);
> > +	int val, ret;
> > +
> > +	ret = kstrtoint(buf, 0, &val);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pci_set_msix_vec_count(vf_dev, val);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return count;
> > +}
> > +static DEVICE_ATTR_RW(vf_msix_vec);
> > +#endif
> > +
> > +static struct attribute *sriov_vf_dev_attrs[] = {
> > +#ifdef CONFIG_PCI_MSI
> > +	&dev_attr_vf_msix_vec.attr,
> > +#endif
> > +	NULL,
> > +};
> > +
> > +static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
> > +					  struct attribute *a, int n)
> > +{
> > +	struct device *dev = kobj_to_dev(kobj);
> > +
> > +	if (dev_is_pf(dev))
> > +		return 0;
> > +
> > +	return a->mode;
> > +}
> > +
> > +const struct attribute_group sriov_vf_dev_attr_group = {
> > +	.attrs = sriov_vf_dev_attrs,
> > +	.is_visible = sriov_vf_attrs_are_visible,
> > +};
> > +
> >  int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
> >  {
> >  	return 0;
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 3162f88fe940..0bcd705487d9 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -991,6 +991,36 @@ int pci_msix_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msix_vec_count);
> >
> > +/**
> > + * pci_set_msix_vec_count - change the reported number of MSI-X vectors.
>
> Drop period at end, as other kernel doc in this file does.

Done

>
> > + * This function is applicable for SR-IOV VFs because such devices allocate
> > + * their MSI-X table before they will run on the targeted hardware and they
> > + * can't guess the right amount of vectors.
> > + * @dev: VF device that is going to be changed.
> > + * @numb: amount of MSI-X vectors.
>
> Rewrite the "such devices allocate..." part based on the questions in
> the commit log.  Same with "targeted hardware."
>
> s/amount of/number of/
> Drop periods at end of parameter descriptions.

Done

>
> > + **/
> > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > +{
> > +	struct pci_dev *pdev = pci_physfn(dev);
> > +
> > +	if (!dev->msix_cap || !pdev->msix_cap)
> > +		return -EINVAL;
> > +
> > +	if (dev->driver || !pdev->driver ||
> > +	    !pdev->driver->sriov_set_msix_vec_count)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (numb < 0)
> > +		/*
> > +		 * We don't support negative numbers for now,
> > +		 * but maybe in the future it will make sense.
> > +		 */
> > +		return -EINVAL;
> > +
> > +	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
>
> So we write to a VF sysfs file, get here and look up the PF, call a PF
> driver callback with the VF as an argument, the callback (at least for
> mlx5) looks up the PF from the VF, then does some mlx5-specific magic
> to the PF that influences the VF somehow?

Right. Because the HW has already created the VFs, the PF driver is aware of
them, so it simply tells the FW that a specific VF should have a different
value in its table size.

>
> Help me connect the dots here.  Is this required because of something
> peculiar to mlx5, or is something like this required for all SR-IOV
> devices because of the way the PCIe spec is written?

The second is correct; there is nothing mlx5-specific in it. This is a
combination of the spec together with the Linux SR-IOV implementation logic.

First, the PCI spec has a single bit to enable/disable all VFs at the same
time, with no ability to dynamically add/delete them. It means all SR-IOV HW
in the world will do the same thing: split the internal MSI-X pool equally, or
by some other fixed logic. This is needed so that lspci, right after the VFs
are created, shows proper values in the MSI-X section.

Second, Linux followed the spec and implemented the same allocation model,
separated by layers: by the time the driver probes, all PCI config should
already be ready at the PCI level. It means that even changing the MSI-X count
inside the VF during VF driver init, with a driver reload later, is
potentially problematic.

>
> > +}
> > +EXPORT_SYMBOL(pci_set_msix_vec_count);
> > +
> >  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> >  			     int nvec, struct irq_affinity *affd, int flags)
> >  {
> > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > index fb072f4b3176..0af2222643c2 100644
> > --- a/drivers/pci/pci-sysfs.c
> > +++ b/drivers/pci/pci-sysfs.c
> > @@ -1557,6 +1557,7 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
> >  	&pci_dev_hp_attr_group,
> >  #ifdef CONFIG_PCI_IOV
> >  	&sriov_dev_attr_group,
> > +	&sriov_vf_dev_attr_group,
> >  #endif
> >  	&pci_bridge_attr_group,
> >  	&pcie_dev_attr_group,
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 5c59365092fa..46396a5da2d9 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -502,6 +502,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
> >  void pci_restore_iov_state(struct pci_dev *dev);
> >  int pci_iov_bus_range(struct pci_bus *bus);
> >  extern const struct attribute_group sriov_dev_attr_group;
> > +extern const struct attribute_group sriov_vf_dev_attr_group;
> >  #else
> >  static inline int pci_iov_init(struct pci_dev *dev)
> >  {
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index b32126d26997..1acba40a1b1b 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -856,6 +856,8 @@ struct module;
> >   *		e.g. drivers/net/e100.c.
> >   * @sriov_configure: Optional driver callback to allow configuration of
> >   *		number of VFs to enable via sysfs "sriov_numvfs" file.
> > + * @sriov_set_msix_vec_count: Driver callback to change number of MSI-X vectors
> > + *              exposed by the sysfs "vf_msix_vec" entry.
> >   * @err_handler: See Documentation/PCI/pci-error-recovery.rst
> >   * @groups:	Sysfs attribute groups.
> >   * @driver:	Driver model structure.
> > @@ -871,6 +873,7 @@ struct pci_driver {
> >  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
> >  	void (*shutdown)(struct pci_dev *dev);
> >  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
> > +	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
> >  	const struct pci_error_handlers *err_handler;
> >  	const struct attribute_group **groups;
> >  	struct device_driver	driver;
> > @@ -1464,6 +1467,7 @@ struct msix_entry {
> >  int pci_msi_vec_count(struct pci_dev *dev);
> >  void pci_disable_msi(struct pci_dev *dev);
> >  int pci_msix_vec_count(struct pci_dev *dev);
> > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb);
>
> This patch adds the pci_set_msix_vec_count() definition in pci/msi.c
> and a call in pci/iov.c.  It doesn't need to be declared in
> include/linux/pci.h or exported.  It can be declared in
> drivers/pci/pci.h.

I changed it, thanks

>
> >  void pci_disable_msix(struct pci_dev *dev);
> >  void pci_restore_msi_state(struct pci_dev *dev);
> >  int pci_msi_enabled(void);
> > @@ -2402,6 +2406,10 @@ static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
> >  void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
> >  #endif
> >
> > +#ifdef CONFIG_PCI_IOV
> > +int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id);
> > +#endif
>
> pci_iov_virtfn_devfn() is already declared in this file.

Sorry for that.

>
> >  /* Provide the legacy pci_dma_* API */
> >  #include <linux/pci-dma-compat.h>
> >
> > --
> > 2.29.2
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-08 21:09       ` Bjorn Helgaas
  2021-01-09  2:54         ` Don Dutile
@ 2021-01-10  8:25         ` Leon Romanovsky
  1 sibling, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-10  8:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Don Dutile, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Alex Williamson

On Fri, Jan 08, 2021 at 03:09:13PM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 07, 2021 at 10:54:38PM -0500, Don Dutile wrote:
> > On 1/7/21 7:57 PM, Bjorn Helgaas wrote:
> > > On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote:
>
> > > > + **/
> > > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > > > +{
> > > > +	struct pci_dev *pdev = pci_physfn(dev);
> > > > +
> > > > +	if (!dev->msix_cap || !pdev->msix_cap)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (dev->driver || !pdev->driver ||
> > > > +	    !pdev->driver->sriov_set_msix_vec_count)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	if (numb < 0)
> > > > +		/*
> > > > +		 * We don't support negative numbers for now,
> > > > +		 * but maybe in the future it will make sense.
> > > > +		 */
> > > > +		return -EINVAL;
> > > > +
> > > > +	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> > >
> > > So we write to a VF sysfs file, get here and look up the PF, call a PF
> > > driver callback with the VF as an argument, the callback (at least for
> > > mlx5) looks up the PF from the VF, then does some mlx5-specific magic
> > > to the PF that influences the VF somehow?
> >
> > There's no PF lookup above.... it's just checking if a pdev has a
> > driver with the desired msix-cap setting(reduction) feature.
>
> We started with the VF (the sysfs file is attached to the VF).  "pdev"
> is the corresponding PF; that's what I meant by "looking up the PF".
> Then we call the PF driver sriov_set_msix_vec_count() method.
>
> I asked because this raises questions of whether we need mutual
> exclusion or some other coordination between setting this for multiple
> VFs.

MSI-X vectors are managed by the HW and they are separate per VF.
IMHO, it is better if SW doesn't try to do too much coordination.

Thanks

>
> Obviously it's great to answer all these in email, but at the end of
> the day, the rationale needs to be in the commit, either in code
> comments or the commit log.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-08  3:54     ` Don Dutile
  2021-01-08  7:25       ` Leon Romanovsky
  2021-01-08 21:09       ` Bjorn Helgaas
@ 2021-01-10  8:29       ` Leon Romanovsky
  2 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-10  8:29 UTC (permalink / raw)
  To: Don Dutile
  Cc: Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Alex Williamson

On Thu, Jan 07, 2021 at 10:54:38PM -0500, Don Dutile wrote:
> On 1/7/21 7:57 PM, Bjorn Helgaas wrote:
> > [+cc Alex, Don]
> >
> > This patch does not actually *configure* the number of vectors, so the
> > subject is not quite accurate.  IIUC, this patch adds a sysfs file
> > that can be used to configure the number of vectors.  The subject
> > should mention the sysfs connection.
> >
> > On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > This function is applicable for SR-IOV VFs because such devices allocate
> > > their MSI-X table before they will run on the targeted hardware and they
> > > can't guess the right amount of vectors.
> > This sentence doesn't quite have enough context to make sense to me.
> > Per PCIe r5.0, sec 9.5.1.2, I think PFs and VFs have independent MSI-X
> > Capabilities.  What is the connection between the PF MSI-X and the VF
> > MSI-X?
> +1... strip this commit log section and write it with correct, technical content.
> PFs & VF's have indep MSIX caps.
>
> Q: is this an issue where (some) mlx5's have a large MSI-X capability (per VF) that may overwhelm a system's (pci-(sub)-tree) MSI / intr capability,
> and this is a sysfs-based tuning knob to reduce the max number on such 'challenged' systems?
> -- ah; reading further below, it's based on some information gleaned from the VM's capability for intr. support.
>     -- or maybe IOMMU (intr) support on the host system, and the VF can't exceed it or config fails in the VM... whatever... it's some VM cap that's being accommodated.

I hope that this answers it:
https://lore.kernel.org/linux-pci/20210110082206.GD31158@unreal/T/#md5dfc2edaaa686331ab3ce73496df7f58421c550

This feature is for MSI-X repartition and reduction.

> > The MSI-X table sizes should be determined by the Table Size in the
> > Message Control register.  Apparently we write a VF's Table Size
> > before a driver is bound to the VF?  Where does that happen?
> >
> > "Before they run on the targeted hardware" -- do you mean before the
> > VF is passed through to a guest virtual machine?  You mention "target
> > VM" below, which makes more sense to me.  VFs don't "run"; they're not
> > software.  I apologize for not being an expert in the use of VFs.
> >
> > Please mention the sysfs path in the commit log.
> >
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >   Documentation/ABI/testing/sysfs-bus-pci | 16 +++++++
> > >   drivers/pci/iov.c                       | 57 +++++++++++++++++++++++++
> > >   drivers/pci/msi.c                       | 30 +++++++++++++
> > >   drivers/pci/pci-sysfs.c                 |  1 +
> > >   drivers/pci/pci.h                       |  1 +
> > >   include/linux/pci.h                     |  8 ++++
> > >   6 files changed, 113 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > index 25c9c39770c6..30720a9e1386 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > @@ -375,3 +375,19 @@ Description:
> > >   		The value comes from the PCI kernel device state and can be one
> > >   		of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
> > >   		The file is read only.
> > > +
> > > +What:		/sys/bus/pci/devices/.../vf_msix_vec
> > > +Date:		December 2020
> > > +Contact:	Leon Romanovsky <leonro@nvidia.com>
> > > +Description:
> > > +		This file is associated with the SR-IOV VFs. It allows overwrite
> > > +		the amount of MSI-X vectors for that VF. This is needed to optimize
> > > +		performance of newly bounded devices by allocating the number of
> > > +		vectors based on the internal knowledge of targeted VM.
> > s/allows overwrite/allows configuration of/
> > s/for that/for the/
> > s/amount of/number of/
> > s/bounded/bound/
> >
> > What "internal knowledge" is this?  AFAICT this would have to be some
> > user-space administration knowledge, not anything internal to the
> > kernel.
> Correct; likely a (section of a) libvirt VM description.

Right, libvirt and/or orchestration software.

Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-09  2:54         ` Don Dutile
@ 2021-01-10  8:33           ` Leon Romanovsky
  0 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-10  8:33 UTC (permalink / raw)
  To: Don Dutile
  Cc: Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed, Jason Gunthorpe,
	Jakub Kicinski, linux-pci, linux-rdma, netdev, Alex Williamson

On Fri, Jan 08, 2021 at 09:54:47PM -0500, Don Dutile wrote:
> On 1/8/21 4:09 PM, Bjorn Helgaas wrote:
> > On Thu, Jan 07, 2021 at 10:54:38PM -0500, Don Dutile wrote:
> > > On 1/7/21 7:57 PM, Bjorn Helgaas wrote:
> > > > On Sun, Jan 03, 2021 at 10:24:37AM +0200, Leon Romanovsky wrote:
> > > > > + **/
> > > > > +int pci_set_msix_vec_count(struct pci_dev *dev, int numb)
> > > > > +{
> > > > > +	struct pci_dev *pdev = pci_physfn(dev);
> > > > > +
> > > > > +	if (!dev->msix_cap || !pdev->msix_cap)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (dev->driver || !pdev->driver ||
> > > > > +	    !pdev->driver->sriov_set_msix_vec_count)
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	if (numb < 0)
> > > > > +		/*
> > > > > +		 * We don't support negative numbers for now,
> > > > > +		 * but maybe in the future it will make sense.
> > > > > +		 */
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	return pdev->driver->sriov_set_msix_vec_count(dev, numb);
> > > > So we write to a VF sysfs file, get here and look up the PF, call a PF
> > > > driver callback with the VF as an argument, the callback (at least for
> > > > mlx5) looks up the PF from the VF, then does some mlx5-specific magic
> > > > to the PF that influences the VF somehow?
> > > There's no PF lookup above.... it's just checking if a pdev has a
> > > driver with the desired msix-cap setting(reduction) feature.
> > We started with the VF (the sysfs file is attached to the VF).  "pdev"
> > is the corresponding PF; that's what I meant by "looking up the PF".
> > Then we call the PF driver sriov_set_msix_vec_count() method.
> ah, got how your statement relates to the files &/or pdev.
>
> > I asked because this raises questions of whether we need mutual
> > exclusion or some other coordination between setting this for multiple
> > VFs.
> >
> > Obviously it's great to answer all these in email, but at the end of
> > the day, the rationale needs to be in the commit, either in code
> > comments or the commit log.
> >
> I'm still not getting why this is not per-(vf)pdev -- just b/c a device has an N-entry MSI-X capability doesn't mean it all has to be used/configured.
> Setting the max MSI-X for VFs in the PF's pdev means it is the same number for all VFs ... and I'm not sure that's the right solution either.
> It should still be (v)pdev-based, IMO.

The proposed solution is per-VF; am I missing something in this discussion?

> --dd
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs
  2021-01-08 16:21         ` Alex Williamson
@ 2021-01-10  8:47           ` Leon Romanovsky
  0 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2021-01-10  8:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Don Dutile, Bjorn Helgaas, Bjorn Helgaas, Saeed Mahameed,
	Jason Gunthorpe, Jakub Kicinski, linux-pci, linux-rdma, netdev

On Fri, Jan 08, 2021 at 09:21:45AM -0700, Alex Williamson wrote:
> On Fri, 8 Jan 2021 09:25:25 +0200
> Leon Romanovsky <leon@kernel.org> wrote:
>
> > On Thu, Jan 07, 2021 at 10:54:38PM -0500, Don Dutile wrote:
> > > On 1/7/21 7:57 PM, Bjorn Helgaas wrote:
> > > > [+cc Alex, Don]
> >
> > <...>
> >
> > > > Help me connect the dots here.  Is this required because of something
> > > > peculiar to mlx5, or is something like this required for all SR-IOV
> > > > devices because of the way the PCIe spec is written?
> > > So, overall, I'm guessing the mlx5 device can have 1000's of MSIX -- say, one per send/receive/completion queue.
> > > This device capability may exceed the max number MSIX a VM can have/support (depending on guestos).
> > > So, a sysfs tunable is used to set the max MSIX available, and thus, the device puts >1 send/rcv/completion queue intr on a given MSIX.
> > >
> > > ok, time for Leon to better state what this patch does,
> > > and why it's needed on mlx5 (and may be applicable to other/future high-MSIX devices assigned to VMs (NVME?)).
> > > Hmmm, now that I said it, why is it SRIOV-centric and not pci-device centric (can pass a PF w/high number of MSIX to a VM).
> >
> > Thanks Don and Bjorn,
> >
> > I will answer all the comments a little later, when I return to the
> > office (Sunday).
> > 
> > However, it is important for me to present the use case first.
> >
> > Our mlx5 SR-IOV devices have always been capable of driving many MSI-X vectors
> > (up to 2K, don't hold me to the exact number), but when the user creates VFs,
> > the FW has no knowledge of how those VFs will be used. So the FW has no choice
> > but to statically and equally assign the same number of MSI-X vectors to all VFs.
> > 
> > After SR-IOV VF creation, the user binds those new VFs to VMs, but the VMs have
> > different numbers of CPUs, and even though the HW is able to deliver all the
> > needed vectors (in the mlx5 netdev world, number of channels == number of CPUs
> > == number of vectors), we are limited by the low number of vectors that was
> > already set.
> > 
> > So it is not about vector reduction, but more about vector re-partition.
> > 
> > As an example, imagine an mlx5 device with two VFs. One VF is bound to a VM with
> > 200 CPUs and the other is bound to a VM with 1 CPU. They need different numbers
> > of MSI-X vectors.
> > 
> > Hope that explains it :).
>
> The idea is not unreasonable imo, but without knowing the size of the
> vector pool, range available per vf, or ultimately whether the vf
> supports this feature before we try to configure it, I don't see how
> userspace is expected to make use of this in the general case.  If the
> configuration requires such specific vf vector usage and pf driver
> specific knowledge, I'm not sure it's fit as a generic pci-sysfs
> interface.  Thanks,

I didn't prohibit reading the newly created sysfs file, but if I change the
vf_msix_vec_show() implementation to return -EOPNOTSUPP for an unsupported
device, the software will be able to distinguish supported from not supported.

SW will read this file:
	-> success -> feature supported
	-> failure -> feature not supported

There is one extra sysfs file needed: vf_total_msix. That file will give the
total number of MSI-X vectors that can be configured.

The same supported/not-supported logic applies here as well.

The feature itself will be used by orchestration software that will make
decisions based on the already configured values, future promises, and the
overall total number. The positive outcome of this scheme is that the driver
stays lean.
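
To illustrate, here is a rough sketch of how such orchestration software could
use the two files. The BDFs are made up, the placement of vf_total_msix on the
PF and the chosen value are assumptions, and the exact semantics are still
under discussion:

    VF=/sys/bus/pci/devices/0000:82:00.2    # hypothetical VF address
    PF=/sys/bus/pci/devices/0000:82:00.0    # its parent PF

    # Probe for support: a successful read means the feature is present.
    if ! cat $VF/vf_msix_vec > /dev/null 2>&1; then
            echo "dynamic MSI-X sizing not supported"
            exit 0
    fi

    total=$(cat $PF/vf_total_msix)    # size of the device vector pool
    current=$(cat $VF/vf_msix_vec)    # what the VF has right now
    echo "pool: $total vectors, VF currently has: $current"

    # Decide on a new value from the target VM properties (e.g. CPU count)
    # and the remaining budget, then apply it before a driver binds to the VF.
    echo 16 > $VF/vf_msix_vec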

Thanks

>
> Alex
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-01-10  8:48 UTC | newest]

Thread overview: 17+ messages
2021-01-03  8:24 [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
2021-01-03  8:24 ` [PATCH mlx5-next 1/4] PCI: Configure number of MSI-X vectors for SR-IOV VFs Leon Romanovsky
2021-01-08  0:57   ` Bjorn Helgaas
2021-01-08  3:54     ` Don Dutile
2021-01-08  7:25       ` Leon Romanovsky
2021-01-08 16:21         ` Alex Williamson
2021-01-10  8:47           ` Leon Romanovsky
2021-01-08 21:09       ` Bjorn Helgaas
2021-01-09  2:54         ` Don Dutile
2021-01-10  8:33           ` Leon Romanovsky
2021-01-10  8:25         ` Leon Romanovsky
2021-01-10  8:29       ` Leon Romanovsky
2021-01-10  8:22     ` Leon Romanovsky
2021-01-03  8:24 ` [PATCH mlx5-next 2/4] net/mlx5: Add dynamic MSI-X capabilities bits Leon Romanovsky
2021-01-03  8:24 ` [PATCH mlx5-next 3/4] net/mlx5: Dynamically assign MSI-X vectors count Leon Romanovsky
2021-01-03  8:24 ` [PATCH mlx5-next 4/4] net/mlx5: Allow to the users to configure number of MSI-X vectors Leon Romanovsky
2021-01-06  5:50 ` [PATCH mlx5-next 0/4] Dynamically assign MSI-X vectors count Leon Romanovsky
