* [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF)
@ 2024-02-15  3:07 Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 01/15] net/mlx5: Add MPIR bit in mcam_access_reg Saeed Mahameed
                   ` (14 more replies)
  0 siblings, 15 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:07 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Saeed Mahameed <saeedm@nvidia.com>

Support Socket-Direct multi-dev netdev.

V3:
- Fix documentation per Jakub's feedback
- Fix typos
- Link new documentation in the networking index.rst

V2:
- Add documentation in a new patch.
- Add debugfs in a new patch.
- Add mlx5_ifc bit for MPIR cap check and use it before query.

V1:
- https://lore.kernel.org/netdev/20231221005721.186607-1-saeed@kernel.org/


For more information, please see the tag log below.

Please pull and let me know if there is any problem.

Thanks,
Saeed.


The following changes since commit d1d77120bc2867b3e449e07ee656a26b2fb03d1e:

  net: phy: dp83826: support TX data voltage tuning (2024-02-14 12:06:10 +0000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-socket-direct-v3

for you to fetch changes up to 0684f46f6e4ed5070a2ef6b32b3bd368d5387227:

  Documentation: networking: Add description for multi-pf netdev (2024-02-14 19:01:59 -0800)

----------------------------------------------------------------
Support Multi-PF netdev (Socket Direct)

This series adds support for combining multiple devices (PFs) of the
same port under one netdev instance. Passing traffic through different
devices belonging to different NUMA sockets avoids cross-NUMA traffic
and allows apps running on the same netdev from different NUMA nodes to
still feel a sense of proximity to the device and achieve improved
performance.

We achieve this by grouping PFs together, and creating the netdev only
once all group members are probed. Symmetrically, we destroy the netdev
once any of the PFs is removed.

The channels are distributed across all devices; a proper configuration
utilizes the NUMA-local device when an application runs on a given CPU.
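
As an illustration, using the naming of the SD lib introduced in this
series, channel i maps to a device and to a per-device completion
vector as follows (num_devices is called host_buses in the code):

    dev_ix = ch_ix % num_devices;  /* round-robin across the group */
    vec_ix = ch_ix / num_devices;  /* vector index on that device */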

We pick one device to be the primary (leader), and it fills a special
role.  The other devices (secondaries) are disconnected from the network
at the chip level (set to silent mode). All RX/TX traffic is steered
through the primary to/from the secondaries.

Currently, we limit the support to PFs only, and up to two devices
(sockets).

V3:
- Fix documentation per Jakub's feedback.
- Fix typos
- Link new documentation in the networking index.rst

V2:
- Add documentation in a new patch.
- Add debugfs in a new patch.
- Add mlx5_ifc bit for MPIR cap check and use it before query.

----------------------------------------------------------------
Tariq Toukan (15):
      net/mlx5: Add MPIR bit in mcam_access_reg
      net/mlx5: SD, Introduce SD lib
      net/mlx5: SD, Implement basic query and instantiation
      net/mlx5: SD, Implement devcom communication and primary election
      net/mlx5: SD, Implement steering for primary and secondaries
      net/mlx5: SD, Add informative prints in kernel log
      net/mlx5: SD, Add debugfs
      net/mlx5e: Create single netdev per SD group
      net/mlx5e: Create EN core HW resources for all secondary devices
      net/mlx5e: Let channels be SD-aware
      net/mlx5e: Support cross-vhca RSS
      net/mlx5e: Support per-mdev queue counter
      net/mlx5e: Block TLS device offload on combined SD netdev
      net/mlx5: Enable SD feature
      Documentation: networking: Add description for multi-pf netdev

 Documentation/networking/index.rst                 |   1 +
 Documentation/networking/multi-pf-netdev.rst       | 157 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   9 +-
 .../net/ethernet/mellanox/mlx5/core/en/channels.c  |  10 +-
 .../net/ethernet/mellanox/mlx5/core/en/channels.h  |   6 +-
 .../ethernet/mellanox/mlx5/core/en/monitor_stats.c |  48 +-
 .../net/ethernet/mellanox/mlx5/core/en/params.c    |   9 +-
 .../net/ethernet/mellanox/mlx5/core/en/params.h    |   3 -
 drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c   |  12 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/qos.c   |   8 +-
 .../ethernet/mellanox/mlx5/core/en/reporter_rx.c   |   4 +-
 .../ethernet/mellanox/mlx5/core/en/reporter_tx.c   |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/rqt.c   | 123 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/en/rqt.h   |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/rss.c   |  17 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/rss.h   |   4 +-
 .../net/ethernet/mellanox/mlx5/core/en/rx_res.c    |  62 ++-
 .../net/ethernet/mellanox/mlx5/core/en/rx_res.h    |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/en/trap.c  |  11 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/pool.c  |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/setup.c |   8 +-
 .../ethernet/mellanox/mlx5/core/en_accel/ktls.c    |   2 +-
 .../ethernet/mellanox/mlx5/core/en_accel/ktls.h    |   4 +-
 .../ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 176 +++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  39 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |   4 +-
 .../net/ethernet/mellanox/mlx5/core/lib/devcom.h   |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h |  12 +
 drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c   | 524 +++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h   |  38 ++
 include/linux/mlx5/driver.h                        |   1 +
 include/linux/mlx5/mlx5_ifc.h                      |   4 +-
 34 files changed, 1151 insertions(+), 173 deletions(-)
 create mode 100644 Documentation/networking/multi-pf-netdev.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h


* [net-next V3 01/15] net/mlx5: Add MPIR bit in mcam_access_reg
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 02/15] net/mlx5: SD, Introduce SD lib Saeed Mahameed
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Add a cap bit in mcam_access_reg to check for MPIR support.
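
For illustration only, the bit is expected to be consumed through the
standard MCAM cap helper; the actual user is added in a downstream
patch:

    if (!MLX5_CAP_MCAM_REG(dev, mpir))
        return 0; /* MPIR not supported, Socket-Direct is unavailable */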

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 7f5e846eb46d..f0ad2eace6eb 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -10253,7 +10253,9 @@ struct mlx5_ifc_mcam_access_reg_bits {
 	u8         mcqi[0x1];
 	u8         mcqs[0x1];
 
-	u8         regs_95_to_87[0x9];
+	u8         regs_95_to_90[0x6];
+	u8         mpir[0x1];
+	u8         regs_88_to_87[0x2];
 	u8         mpegc[0x1];
 	u8         mtutc[0x1];
 	u8         regs_84_to_68[0x11];
-- 
2.43.0



* [net-next V3 02/15] net/mlx5: SD, Introduce SD lib
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 01/15] net/mlx5: Add MPIR bit in mcam_access_reg Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 03/15] net/mlx5: SD, Implement basic query and instantiation Saeed Mahameed
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Add the Socket-Direct API with an empty/minimal implementation.
We fill in the implementation gradually in downstream patches.
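
For illustration only, a sketch (not part of this patch) of how callers
are expected to walk an SD group with the helpers below; do_work() is a
hypothetical placeholder. With the stubs in this patch the loops visit
only the primary; they cover the whole group once the devcom wiring
lands in a downstream patch:

    struct mlx5_core_dev *pos;
    int i;

    /* Visit the primary (index 0) and every secondary. */
    mlx5_sd_for_each_dev(i, primary, pos)
        do_work(pos);

    /* Visit the secondaries only (indices start at 1). */
    mlx5_sd_for_each_secondary(i, primary, pos)
        do_work(pos);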

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |  2 +-
 .../ethernet/mellanox/mlx5/core/lib/mlx5.h    | 11 ++++
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 60 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/lib/sd.h  | 38 ++++++++++++
 4 files changed, 110 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index c44870b175f9..76dc5a9b9648 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -29,7 +29,7 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += en/rqt.o en/tir.o en/rss.o en/rx_res.o \
 		en/reporter_tx.o en/reporter_rx.o en/params.o en/xsk/pool.o \
 		en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o en/devlink.o en/ptp.o \
 		en/qos.o en/htb.o en/trap.o en/fs_tt_redirect.o en/selq.o \
-		lib/crypto.o
+		lib/crypto.o lib/sd.o
 
 #
 # Netdev extra
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h
index 2b5826a785c4..0810b92b48d0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h
@@ -54,4 +54,15 @@ static inline struct net_device *mlx5_uplink_netdev_get(struct mlx5_core_dev *md
 {
 	return mdev->mlx5e_res.uplink_netdev;
 }
+
+struct mlx5_sd;
+
+static inline struct mlx5_sd *mlx5_get_sd(struct mlx5_core_dev *dev)
+{
+	return NULL;
+}
+
+static inline void mlx5_set_sd(struct mlx5_core_dev *dev, struct mlx5_sd *sd)
+{
+}
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
new file mode 100644
index 000000000000..ea37238c4519
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include "lib/sd.h"
+#include "mlx5_core.h"
+
+#define sd_info(__dev, format, ...) \
+	dev_info((__dev)->device, "Socket-Direct: " format, ##__VA_ARGS__)
+#define sd_warn(__dev, format, ...) \
+	dev_warn((__dev)->device, "Socket-Direct: " format, ##__VA_ARGS__)
+
+struct mlx5_sd {
+};
+
+static int mlx5_sd_get_host_buses(struct mlx5_core_dev *dev)
+{
+	return 1;
+}
+
+struct mlx5_core_dev *
+mlx5_sd_primary_get_peer(struct mlx5_core_dev *primary, int idx)
+{
+	if (idx == 0)
+		return primary;
+
+	return NULL;
+}
+
+int mlx5_sd_ch_ix_get_dev_ix(struct mlx5_core_dev *dev, int ch_ix)
+{
+	return ch_ix % mlx5_sd_get_host_buses(dev);
+}
+
+int mlx5_sd_ch_ix_get_vec_ix(struct mlx5_core_dev *dev, int ch_ix)
+{
+	return ch_ix / mlx5_sd_get_host_buses(dev);
+}
+
+struct mlx5_core_dev *mlx5_sd_ch_ix_get_dev(struct mlx5_core_dev *primary, int ch_ix)
+{
+	int mdev_idx = mlx5_sd_ch_ix_get_dev_ix(primary, ch_ix);
+
+	return mlx5_sd_primary_get_peer(primary, mdev_idx);
+}
+
+int mlx5_sd_init(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
+{
+}
+
+struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
+					  struct auxiliary_device *adev,
+					  int idx)
+{
+	return adev;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
new file mode 100644
index 000000000000..137efaf9aabc
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#ifndef __MLX5_LIB_SD_H__
+#define __MLX5_LIB_SD_H__
+
+#define MLX5_SD_MAX_GROUP_SZ 2
+
+struct mlx5_sd;
+
+struct mlx5_core_dev *mlx5_sd_primary_get_peer(struct mlx5_core_dev *primary, int idx);
+int mlx5_sd_ch_ix_get_dev_ix(struct mlx5_core_dev *dev, int ch_ix);
+int mlx5_sd_ch_ix_get_vec_ix(struct mlx5_core_dev *dev, int ch_ix);
+struct mlx5_core_dev *mlx5_sd_ch_ix_get_dev(struct mlx5_core_dev *primary, int ch_ix);
+struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
+					  struct auxiliary_device *adev,
+					  int idx);
+
+int mlx5_sd_init(struct mlx5_core_dev *dev);
+void mlx5_sd_cleanup(struct mlx5_core_dev *dev);
+
+#define mlx5_sd_for_each_dev_from_to(i, primary, ix_from, to, pos)	\
+	for (i = ix_from;							\
+	     (pos = mlx5_sd_primary_get_peer(primary, i)) && pos != (to); i++)
+
+#define mlx5_sd_for_each_dev(i, primary, pos)				\
+	mlx5_sd_for_each_dev_from_to(i, primary, 0, NULL, pos)
+
+#define mlx5_sd_for_each_dev_to(i, primary, to, pos)			\
+	mlx5_sd_for_each_dev_from_to(i, primary, 0, to, pos)
+
+#define mlx5_sd_for_each_secondary(i, primary, pos)			\
+	mlx5_sd_for_each_dev_from_to(i, primary, 1, NULL, pos)
+
+#define mlx5_sd_for_each_secondary_to(i, primary, to, pos)		\
+	mlx5_sd_for_each_dev_from_to(i, primary, 1, to, pos)
+
+#endif /* __MLX5_LIB_SD_H__ */
-- 
2.43.0



* [net-next V3 03/15] net/mlx5: SD, Implement basic query and instantiation
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 01/15] net/mlx5: Add MPIR bit in mcam_access_reg Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 02/15] net/mlx5: SD, Introduce SD lib Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 04/15] net/mlx5: SD, Implement devcom communication and primary election Saeed Mahameed
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Add an implementation for querying the MPIR register for Socket-Direct
attributes, and instantiating an SD struct accordingly.
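
For illustration, a simplified sketch of what gets queried and derived
(error handling omitted, see sd_init() below):

    mlx5_query_mpir_reg(dev, out);
    mlx5_query_nic_vport_sd_group(dev, &sd_group);
    sdm        = MLX5_GET(mpir_reg, out, sdm);
    host_buses = MLX5_GET(mpir_reg, out, host_buses);

    /* The group id combines the native port number with sd_group. */
    group_id = (MLX5_CAP_GEN(dev, native_port_num) << 8) | sd_group;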

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 110 +++++++++++++++++-
 1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index ea37238c4519..b1f86549af1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -3,6 +3,8 @@
 
 #include "lib/sd.h"
 #include "mlx5_core.h"
+#include "lib/mlx5.h"
+#include <linux/mlx5/vport.h>
 
 #define sd_info(__dev, format, ...) \
 	dev_info((__dev)->device, "Socket-Direct: " format, ##__VA_ARGS__)
@@ -10,11 +12,18 @@
 	dev_warn((__dev)->device, "Socket-Direct: " format, ##__VA_ARGS__)
 
 struct mlx5_sd {
+	u32 group_id;
+	u8 host_buses;
 };
 
 static int mlx5_sd_get_host_buses(struct mlx5_core_dev *dev)
 {
-	return 1;
+	struct mlx5_sd *sd = mlx5_get_sd(dev);
+
+	if (!sd)
+		return 1;
+
+	return sd->host_buses;
 }
 
 struct mlx5_core_dev *
@@ -43,13 +52,112 @@ struct mlx5_core_dev *mlx5_sd_ch_ix_get_dev(struct mlx5_core_dev *primary, int c
 	return mlx5_sd_primary_get_peer(primary, mdev_idx);
 }
 
+static bool mlx5_sd_is_supported(struct mlx5_core_dev *dev, u8 host_buses)
+{
+	/* Feature is currently implemented for PFs only */
+	if (!mlx5_core_is_pf(dev))
+		return false;
+
+	/* Honor the SW implementation limit */
+	if (host_buses > MLX5_SD_MAX_GROUP_SZ)
+		return false;
+
+	return true;
+}
+
+static int mlx5_query_sd(struct mlx5_core_dev *dev, bool *sdm,
+			 u8 *host_buses, u8 *sd_group)
+{
+	u32 out[MLX5_ST_SZ_DW(mpir_reg)];
+	int err;
+
+	err = mlx5_query_mpir_reg(dev, out);
+	if (err)
+		return err;
+
+	err = mlx5_query_nic_vport_sd_group(dev, sd_group);
+	if (err)
+		return err;
+
+	*sdm = MLX5_GET(mpir_reg, out, sdm);
+	*host_buses = MLX5_GET(mpir_reg, out, host_buses);
+
+	return 0;
+}
+
+static u32 mlx5_sd_group_id(struct mlx5_core_dev *dev, u8 sd_group)
+{
+	return (u32)((MLX5_CAP_GEN(dev, native_port_num) << 8) | sd_group);
+}
+
+static int sd_init(struct mlx5_core_dev *dev)
+{
+	u8 host_buses, sd_group;
+	struct mlx5_sd *sd;
+	u32 group_id;
+	bool sdm;
+	int err;
+
+	if (!MLX5_CAP_MCAM_REG(dev, mpir))
+		return 0;
+
+	err = mlx5_query_sd(dev, &sdm, &host_buses, &sd_group);
+	if (err)
+		return err;
+
+	if (!sdm)
+		return 0;
+
+	if (!sd_group)
+		return 0;
+
+	group_id = mlx5_sd_group_id(dev, sd_group);
+
+	if (!mlx5_sd_is_supported(dev, host_buses)) {
+		sd_warn(dev, "can't support requested netdev combining for group id 0x%x), skipping\n",
+			group_id);
+		return 0;
+	}
+
+	sd = kzalloc(sizeof(*sd), GFP_KERNEL);
+	if (!sd)
+		return -ENOMEM;
+
+	sd->host_buses = host_buses;
+	sd->group_id = group_id;
+
+	mlx5_set_sd(dev, sd);
+
+	return 0;
+}
+
+static void sd_cleanup(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sd *sd = mlx5_get_sd(dev);
+
+	mlx5_set_sd(dev, NULL);
+	kfree(sd);
+}
+
 int mlx5_sd_init(struct mlx5_core_dev *dev)
 {
+	int err;
+
+	err = sd_init(dev);
+	if (err)
+		return err;
+
 	return 0;
 }
 
 void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 {
+	struct mlx5_sd *sd = mlx5_get_sd(dev);
+
+	if (!sd)
+		return;
+
+	sd_cleanup(dev);
 }
 
 struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
-- 
2.43.0



* [net-next V3 04/15] net/mlx5: SD, Implement devcom communication and primary election
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (2 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 03/15] net/mlx5: SD, Implement basic query and instantiation Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 05/15] net/mlx5: SD, Implement steering for primary and secondaries Saeed Mahameed
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Use devcom to communicate between the different devices. Add a new
devcom component type for this.

Each device registers itself to the devcom component <SD, group ID>.
Once all devices of a component are registered, the component becomes
ready, and a primary device is elected.

In principle, any of the devices can act as the primary; they are all
capable, and a random election would have worked. However, we aim for
predictability and consistency, hence each group always chooses the same
device, the one with the lowest PCI bus number, as the primary.
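
For illustration, the election rule boils down to (simplified from
sd_register() below):

    /* The device on the lowest PCI bus number becomes the primary. */
    primary = dev;
    mlx5_devcom_for_each_peer_entry(devcom, peer, pos)
        if (peer->pdev->bus->number < primary->pdev->bus->number)
            primary = peer;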

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/lib/devcom.h  |   1 +
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 122 +++++++++++++++++-
 2 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
index ec32b686f586..d58032dd0df7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.h
@@ -10,6 +10,7 @@ enum mlx5_devcom_component {
 	MLX5_DEVCOM_ESW_OFFLOADS,
 	MLX5_DEVCOM_MPV,
 	MLX5_DEVCOM_HCA_PORTS,
+	MLX5_DEVCOM_SD_GROUP,
 	MLX5_DEVCOM_NUM_COMPONENTS,
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index b1f86549af1c..3059a3750f82 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -14,6 +14,16 @@
 struct mlx5_sd {
 	u32 group_id;
 	u8 host_buses;
+	struct mlx5_devcom_comp_dev *devcom;
+	bool primary;
+	union {
+		struct { /* primary */
+			struct mlx5_core_dev *secondaries[MLX5_SD_MAX_GROUP_SZ - 1];
+		};
+		struct { /* secondary */
+			struct mlx5_core_dev *primary_dev;
+		};
+	};
 };
 
 static int mlx5_sd_get_host_buses(struct mlx5_core_dev *dev)
@@ -26,13 +36,29 @@ static int mlx5_sd_get_host_buses(struct mlx5_core_dev *dev)
 	return sd->host_buses;
 }
 
+static struct mlx5_core_dev *mlx5_sd_get_primary(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sd *sd = mlx5_get_sd(dev);
+
+	if (!sd)
+		return dev;
+
+	return sd->primary ? dev : sd->primary_dev;
+}
+
 struct mlx5_core_dev *
 mlx5_sd_primary_get_peer(struct mlx5_core_dev *primary, int idx)
 {
+	struct mlx5_sd *sd;
+
 	if (idx == 0)
 		return primary;
 
-	return NULL;
+	if (idx >= mlx5_sd_get_host_buses(primary))
+		return NULL;
+
+	sd = mlx5_get_sd(primary);
+	return sd->secondaries[idx - 1];
 }
 
 int mlx5_sd_ch_ix_get_dev_ix(struct mlx5_core_dev *dev, int ch_ix)
@@ -139,15 +165,93 @@ static void sd_cleanup(struct mlx5_core_dev *dev)
 	kfree(sd);
 }
 
+static int sd_register(struct mlx5_core_dev *dev)
+{
+	struct mlx5_devcom_comp_dev *devcom, *pos;
+	struct mlx5_core_dev *peer, *primary;
+	struct mlx5_sd *sd, *primary_sd;
+	int err, i;
+
+	sd = mlx5_get_sd(dev);
+	devcom = mlx5_devcom_register_component(dev->priv.devc, MLX5_DEVCOM_SD_GROUP,
+						sd->group_id, NULL, dev);
+	if (!devcom)
+		return -ENOMEM;
+
+	sd->devcom = devcom;
+
+	if (mlx5_devcom_comp_get_size(devcom) != sd->host_buses)
+		return 0;
+
+	mlx5_devcom_comp_lock(devcom);
+	mlx5_devcom_comp_set_ready(devcom, true);
+	mlx5_devcom_comp_unlock(devcom);
+
+	if (!mlx5_devcom_for_each_peer_begin(devcom)) {
+		err = -ENODEV;
+		goto err_devcom_unreg;
+	}
+
+	primary = dev;
+	mlx5_devcom_for_each_peer_entry(devcom, peer, pos)
+		if (peer->pdev->bus->number < primary->pdev->bus->number)
+			primary = peer;
+
+	primary_sd = mlx5_get_sd(primary);
+	primary_sd->primary = true;
+	i = 0;
+	/* loop the secondaries */
+	mlx5_devcom_for_each_peer_entry(primary_sd->devcom, peer, pos) {
+		struct mlx5_sd *peer_sd = mlx5_get_sd(peer);
+
+		primary_sd->secondaries[i++] = peer;
+		peer_sd->primary = false;
+		peer_sd->primary_dev = primary;
+	}
+
+	mlx5_devcom_for_each_peer_end(devcom);
+	return 0;
+
+err_devcom_unreg:
+	mlx5_devcom_comp_lock(sd->devcom);
+	mlx5_devcom_comp_set_ready(sd->devcom, false);
+	mlx5_devcom_comp_unlock(sd->devcom);
+	mlx5_devcom_unregister_component(sd->devcom);
+	return err;
+}
+
+static void sd_unregister(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sd *sd = mlx5_get_sd(dev);
+
+	mlx5_devcom_comp_lock(sd->devcom);
+	mlx5_devcom_comp_set_ready(sd->devcom, false);
+	mlx5_devcom_comp_unlock(sd->devcom);
+	mlx5_devcom_unregister_component(sd->devcom);
+}
+
 int mlx5_sd_init(struct mlx5_core_dev *dev)
 {
+	struct mlx5_sd *sd = mlx5_get_sd(dev);
 	int err;
 
 	err = sd_init(dev);
 	if (err)
 		return err;
 
+	sd = mlx5_get_sd(dev);
+	if (!sd)
+		return 0;
+
+	err = sd_register(dev);
+	if (err)
+		goto err_sd_cleanup;
+
 	return 0;
+
+err_sd_cleanup:
+	sd_cleanup(dev);
+	return err;
 }
 
 void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
@@ -157,6 +261,7 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 	if (!sd)
 		return;
 
+	sd_unregister(dev);
 	sd_cleanup(dev);
 }
 
@@ -164,5 +269,18 @@ struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
 					  struct auxiliary_device *adev,
 					  int idx)
 {
-	return adev;
+	struct mlx5_sd *sd = mlx5_get_sd(dev);
+	struct mlx5_core_dev *primary;
+
+	if (!sd)
+		return adev;
+
+	if (!mlx5_devcom_comp_is_ready(sd->devcom))
+		return NULL;
+
+	primary = mlx5_sd_get_primary(dev);
+	if (dev == primary)
+		return adev;
+
+	return &primary->priv.adev[idx]->adev;
 }
-- 
2.43.0



* [net-next V3 05/15] net/mlx5: SD, Implement steering for primary and secondaries
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (3 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 04/15] net/mlx5: SD, Implement devcom communication and primary election Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 06/15] net/mlx5: SD, Add informative prints in kernel log Saeed Mahameed
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Implement the needed SD steering adjustments for the primary and
secondaries.

While the multiple SD PFs are used to avoid cross-NUMA memory access,
at the chip level all traffic goes through the primary device only.
The secondaries are forced into silent mode, to guarantee they are not
involved in any unexpected ingress/egress traffic.

In RX, secondary devices will not have steering objects. Traffic will be
steered from the primary device to the RQs of a secondary device using
advanced cross-vhca RX steering capabilities.

In TX, the primary creates a new TX flow table, which is aliased by the
secondaries.
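
For illustration, the TX side boils down to the following sequence
(simplified sketch of the commands used below; attribute setup and
error handling omitted):

    /* Primary: create an egress flow table, allow cross-vhca access. */
    ft = mlx5_create_flow_table(egress_ns, &ft_attr);
    mlx5_cmd_allow_other_vhca_access(primary, &allow_attr);

    /* Secondary: go silent, alias the primary's table, use as TX root. */
    mlx5_fs_cmd_set_l2table_entry_silent(secondary, 1);
    mlx5_cmd_alias_obj_create(secondary, &alias_attr, &alias_obj_id);
    mlx5_fs_cmd_set_tx_flow_table_root(secondary, alias_obj_id, false);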

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 185 +++++++++++++++++-
 1 file changed, 184 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 3059a3750f82..76c2426c2498 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -4,6 +4,7 @@
 #include "lib/sd.h"
 #include "mlx5_core.h"
 #include "lib/mlx5.h"
+#include "fs_cmd.h"
 #include <linux/mlx5/vport.h>
 
 #define sd_info(__dev, format, ...) \
@@ -19,9 +20,11 @@ struct mlx5_sd {
 	union {
 		struct { /* primary */
 			struct mlx5_core_dev *secondaries[MLX5_SD_MAX_GROUP_SZ - 1];
+			struct mlx5_flow_table *tx_ft;
 		};
 		struct { /* secondary */
 			struct mlx5_core_dev *primary_dev;
+			u32 alias_obj_id;
 		};
 	};
 };
@@ -78,6 +81,21 @@ struct mlx5_core_dev *mlx5_sd_ch_ix_get_dev(struct mlx5_core_dev *primary, int c
 	return mlx5_sd_primary_get_peer(primary, mdev_idx);
 }
 
+static bool ft_create_alias_supported(struct mlx5_core_dev *dev)
+{
+	u64 obj_allowed = MLX5_CAP_GEN_2_64(dev, allowed_object_for_other_vhca_access);
+	u32 obj_supp = MLX5_CAP_GEN_2(dev, cross_vhca_object_to_object_supported);
+
+	if (!(obj_supp &
+	    MLX5_CROSS_VHCA_OBJ_TO_OBJ_SUPPORTED_LOCAL_FLOW_TABLE_ROOT_TO_REMOTE_FLOW_TABLE))
+		return false;
+
+	if (!(obj_allowed & MLX5_ALLOWED_OBJ_FOR_OTHER_VHCA_ACCESS_FLOW_TABLE))
+		return false;
+
+	return true;
+}
+
 static bool mlx5_sd_is_supported(struct mlx5_core_dev *dev, u8 host_buses)
 {
 	/* Feature is currently implemented for PFs only */
@@ -88,6 +106,24 @@ static bool mlx5_sd_is_supported(struct mlx5_core_dev *dev, u8 host_buses)
 	if (host_buses > MLX5_SD_MAX_GROUP_SZ)
 		return false;
 
+	/* Disconnect secondaries from the network */
+	if (!MLX5_CAP_GEN(dev, eswitch_manager))
+		return false;
+	if (!MLX5_CAP_GEN(dev, silent_mode))
+		return false;
+
+	/* RX steering from primary to secondaries */
+	if (!MLX5_CAP_GEN(dev, cross_vhca_rqt))
+		return false;
+	if (host_buses > MLX5_CAP_GEN_2(dev, max_rqt_vhca_id))
+		return false;
+
+	/* TX steering from secondaries to primary */
+	if (!ft_create_alias_supported(dev))
+		return false;
+	if (!MLX5_CAP_FLOWTABLE_NIC_TX(dev, reset_root_to_default))
+		return false;
+
 	return true;
 }
 
@@ -230,10 +266,122 @@ static void sd_unregister(struct mlx5_core_dev *dev)
 	mlx5_devcom_unregister_component(sd->devcom);
 }
 
+static int sd_cmd_set_primary(struct mlx5_core_dev *primary, u8 *alias_key)
+{
+	struct mlx5_cmd_allow_other_vhca_access_attr allow_attr = {};
+	struct mlx5_sd *sd = mlx5_get_sd(primary);
+	struct mlx5_flow_table_attr ft_attr = {};
+	struct mlx5_flow_namespace *nic_ns;
+	struct mlx5_flow_table *ft;
+	int err;
+
+	nic_ns = mlx5_get_flow_namespace(primary, MLX5_FLOW_NAMESPACE_EGRESS);
+	if (!nic_ns)
+		return -EOPNOTSUPP;
+
+	ft = mlx5_create_flow_table(nic_ns, &ft_attr);
+	if (IS_ERR(ft)) {
+		err = PTR_ERR(ft);
+		return err;
+	}
+	sd->tx_ft = ft;
+	memcpy(allow_attr.access_key, alias_key, ACCESS_KEY_LEN);
+	allow_attr.obj_type = MLX5_GENERAL_OBJECT_TYPES_FLOW_TABLE_ALIAS;
+	allow_attr.obj_id = (ft->type << FT_ID_FT_TYPE_OFFSET) | ft->id;
+
+	err = mlx5_cmd_allow_other_vhca_access(primary, &allow_attr);
+	if (err) {
+		mlx5_core_err(primary, "Failed to allow other vhca access err=%d\n",
+			      err);
+		mlx5_destroy_flow_table(ft);
+		return err;
+	}
+
+	return 0;
+}
+
+static void sd_cmd_unset_primary(struct mlx5_core_dev *primary)
+{
+	struct mlx5_sd *sd = mlx5_get_sd(primary);
+
+	mlx5_destroy_flow_table(sd->tx_ft);
+}
+
+static int sd_secondary_create_alias_ft(struct mlx5_core_dev *secondary,
+					struct mlx5_core_dev *primary,
+					struct mlx5_flow_table *ft,
+					u32 *obj_id, u8 *alias_key)
+{
+	u32 aliased_object_id = (ft->type << FT_ID_FT_TYPE_OFFSET) | ft->id;
+	u16 vhca_id_to_be_accessed = MLX5_CAP_GEN(primary, vhca_id);
+	struct mlx5_cmd_alias_obj_create_attr alias_attr = {};
+	int ret;
+
+	memcpy(alias_attr.access_key, alias_key, ACCESS_KEY_LEN);
+	alias_attr.obj_id = aliased_object_id;
+	alias_attr.obj_type = MLX5_GENERAL_OBJECT_TYPES_FLOW_TABLE_ALIAS;
+	alias_attr.vhca_id = vhca_id_to_be_accessed;
+	ret = mlx5_cmd_alias_obj_create(secondary, &alias_attr, obj_id);
+	if (ret) {
+		mlx5_core_err(secondary, "Failed to create alias object err=%d\n",
+			      ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void sd_secondary_destroy_alias_ft(struct mlx5_core_dev *secondary)
+{
+	struct mlx5_sd *sd = mlx5_get_sd(secondary);
+
+	mlx5_cmd_alias_obj_destroy(secondary, sd->alias_obj_id,
+				   MLX5_GENERAL_OBJECT_TYPES_FLOW_TABLE_ALIAS);
+}
+
+static int sd_cmd_set_secondary(struct mlx5_core_dev *secondary,
+				struct mlx5_core_dev *primary,
+				u8 *alias_key)
+{
+	struct mlx5_sd *primary_sd = mlx5_get_sd(primary);
+	struct mlx5_sd *sd = mlx5_get_sd(secondary);
+	int err;
+
+	err = mlx5_fs_cmd_set_l2table_entry_silent(secondary, 1);
+	if (err)
+		return err;
+
+	err = sd_secondary_create_alias_ft(secondary, primary, primary_sd->tx_ft,
+					   &sd->alias_obj_id, alias_key);
+	if (err)
+		goto err_unset_silent;
+
+	err = mlx5_fs_cmd_set_tx_flow_table_root(secondary, sd->alias_obj_id, false);
+	if (err)
+		goto err_destroy_alias_ft;
+
+	return 0;
+
+err_destroy_alias_ft:
+	sd_secondary_destroy_alias_ft(secondary);
+err_unset_silent:
+	mlx5_fs_cmd_set_l2table_entry_silent(secondary, 0);
+	return err;
+}
+
+static void sd_cmd_unset_secondary(struct mlx5_core_dev *secondary)
+{
+	mlx5_fs_cmd_set_tx_flow_table_root(secondary, 0, true);
+	sd_secondary_destroy_alias_ft(secondary);
+	mlx5_fs_cmd_set_l2table_entry_silent(secondary, 0);
+}
+
 int mlx5_sd_init(struct mlx5_core_dev *dev)
 {
+	struct mlx5_core_dev *primary, *pos, *to;
 	struct mlx5_sd *sd = mlx5_get_sd(dev);
-	int err;
+	u8 alias_key[ACCESS_KEY_LEN];
+	int err, i;
 
 	err = sd_init(dev);
 	if (err)
@@ -247,8 +395,33 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	if (err)
 		goto err_sd_cleanup;
 
+	if (!mlx5_devcom_comp_is_ready(sd->devcom))
+		return 0;
+
+	primary = mlx5_sd_get_primary(dev);
+
+	for (i = 0; i < ACCESS_KEY_LEN; i++)
+		alias_key[i] = get_random_u8();
+
+	err = sd_cmd_set_primary(primary, alias_key);
+	if (err)
+		goto err_sd_unregister;
+
+	mlx5_sd_for_each_secondary(i, primary, pos) {
+		err = sd_cmd_set_secondary(pos, primary, alias_key);
+		if (err)
+			goto err_unset_secondaries;
+	}
+
 	return 0;
 
+err_unset_secondaries:
+	to = pos;
+	mlx5_sd_for_each_secondary_to(i, primary, to, pos)
+		sd_cmd_unset_secondary(pos);
+	sd_cmd_unset_primary(primary);
+err_sd_unregister:
+	sd_unregister(dev);
 err_sd_cleanup:
 	sd_cleanup(dev);
 	return err;
@@ -257,10 +430,20 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 {
 	struct mlx5_sd *sd = mlx5_get_sd(dev);
+	struct mlx5_core_dev *primary, *pos;
+	int i;
 
 	if (!sd)
 		return;
 
+	if (!mlx5_devcom_comp_is_ready(sd->devcom))
+		goto out;
+
+	primary = mlx5_sd_get_primary(dev);
+	mlx5_sd_for_each_secondary(i, primary, pos)
+		sd_cmd_unset_secondary(pos);
+	sd_cmd_unset_primary(primary);
+out:
 	sd_unregister(dev);
 	sd_cleanup(dev);
 }
-- 
2.43.0



* [net-next V3 06/15] net/mlx5: SD, Add informative prints in kernel log
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (4 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 05/15] net/mlx5: SD, Implement steering for primary and secondaries Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 07/15] net/mlx5: SD, Add debugfs Saeed Mahameed
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Print to the kernel log when an SD group moves into or out of the ready
state.
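
For illustration, the resulting output looks roughly as follows
(hypothetical devices, values matching the debugfs example later in the
series):

  mlx5_core 0000:08:00.0: Socket-Direct: group id 0x101, size 2, combined
  mlx5_core 0000:08:00.0: Socket-Direct: group id 0x101, primary 0000:08:00.0, vhca 0x0
  mlx5_core 0000:08:00.0: Socket-Direct: group id 0x101, secondary_0 0000:09:00.0, vhca 0x2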

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 21 +++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 76c2426c2498..918138c13a92 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -376,6 +376,21 @@ static void sd_cmd_unset_secondary(struct mlx5_core_dev *secondary)
 	mlx5_fs_cmd_set_l2table_entry_silent(secondary, 0);
 }
 
+static void sd_print_group(struct mlx5_core_dev *primary)
+{
+	struct mlx5_sd *sd = mlx5_get_sd(primary);
+	struct mlx5_core_dev *pos;
+	int i;
+
+	sd_info(primary, "group id %#x, primary %s, vhca %#x\n",
+		sd->group_id, pci_name(primary->pdev),
+		MLX5_CAP_GEN(primary, vhca_id));
+	mlx5_sd_for_each_secondary(i, primary, pos)
+		sd_info(primary, "group id %#x, secondary_%d %s, vhca %#x\n",
+			sd->group_id, i - 1, pci_name(pos->pdev),
+			MLX5_CAP_GEN(pos, vhca_id));
+}
+
 int mlx5_sd_init(struct mlx5_core_dev *dev)
 {
 	struct mlx5_core_dev *primary, *pos, *to;
@@ -413,6 +428,10 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 			goto err_unset_secondaries;
 	}
 
+	sd_info(primary, "group id %#x, size %d, combined\n",
+		sd->group_id, mlx5_devcom_comp_get_size(sd->devcom));
+	sd_print_group(primary);
+
 	return 0;
 
 err_unset_secondaries:
@@ -443,6 +462,8 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 	mlx5_sd_for_each_secondary(i, primary, pos)
 		sd_cmd_unset_secondary(pos);
 	sd_cmd_unset_primary(primary);
+
+	sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
 out:
 	sd_unregister(dev);
 	sd_cleanup(dev);
-- 
2.43.0



* [net-next V3 07/15] net/mlx5: SD, Add debugfs
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (5 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 06/15] net/mlx5: SD, Add informative prints in kernel log Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 08/15] net/mlx5e: Create single netdev per SD group Saeed Mahameed
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Add debugfs entries that describe the Socket-Direct group.

Example:
$ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/sd/*
/sys/kernel/debug/mlx5/0000:08:00.0/sd/group_id:0x00000101
/sys/kernel/debug/mlx5/0000:08:00.0/sd/primary:0000:08:00.0 vhca 0x0
/sys/kernel/debug/mlx5/0000:08:00.0/sd/secondary_0:0000:09:00.0 vhca 0x2

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 918138c13a92..5012510a69d5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
@@ -6,6 +6,7 @@
 #include "lib/mlx5.h"
 #include "fs_cmd.h"
 #include <linux/mlx5/vport.h>
+#include <linux/debugfs.h>
 
 #define sd_info(__dev, format, ...) \
 	dev_info((__dev)->device, "Socket-Direct: " format, ##__VA_ARGS__)
@@ -16,6 +17,7 @@ struct mlx5_sd {
 	u32 group_id;
 	u8 host_buses;
 	struct mlx5_devcom_comp_dev *devcom;
+	struct dentry *dfs;
 	bool primary;
 	union {
 		struct { /* primary */
@@ -391,6 +393,26 @@ static void sd_print_group(struct mlx5_core_dev *primary)
 			MLX5_CAP_GEN(pos, vhca_id));
 }
 
+static ssize_t dev_read(struct file *filp, char __user *buf, size_t count,
+			loff_t *pos)
+{
+	struct mlx5_core_dev *dev;
+	char tbuf[32];
+	int ret;
+
+	dev = filp->private_data;
+	ret = snprintf(tbuf, sizeof(tbuf), "%s vhca %#x\n", pci_name(dev->pdev),
+		       MLX5_CAP_GEN(dev, vhca_id));
+
+	return simple_read_from_buffer(buf, count, pos, tbuf, ret);
+}
+
+static const struct file_operations dev_fops = {
+	.owner	= THIS_MODULE,
+	.open	= simple_open,
+	.read	= dev_read,
+};
+
 int mlx5_sd_init(struct mlx5_core_dev *dev)
 {
 	struct mlx5_core_dev *primary, *pos, *to;
@@ -422,10 +444,20 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	if (err)
 		goto err_sd_unregister;
 
+	sd->dfs = debugfs_create_dir("sd", mlx5_debugfs_get_dev_root(primary));
+	debugfs_create_x32("group_id", 0400, sd->dfs, &sd->group_id);
+	debugfs_create_file("primary", 0400, sd->dfs, primary, &dev_fops);
+
 	mlx5_sd_for_each_secondary(i, primary, pos) {
+		char name[32];
+
 		err = sd_cmd_set_secondary(pos, primary, alias_key);
 		if (err)
 			goto err_unset_secondaries;
+
+		snprintf(name, sizeof(name), "secondary_%d", i - 1);
+		debugfs_create_file(name, 0400, sd->dfs, pos, &dev_fops);
+
 	}
 
 	sd_info(primary, "group id %#x, size %d, combined\n",
@@ -439,6 +471,7 @@ int mlx5_sd_init(struct mlx5_core_dev *dev)
 	mlx5_sd_for_each_secondary_to(i, primary, to, pos)
 		sd_cmd_unset_secondary(pos);
 	sd_cmd_unset_primary(primary);
+	debugfs_remove_recursive(sd->dfs);
 err_sd_unregister:
 	sd_unregister(dev);
 err_sd_cleanup:
@@ -462,6 +495,7 @@ void mlx5_sd_cleanup(struct mlx5_core_dev *dev)
 	mlx5_sd_for_each_secondary(i, primary, pos)
 		sd_cmd_unset_secondary(pos);
 	sd_cmd_unset_primary(primary);
+	debugfs_remove_recursive(sd->dfs);
 
 	sd_info(primary, "group id %#x, uncombined\n", sd->group_id);
 out:
-- 
2.43.0



* [net-next V3 08/15] net/mlx5e: Create single netdev per SD group
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (6 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 07/15] net/mlx5: SD, Add debugfs Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 09/15] net/mlx5e: Create EN core HW resources for all secondary devices Saeed Mahameed
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Integrate the SD library calls into the auxiliary_driver ops in
preparation for creating a single netdev for the multiple PFs belonging
to the same SD group.

SD is still disabled at this stage. It is enabled by a downstream patch
when all needed parts are implemented.

The netdev is created once the SD group, with all its participants, is
ready. It is later destroyed whenever any of the participating PFs is
removed.
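
For illustration, each auxiliary op follows the same wrapping pattern
(taken from mlx5e_probe() below): set up SD, then forward the call to
the group's "actual" auxiliary device, which belongs to the primary:

    err = mlx5_sd_init(mdev);
    if (err)
        return err;

    actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
    if (actual_adev)
        return _mlx5e_probe(actual_adev);
    return 0;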

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 69 +++++++++++++++++--
 1 file changed, 62 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index be809556b2e1..ef6a342742a2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -70,6 +70,7 @@
 #include "qos.h"
 #include "en/trap.h"
 #include "lib/devcom.h"
+#include "lib/sd.h"
 
 bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev, u8 page_shift,
 					    enum mlx5e_mpwrq_umr_mode umr_mode)
@@ -5987,7 +5988,7 @@ void mlx5e_destroy_netdev(struct mlx5e_priv *priv)
 	free_netdev(netdev);
 }
 
-static int mlx5e_resume(struct auxiliary_device *adev)
+static int _mlx5e_resume(struct auxiliary_device *adev)
 {
 	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
 	struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
@@ -6012,6 +6013,23 @@ static int mlx5e_resume(struct auxiliary_device *adev)
 	return 0;
 }
 
+static int mlx5e_resume(struct auxiliary_device *adev)
+{
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+	int err;
+
+	err = mlx5_sd_init(mdev);
+	if (err)
+		return err;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		return _mlx5e_resume(actual_adev);
+	return 0;
+}
+
 static int _mlx5e_suspend(struct auxiliary_device *adev)
 {
 	struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
@@ -6032,7 +6050,17 @@ static int _mlx5e_suspend(struct auxiliary_device *adev)
 
 static int mlx5e_suspend(struct auxiliary_device *adev, pm_message_t state)
 {
-	return _mlx5e_suspend(adev);
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+	int err = 0;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		err = _mlx5e_suspend(actual_adev);
+
+	mlx5_sd_cleanup(mdev);
+	return err;
 }
 
 static int _mlx5e_probe(struct auxiliary_device *adev)
@@ -6078,9 +6106,9 @@ static int _mlx5e_probe(struct auxiliary_device *adev)
 		goto err_destroy_netdev;
 	}
 
-	err = mlx5e_resume(adev);
+	err = _mlx5e_resume(adev);
 	if (err) {
-		mlx5_core_err(mdev, "mlx5e_resume failed, %d\n", err);
+		mlx5_core_err(mdev, "_mlx5e_resume failed, %d\n", err);
 		goto err_profile_cleanup;
 	}
 
@@ -6111,15 +6139,29 @@ static int _mlx5e_probe(struct auxiliary_device *adev)
 static int mlx5e_probe(struct auxiliary_device *adev,
 		       const struct auxiliary_device_id *id)
 {
-	return _mlx5e_probe(adev);
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+	int err;
+
+	err = mlx5_sd_init(mdev);
+	if (err)
+		return err;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		return _mlx5e_probe(actual_adev);
+	return 0;
 }
 
-static void mlx5e_remove(struct auxiliary_device *adev)
+static void _mlx5e_remove(struct auxiliary_device *adev)
 {
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
 	struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
 	struct mlx5e_priv *priv = mlx5e_dev->priv;
+	struct mlx5_core_dev *mdev = edev->mdev;
 
-	mlx5_core_uplink_netdev_set(priv->mdev, NULL);
+	mlx5_core_uplink_netdev_set(mdev, NULL);
 	mlx5e_dcbnl_delete_app(priv);
 	unregister_netdev(priv->netdev);
 	_mlx5e_suspend(adev);
@@ -6129,6 +6171,19 @@ static void mlx5e_remove(struct auxiliary_device *adev)
 	mlx5e_destroy_devlink(mlx5e_dev);
 }
 
+static void mlx5e_remove(struct auxiliary_device *adev)
+{
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		_mlx5e_remove(actual_adev);
+
+	mlx5_sd_cleanup(mdev);
+}
+
 static const struct auxiliary_device_id mlx5e_id_table[] = {
 	{ .name = MLX5_ADEV_NAME ".eth", },
 	{},
-- 
2.43.0



* [net-next V3 09/15] net/mlx5e: Create EN core HW resources for all secondary devices
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (7 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 08/15] net/mlx5e: Create single netdev per SD group Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 10/15] net/mlx5e: Let channels be SD-aware Saeed Mahameed
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Traffic queues will be created on all devices, including the
secondaries. Create the needed core layer resources for them as well.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 32 +++++++++++++------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 55c6ace0acd5..6c143088e247 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -60,6 +60,7 @@
 #include "lib/clock.h"
 #include "en/rx_res.h"
 #include "en/selq.h"
+#include "lib/sd.h"
 
 extern const struct net_device_ops mlx5e_netdev_ops;
 struct page_pool;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ef6a342742a2..c6c406c18b54 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5995,22 +5995,29 @@ static int _mlx5e_resume(struct auxiliary_device *adev)
 	struct mlx5e_priv *priv = mlx5e_dev->priv;
 	struct net_device *netdev = priv->netdev;
 	struct mlx5_core_dev *mdev = edev->mdev;
-	int err;
+	struct mlx5_core_dev *pos, *to;
+	int err, i;
 
 	if (netif_device_present(netdev))
 		return 0;
 
-	err = mlx5e_create_mdev_resources(mdev, true);
-	if (err)
-		return err;
+	mlx5_sd_for_each_dev(i, mdev, pos) {
+		err = mlx5e_create_mdev_resources(pos, true);
+		if (err)
+			goto err_destroy_mdev_res;
+	}
 
 	err = mlx5e_attach_netdev(priv);
-	if (err) {
-		mlx5e_destroy_mdev_resources(mdev);
-		return err;
-	}
+	if (err)
+		goto err_destroy_mdev_res;
 
 	return 0;
+
+err_destroy_mdev_res:
+	to = pos;
+	mlx5_sd_for_each_dev_to(i, mdev, to, pos)
+		mlx5e_destroy_mdev_resources(pos);
+	return err;
 }
 
 static int mlx5e_resume(struct auxiliary_device *adev)
@@ -6036,15 +6043,20 @@ static int _mlx5e_suspend(struct auxiliary_device *adev)
 	struct mlx5e_priv *priv = mlx5e_dev->priv;
 	struct net_device *netdev = priv->netdev;
 	struct mlx5_core_dev *mdev = priv->mdev;
+	struct mlx5_core_dev *pos;
+	int i;
 
 	if (!netif_device_present(netdev)) {
 		if (test_bit(MLX5E_STATE_DESTROYING, &priv->state))
-			mlx5e_destroy_mdev_resources(mdev);
+			mlx5_sd_for_each_dev(i, mdev, pos)
+				mlx5e_destroy_mdev_resources(pos);
 		return -ENODEV;
 	}
 
 	mlx5e_detach_netdev(priv);
-	mlx5e_destroy_mdev_resources(mdev);
+	mlx5_sd_for_each_dev(i, mdev, pos)
+		mlx5e_destroy_mdev_resources(pos);
+
 	return 0;
 }
 
-- 
2.43.0



* [net-next V3 10/15] net/mlx5e: Let channels be SD-aware
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (8 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 09/15] net/mlx5e: Create EN core HW resources for all secondary devices Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 11/15] net/mlx5e: Support cross-vhca RSS Saeed Mahameed
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Distribute the channels between the different SD devices to achieve
NUMA-local performance on multiple NUMA nodes.

Each channel works against one specific mdev, creating all datapath
queues against it.

We distribute channels to mdevs in a round-robin policy.

Example for 2 mdevs and 6 channels:
+-------+---------+
| ch ix | mdev ix |
+-------+---------+
|   0   |    0    |
|   1   |    1    |
|   2   |    0    |
|   3   |    1    |
|   4   |    0    |
|   5   |    1    |
+-------+---------+

This round-robin distribution policy is preferred over another
intuitive distribution that was suggested, in which we would first
assign one half of the channels to mdev #0 and then the second half to
mdev #1.

We prefer round-robin for a reason: it is less influenced by changes in
the number of channels. The mapping between channel index and mdev is
fixed, no matter how many channels the user configures. As the channel
stats persist across channel closure, changing the mapping every time
the channel count changes would make the accumulated stats less
representative of the channel's history.

Per-channel objects should stop using the primary mdev (priv->mdev)
directly, and instead move to using their own channel's mdev.
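
For illustration, a channel picks its own mdev and per-device vector
index as follows (simplified from mlx5e_open_channel() below):

    mdev   = mlx5_sd_ch_ix_get_dev(priv->mdev, ix);
    vec_ix = mlx5_sd_ch_ix_get_vec_ix(mdev, ix);
    cpu    = mlx5_comp_vector_get_cpu(mdev, vec_ix);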

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
 .../ethernet/mellanox/mlx5/core/en/params.c   |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en/qos.c  |  8 ++---
 .../mellanox/mlx5/core/en/reporter_rx.c       |  4 +--
 .../mellanox/mlx5/core/en/reporter_tx.c       |  3 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/pool.c |  6 ++--
 .../mellanox/mlx5/core/en_accel/ktls_rx.c     |  6 ++--
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 32 ++++++++++++-------
 8 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 6c143088e247..f6e78c465c7a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -792,6 +792,7 @@ struct mlx5e_channel {
 	struct hwtstamp_config    *tstamp;
 	DECLARE_BITMAP(state, MLX5E_CHANNEL_NUM_STATES);
 	int                        ix;
+	int                        vec_ix;
 	int                        cpu;
 	/* Sync between icosq recovery and XSK enable/disable. */
 	struct mutex               icosq_recovery_lock;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 5757f4f10c12..8b99cc11f138 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -688,7 +688,7 @@ void mlx5e_build_create_cq_param(struct mlx5e_create_cq_param *ccp, struct mlx5e
 		.napi = &c->napi,
 		.ch_stats = c->stats,
 		.node = cpu_to_node(c->cpu),
-		.ix = c->ix,
+		.ix = c->vec_ix,
 	};
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/en/qos.c
index 34adf8c3f81a..e87e26f2c669 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/qos.c
@@ -122,8 +122,8 @@ int mlx5e_open_qos_sq(struct mlx5e_priv *priv, struct mlx5e_channels *chs,
 
 	memset(&param_sq, 0, sizeof(param_sq));
 	memset(&param_cq, 0, sizeof(param_cq));
-	mlx5e_build_sq_param(priv->mdev, params, &param_sq);
-	mlx5e_build_tx_cq_param(priv->mdev, params, &param_cq);
+	mlx5e_build_sq_param(c->mdev, params, &param_sq);
+	mlx5e_build_tx_cq_param(c->mdev, params, &param_cq);
 	err = mlx5e_open_cq(c->mdev, params->tx_cq_moderation, &param_cq, &ccp, &sq->cq);
 	if (err)
 		goto err_free_sq;
@@ -176,7 +176,7 @@ int mlx5e_activate_qos_sq(void *data, u16 node_qid, u32 hw_id)
 	 */
 	smp_wmb();
 
-	qos_dbg(priv->mdev, "Activate QoS SQ qid %u\n", node_qid);
+	qos_dbg(sq->mdev, "Activate QoS SQ qid %u\n", node_qid);
 	mlx5e_activate_txqsq(sq);
 
 	return 0;
@@ -190,7 +190,7 @@ void mlx5e_deactivate_qos_sq(struct mlx5e_priv *priv, u16 qid)
 	if (!sq) /* Handle the case when the SQ failed to open. */
 		return;
 
-	qos_dbg(priv->mdev, "Deactivate QoS SQ qid %u\n", qid);
+	qos_dbg(sq->mdev, "Deactivate QoS SQ qid %u\n", qid);
 	mlx5e_deactivate_txqsq(sq);
 
 	priv->txq2sq[mlx5e_qid_from_qos(&priv->channels, qid)] = NULL;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
index 4358798d6ce1..25d751eba99b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
@@ -294,8 +294,8 @@ static void mlx5e_rx_reporter_diagnose_generic_rq(struct mlx5e_rq *rq,
 
 	params = &priv->channels.params;
 	rq_sz = mlx5e_rqwq_get_size(rq);
-	real_time =  mlx5_is_real_time_rq(priv->mdev);
-	rq_stride = BIT(mlx5e_mpwqe_get_log_stride_size(priv->mdev, params, NULL));
+	real_time =  mlx5_is_real_time_rq(rq->mdev);
+	rq_stride = BIT(mlx5e_mpwqe_get_log_stride_size(rq->mdev, params, NULL));
 
 	mlx5e_health_fmsg_named_obj_nest_start(fmsg, "RQ");
 	devlink_fmsg_u8_pair_put(fmsg, "type", params->rq_wq_type);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index 6b44ddce14e9..0ab9db319530 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -219,7 +219,6 @@ mlx5e_tx_reporter_build_diagnose_output_sq_common(struct devlink_fmsg *fmsg,
 						  struct mlx5e_txqsq *sq, int tc)
 {
 	bool stopped = netif_xmit_stopped(sq->txq);
-	struct mlx5e_priv *priv = sq->priv;
 	u8 state;
 	int err;
 
@@ -227,7 +226,7 @@ mlx5e_tx_reporter_build_diagnose_output_sq_common(struct devlink_fmsg *fmsg,
 	devlink_fmsg_u32_pair_put(fmsg, "txq ix", sq->txq_ix);
 	devlink_fmsg_u32_pair_put(fmsg, "sqn", sq->sqn);
 
-	err = mlx5_core_query_sq_state(priv->mdev, sq->sqn, &state);
+	err = mlx5_core_query_sq_state(sq->mdev, sq->sqn, &state);
 	if (!err)
 		devlink_fmsg_u8_pair_put(fmsg, "HW state", state);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c
index ebada0c5af3c..db776e515b6a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c
@@ -6,10 +6,10 @@
 #include "setup.h"
 #include "en/params.h"
 
-static int mlx5e_xsk_map_pool(struct mlx5e_priv *priv,
+static int mlx5e_xsk_map_pool(struct mlx5_core_dev *mdev,
 			      struct xsk_buff_pool *pool)
 {
-	struct device *dev = mlx5_core_dma_dev(priv->mdev);
+	struct device *dev = mlx5_core_dma_dev(mdev);
 
 	return xsk_pool_dma_map(pool, dev, DMA_ATTR_SKIP_CPU_SYNC);
 }
@@ -89,7 +89,7 @@ static int mlx5e_xsk_enable_locked(struct mlx5e_priv *priv,
 	if (unlikely(!mlx5e_xsk_is_pool_sane(pool)))
 		return -EINVAL;
 
-	err = mlx5e_xsk_map_pool(priv, pool);
+	err = mlx5e_xsk_map_pool(mlx5_sd_ch_ix_get_dev(priv->mdev, ix), pool);
 	if (unlikely(err))
 		return err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
index 9b597cb24598..65ccb33edafb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
@@ -267,7 +267,7 @@ resync_post_get_progress_params(struct mlx5e_icosq *sq,
 		goto err_out;
 	}
 
-	pdev = mlx5_core_dma_dev(sq->channel->priv->mdev);
+	pdev = mlx5_core_dma_dev(sq->channel->mdev);
 	buf->dma_addr = dma_map_single(pdev, &buf->progress,
 				       PROGRESS_PARAMS_PADDED_SIZE, DMA_FROM_DEVICE);
 	if (unlikely(dma_mapping_error(pdev, buf->dma_addr))) {
@@ -425,14 +425,12 @@ void mlx5e_ktls_handle_get_psv_completion(struct mlx5e_icosq_wqe_info *wi,
 {
 	struct mlx5e_ktls_rx_resync_buf *buf = wi->tls_get_params.buf;
 	struct mlx5e_ktls_offload_context_rx *priv_rx;
-	struct mlx5e_ktls_rx_resync_ctx *resync;
 	u8 tracker_state, auth_state, *ctx;
 	struct device *dev;
 	u32 hw_seq;
 
 	priv_rx = buf->priv_rx;
-	resync = &priv_rx->resync;
-	dev = mlx5_core_dma_dev(resync->priv->mdev);
+	dev = mlx5_core_dma_dev(sq->channel->mdev);
 	if (unlikely(test_bit(MLX5E_PRIV_RX_FLAG_DELETING, priv_rx->flags)))
 		goto out;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c6c406c18b54..8d9b0cdb4e01 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2529,14 +2529,20 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 			      struct xsk_buff_pool *xsk_pool,
 			      struct mlx5e_channel **cp)
 {
-	int cpu = mlx5_comp_vector_get_cpu(priv->mdev, ix);
 	struct net_device *netdev = priv->netdev;
+	struct mlx5_core_dev *mdev;
 	struct mlx5e_xsk_param xsk;
 	struct mlx5e_channel *c;
 	unsigned int irq;
+	int vec_ix;
+	int cpu;
 	int err;
 
-	err = mlx5_comp_irqn_get(priv->mdev, ix, &irq);
+	mdev = mlx5_sd_ch_ix_get_dev(priv->mdev, ix);
+	vec_ix = mlx5_sd_ch_ix_get_vec_ix(mdev, ix);
+	cpu = mlx5_comp_vector_get_cpu(mdev, vec_ix);
+
+	err = mlx5_comp_irqn_get(mdev, vec_ix, &irq);
 	if (err)
 		return err;
 
@@ -2549,18 +2555,19 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 		return -ENOMEM;
 
 	c->priv     = priv;
-	c->mdev     = priv->mdev;
+	c->mdev     = mdev;
 	c->tstamp   = &priv->tstamp;
 	c->ix       = ix;
+	c->vec_ix   = vec_ix;
 	c->cpu      = cpu;
-	c->pdev     = mlx5_core_dma_dev(priv->mdev);
+	c->pdev     = mlx5_core_dma_dev(mdev);
 	c->netdev   = priv->netdev;
-	c->mkey_be  = cpu_to_be32(priv->mdev->mlx5e_res.hw_objs.mkey);
+	c->mkey_be  = cpu_to_be32(mdev->mlx5e_res.hw_objs.mkey);
 	c->num_tc   = mlx5e_get_dcb_num_tc(params);
 	c->xdp      = !!params->xdp_prog;
 	c->stats    = &priv->channel_stats[ix]->ch;
 	c->aff_mask = irq_get_effective_affinity_mask(irq);
-	c->lag_port = mlx5e_enumerate_lag_port(priv->mdev, ix);
+	c->lag_port = mlx5e_enumerate_lag_port(mdev, ix);
 
 	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll);
 	netif_napi_set_irq(&c->napi, irq);
@@ -2943,15 +2950,18 @@ static MLX5E_DEFINE_PREACTIVATE_WRAPPER_CTX(mlx5e_update_netdev_queues);
 static void mlx5e_set_default_xps_cpumasks(struct mlx5e_priv *priv,
 					   struct mlx5e_params *params)
 {
-	struct mlx5_core_dev *mdev = priv->mdev;
-	int num_comp_vectors, ix, irq;
-
-	num_comp_vectors = mlx5_comp_vectors_max(mdev);
+	int ix;
 
 	for (ix = 0; ix < params->num_channels; ix++) {
+		int num_comp_vectors, irq, vec_ix;
+		struct mlx5_core_dev *mdev;
+
+		mdev = mlx5_sd_ch_ix_get_dev(priv->mdev, ix);
+		num_comp_vectors = mlx5_comp_vectors_max(mdev);
 		cpumask_clear(priv->scratchpad.cpumask);
+		vec_ix = mlx5_sd_ch_ix_get_vec_ix(mdev, ix);
 
-		for (irq = ix; irq < num_comp_vectors; irq += params->num_channels) {
+		for (irq = vec_ix; irq < num_comp_vectors; irq += params->num_channels) {
 			int cpu = mlx5_comp_vector_get_cpu(mdev, irq);
 
 			cpumask_set_cpu(cpu, priv->scratchpad.cpumask);
-- 
2.43.0



* [net-next V3 11/15] net/mlx5e: Support cross-vhca RSS
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (9 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 10/15] net/mlx5e: Let channels be SD-aware Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 12/15] net/mlx5e: Support per-mdev queue counter Saeed Mahameed
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Implement driver support for the HW feature that allows one device's RX
steering to target another device's RQs.

In SD multi-PF netdev mode, we set the secondaries into silent mode,
disconnecting them from the network. This feature is then used to steer
traffic from the primary to the secondaries.
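
For illustration, a minimal caller sketch of the extended API (the helper
name, the 64-entry bound, and the primary_mdev/nch/hfunc/indir parameters
are hypothetical, not taken from this patch):

/* Hypothetical helper, for illustration only: build an indirect RQT whose
 * entries may point at RQs owned by other VHCAs in the SD group.
 */
static int sd_build_cross_vhca_rqt(struct mlx5e_rqt *rqt,
				   struct mlx5_core_dev *primary_mdev,
				   struct mlx5e_channels *chs,
				   unsigned int nch, u8 hfunc,
				   struct mlx5e_rss_params_indir *indir)
{
	u32 rqns[64], vhca_ids[64];	/* 64: illustrative bound only */
	unsigned int ix;

	if (nch > ARRAY_SIZE(rqns))
		return -EINVAL;

	for (ix = 0; ix < nch; ix++)
		mlx5e_channels_get_regular_rqn(chs, ix, &rqns[ix],
					       &vhca_ids[ix]);

	/* Passing vhca_ids == NULL keeps the legacy rq_num[] entry format. */
	return mlx5e_rqt_init_indir(rqt, primary_mdev, rqns, vhca_ids,
				    nch, hfunc, indir);
}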

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/channels.c |  10 +-
 .../ethernet/mellanox/mlx5/core/en/channels.h |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en/rqt.c  | 123 ++++++++++++++----
 .../net/ethernet/mellanox/mlx5/core/en/rqt.h  |   9 +-
 .../net/ethernet/mellanox/mlx5/core/en/rss.c  |  17 +--
 .../net/ethernet/mellanox/mlx5/core/en/rss.h  |   4 +-
 .../ethernet/mellanox/mlx5/core/en/rx_res.c   |  62 ++++++---
 .../ethernet/mellanox/mlx5/core/en/rx_res.h   |   1 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   2 +-
 10 files changed, 179 insertions(+), 57 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/channels.c b/drivers/net/ethernet/mellanox/mlx5/core/en/channels.c
index 48581ea3adcb..874a1016623c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/channels.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/channels.c
@@ -23,20 +23,26 @@ bool mlx5e_channels_is_xsk(struct mlx5e_channels *chs, unsigned int ix)
 	return test_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
 }
 
-void mlx5e_channels_get_regular_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn)
+void mlx5e_channels_get_regular_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn,
+				    u32 *vhca_id)
 {
 	struct mlx5e_channel *c = mlx5e_channels_get(chs, ix);
 
 	*rqn = c->rq.rqn;
+	if (vhca_id)
+		*vhca_id = MLX5_CAP_GEN(c->mdev, vhca_id);
 }
 
-void mlx5e_channels_get_xsk_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn)
+void mlx5e_channels_get_xsk_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn,
+				u32 *vhca_id)
 {
 	struct mlx5e_channel *c = mlx5e_channels_get(chs, ix);
 
 	WARN_ON_ONCE(!test_bit(MLX5E_CHANNEL_STATE_XSK, c->state));
 
 	*rqn = c->xskrq.rqn;
+	if (vhca_id)
+		*vhca_id = MLX5_CAP_GEN(c->mdev, vhca_id);
 }
 
 bool mlx5e_channels_get_ptp_rqn(struct mlx5e_channels *chs, u32 *rqn)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/channels.h b/drivers/net/ethernet/mellanox/mlx5/core/en/channels.h
index 637ca90daaa8..6715aa9383b9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/channels.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/channels.h
@@ -10,8 +10,10 @@ struct mlx5e_channels;
 
 unsigned int mlx5e_channels_get_num(struct mlx5e_channels *chs);
 bool mlx5e_channels_is_xsk(struct mlx5e_channels *chs, unsigned int ix);
-void mlx5e_channels_get_regular_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn);
-void mlx5e_channels_get_xsk_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn);
+void mlx5e_channels_get_regular_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn,
+				    u32 *vhca_id);
+void mlx5e_channels_get_xsk_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn,
+				u32 *vhca_id);
 bool mlx5e_channels_get_ptp_rqn(struct mlx5e_channels *chs, u32 *rqn);
 
 #endif /* __MLX5_EN_CHANNELS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rqt.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rqt.c
index 7b8ff7a71003..bcafb4bf9415 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rqt.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rqt.c
@@ -4,6 +4,33 @@
 #include "rqt.h"
 #include <linux/mlx5/transobj.h>
 
+static bool verify_num_vhca_ids(struct mlx5_core_dev *mdev, u32 *vhca_ids,
+				unsigned int size)
+{
+	unsigned int max_num_vhca_id = MLX5_CAP_GEN_2(mdev, max_rqt_vhca_id);
+	int i;
+
+	/* Verify that all vhca_ids are in range [0, max_num_vhca_ids - 1] */
+	for (i = 0; i < size; i++)
+		if (vhca_ids[i] >= max_num_vhca_id)
+			return false;
+	return true;
+}
+
+static bool rqt_verify_vhca_ids(struct mlx5_core_dev *mdev, u32 *vhca_ids,
+				unsigned int size)
+{
+	if (!vhca_ids)
+		return true;
+
+	if (!MLX5_CAP_GEN(mdev, cross_vhca_rqt))
+		return false;
+	if (!verify_num_vhca_ids(mdev, vhca_ids, size))
+		return false;
+
+	return true;
+}
+
 void mlx5e_rss_params_indir_init_uniform(struct mlx5e_rss_params_indir *indir,
 					 unsigned int num_channels)
 {
@@ -13,19 +40,38 @@ void mlx5e_rss_params_indir_init_uniform(struct mlx5e_rss_params_indir *indir,
 		indir->table[i] = i % num_channels;
 }
 
+static void fill_rqn_list(void *rqtc, u32 *rqns, u32 *vhca_ids, unsigned int size)
+{
+	unsigned int i;
+
+	if (vhca_ids) {
+		MLX5_SET(rqtc, rqtc, rq_vhca_id_format, 1);
+		for (i = 0; i < size; i++) {
+			MLX5_SET(rqtc, rqtc, rq_vhca[i].rq_num, rqns[i]);
+			MLX5_SET(rqtc, rqtc, rq_vhca[i].rq_vhca_id, vhca_ids[i]);
+		}
+	} else {
+		for (i = 0; i < size; i++)
+			MLX5_SET(rqtc, rqtc, rq_num[i], rqns[i]);
+	}
+}
 static int mlx5e_rqt_init(struct mlx5e_rqt *rqt, struct mlx5_core_dev *mdev,
-			  u16 max_size, u32 *init_rqns, u16 init_size)
+			  u16 max_size, u32 *init_rqns, u32 *init_vhca_ids, u16 init_size)
 {
+	int entry_sz;
 	void *rqtc;
 	int inlen;
 	int err;
 	u32 *in;
-	int i;
+
+	if (!rqt_verify_vhca_ids(mdev, init_vhca_ids, init_size))
+		return -EOPNOTSUPP;
 
 	rqt->mdev = mdev;
 	rqt->size = max_size;
 
-	inlen = MLX5_ST_SZ_BYTES(create_rqt_in) + sizeof(u32) * init_size;
+	entry_sz = init_vhca_ids ? MLX5_ST_SZ_BYTES(rq_vhca) : MLX5_ST_SZ_BYTES(rq_num);
+	inlen = MLX5_ST_SZ_BYTES(create_rqt_in) + entry_sz * init_size;
 	in = kvzalloc(inlen, GFP_KERNEL);
 	if (!in)
 		return -ENOMEM;
@@ -33,10 +79,9 @@ static int mlx5e_rqt_init(struct mlx5e_rqt *rqt, struct mlx5_core_dev *mdev,
 	rqtc = MLX5_ADDR_OF(create_rqt_in, in, rqt_context);
 
 	MLX5_SET(rqtc, rqtc, rqt_max_size, rqt->size);
-
 	MLX5_SET(rqtc, rqtc, rqt_actual_size, init_size);
-	for (i = 0; i < init_size; i++)
-		MLX5_SET(rqtc, rqtc, rq_num[i], init_rqns[i]);
+
+	fill_rqn_list(rqtc, init_rqns, init_vhca_ids, init_size);
 
 	err = mlx5_core_create_rqt(rqt->mdev, in, inlen, &rqt->rqtn);
 
@@ -49,7 +94,7 @@ int mlx5e_rqt_init_direct(struct mlx5e_rqt *rqt, struct mlx5_core_dev *mdev,
 {
 	u16 max_size = indir_enabled ? indir_table_size : 1;
 
-	return mlx5e_rqt_init(rqt, mdev, max_size, &init_rqn, 1);
+	return mlx5e_rqt_init(rqt, mdev, max_size, &init_rqn, NULL, 1);
 }
 
 static int mlx5e_bits_invert(unsigned long a, int size)
@@ -63,7 +108,8 @@ static int mlx5e_bits_invert(unsigned long a, int size)
 	return inv;
 }
 
-static int mlx5e_calc_indir_rqns(u32 *rss_rqns, u32 *rqns, unsigned int num_rqns,
+static int mlx5e_calc_indir_rqns(u32 *rss_rqns, u32 *rqns, u32 *rss_vhca_ids, u32 *vhca_ids,
+				 unsigned int num_rqns,
 				 u8 hfunc, struct mlx5e_rss_params_indir *indir)
 {
 	unsigned int i;
@@ -82,30 +128,42 @@ static int mlx5e_calc_indir_rqns(u32 *rss_rqns, u32 *rqns, unsigned int num_rqns
 			 */
 			return -EINVAL;
 		rss_rqns[i] = rqns[ix];
+		if (vhca_ids)
+			rss_vhca_ids[i] = vhca_ids[ix];
 	}
 
 	return 0;
 }
 
 int mlx5e_rqt_init_indir(struct mlx5e_rqt *rqt, struct mlx5_core_dev *mdev,
-			 u32 *rqns, unsigned int num_rqns,
+			 u32 *rqns, u32 *vhca_ids, unsigned int num_rqns,
 			 u8 hfunc, struct mlx5e_rss_params_indir *indir)
 {
-	u32 *rss_rqns;
+	u32 *rss_rqns, *rss_vhca_ids = NULL;
 	int err;
 
 	rss_rqns = kvmalloc_array(indir->actual_table_size, sizeof(*rss_rqns), GFP_KERNEL);
 	if (!rss_rqns)
 		return -ENOMEM;
 
-	err = mlx5e_calc_indir_rqns(rss_rqns, rqns, num_rqns, hfunc, indir);
+	if (vhca_ids) {
+		rss_vhca_ids = kvmalloc_array(indir->actual_table_size, sizeof(*rss_vhca_ids),
+					      GFP_KERNEL);
+		if (!rss_vhca_ids) {
+			kvfree(rss_rqns);
+			return -ENOMEM;
+		}
+	}
+
+	err = mlx5e_calc_indir_rqns(rss_rqns, rqns, rss_vhca_ids, vhca_ids, num_rqns, hfunc, indir);
 	if (err)
 		goto out;
 
-	err = mlx5e_rqt_init(rqt, mdev, indir->max_table_size, rss_rqns,
+	err = mlx5e_rqt_init(rqt, mdev, indir->max_table_size, rss_rqns, rss_vhca_ids,
 			     indir->actual_table_size);
 
 out:
+	kvfree(rss_vhca_ids);
 	kvfree(rss_rqns);
 	return err;
 }
@@ -126,15 +184,20 @@ void mlx5e_rqt_destroy(struct mlx5e_rqt *rqt)
 	mlx5_core_destroy_rqt(rqt->mdev, rqt->rqtn);
 }
 
-static int mlx5e_rqt_redirect(struct mlx5e_rqt *rqt, u32 *rqns, unsigned int size)
+static int mlx5e_rqt_redirect(struct mlx5e_rqt *rqt, u32 *rqns, u32 *vhca_ids,
+			      unsigned int size)
 {
-	unsigned int i;
+	int entry_sz;
 	void *rqtc;
 	int inlen;
 	u32 *in;
 	int err;
 
-	inlen = MLX5_ST_SZ_BYTES(modify_rqt_in) + sizeof(u32) * size;
+	if (!rqt_verify_vhca_ids(rqt->mdev, vhca_ids, size))
+		return -EINVAL;
+
+	entry_sz = vhca_ids ? MLX5_ST_SZ_BYTES(rq_vhca) : MLX5_ST_SZ_BYTES(rq_num);
+	inlen = MLX5_ST_SZ_BYTES(modify_rqt_in) + entry_sz * size;
 	in = kvzalloc(inlen, GFP_KERNEL);
 	if (!in)
 		return -ENOMEM;
@@ -143,8 +206,8 @@ static int mlx5e_rqt_redirect(struct mlx5e_rqt *rqt, u32 *rqns, unsigned int siz
 
 	MLX5_SET(modify_rqt_in, in, bitmask.rqn_list, 1);
 	MLX5_SET(rqtc, rqtc, rqt_actual_size, size);
-	for (i = 0; i < size; i++)
-		MLX5_SET(rqtc, rqtc, rq_num[i], rqns[i]);
+
+	fill_rqn_list(rqtc, rqns, vhca_ids, size);
 
 	err = mlx5_core_modify_rqt(rqt->mdev, rqt->rqtn, in, inlen);
 
@@ -152,17 +215,21 @@ static int mlx5e_rqt_redirect(struct mlx5e_rqt *rqt, u32 *rqns, unsigned int siz
 	return err;
 }
 
-int mlx5e_rqt_redirect_direct(struct mlx5e_rqt *rqt, u32 rqn)
+int mlx5e_rqt_redirect_direct(struct mlx5e_rqt *rqt, u32 rqn, u32 *vhca_id)
 {
-	return mlx5e_rqt_redirect(rqt, &rqn, 1);
+	return mlx5e_rqt_redirect(rqt, &rqn, vhca_id, 1);
 }
 
-int mlx5e_rqt_redirect_indir(struct mlx5e_rqt *rqt, u32 *rqns, unsigned int num_rqns,
+int mlx5e_rqt_redirect_indir(struct mlx5e_rqt *rqt, u32 *rqns, u32 *vhca_ids,
+			     unsigned int num_rqns,
 			     u8 hfunc, struct mlx5e_rss_params_indir *indir)
 {
-	u32 *rss_rqns;
+	u32 *rss_rqns, *rss_vhca_ids = NULL;
 	int err;
 
+	if (!rqt_verify_vhca_ids(rqt->mdev, vhca_ids, num_rqns))
+		return -EINVAL;
+
 	if (WARN_ON(rqt->size != indir->max_table_size))
 		return -EINVAL;
 
@@ -170,13 +237,23 @@ int mlx5e_rqt_redirect_indir(struct mlx5e_rqt *rqt, u32 *rqns, unsigned int num_
 	if (!rss_rqns)
 		return -ENOMEM;
 
-	err = mlx5e_calc_indir_rqns(rss_rqns, rqns, num_rqns, hfunc, indir);
+	if (vhca_ids) {
+		rss_vhca_ids = kvmalloc_array(indir->actual_table_size, sizeof(*rss_vhca_ids),
+					      GFP_KERNEL);
+		if (!rss_vhca_ids) {
+			kvfree(rss_rqns);
+			return -ENOMEM;
+		}
+	}
+
+	err = mlx5e_calc_indir_rqns(rss_rqns, rqns, rss_vhca_ids, vhca_ids, num_rqns, hfunc, indir);
 	if (err)
 		goto out;
 
-	err = mlx5e_rqt_redirect(rqt, rss_rqns, indir->actual_table_size);
+	err = mlx5e_rqt_redirect(rqt, rss_rqns, rss_vhca_ids, indir->actual_table_size);
 
 out:
+	kvfree(rss_vhca_ids);
 	kvfree(rss_rqns);
 	return err;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rqt.h b/drivers/net/ethernet/mellanox/mlx5/core/en/rqt.h
index 77fba3ebd18d..e0bc30308c77 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rqt.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rqt.h
@@ -20,7 +20,7 @@ void mlx5e_rss_params_indir_init_uniform(struct mlx5e_rss_params_indir *indir,
 					 unsigned int num_channels);
 
 struct mlx5e_rqt {
-	struct mlx5_core_dev *mdev;
+	struct mlx5_core_dev *mdev; /* primary */
 	u32 rqtn;
 	u16 size;
 };
@@ -28,7 +28,7 @@ struct mlx5e_rqt {
 int mlx5e_rqt_init_direct(struct mlx5e_rqt *rqt, struct mlx5_core_dev *mdev,
 			  bool indir_enabled, u32 init_rqn, u32 indir_table_size);
 int mlx5e_rqt_init_indir(struct mlx5e_rqt *rqt, struct mlx5_core_dev *mdev,
-			 u32 *rqns, unsigned int num_rqns,
+			 u32 *rqns, u32 *vhca_ids, unsigned int num_rqns,
 			 u8 hfunc, struct mlx5e_rss_params_indir *indir);
 void mlx5e_rqt_destroy(struct mlx5e_rqt *rqt);
 
@@ -38,8 +38,9 @@ static inline u32 mlx5e_rqt_get_rqtn(struct mlx5e_rqt *rqt)
 }
 
 u32 mlx5e_rqt_size(struct mlx5_core_dev *mdev, unsigned int num_channels);
-int mlx5e_rqt_redirect_direct(struct mlx5e_rqt *rqt, u32 rqn);
-int mlx5e_rqt_redirect_indir(struct mlx5e_rqt *rqt, u32 *rqns, unsigned int num_rqns,
+int mlx5e_rqt_redirect_direct(struct mlx5e_rqt *rqt, u32 rqn, u32 *vhca_id);
+int mlx5e_rqt_redirect_indir(struct mlx5e_rqt *rqt, u32 *rqns, u32 *vhca_ids,
+			     unsigned int num_rqns,
 			     u8 hfunc, struct mlx5e_rss_params_indir *indir);
 
 #endif /* __MLX5_EN_RQT_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rss.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rss.c
index c1545a2e8d6d..5f742f896600 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rss.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rss.c
@@ -74,7 +74,7 @@ struct mlx5e_rss {
 	struct mlx5e_tir *tir[MLX5E_NUM_INDIR_TIRS];
 	struct mlx5e_tir *inner_tir[MLX5E_NUM_INDIR_TIRS];
 	struct mlx5e_rqt rqt;
-	struct mlx5_core_dev *mdev;
+	struct mlx5_core_dev *mdev; /* primary */
 	u32 drop_rqn;
 	bool inner_ft_support;
 	bool enabled;
@@ -473,21 +473,22 @@ int mlx5e_rss_obtain_tirn(struct mlx5e_rss *rss,
 	return 0;
 }
 
-static int mlx5e_rss_apply(struct mlx5e_rss *rss, u32 *rqns, unsigned int num_rqns)
+static int mlx5e_rss_apply(struct mlx5e_rss *rss, u32 *rqns, u32 *vhca_ids, unsigned int num_rqns)
 {
 	int err;
 
-	err = mlx5e_rqt_redirect_indir(&rss->rqt, rqns, num_rqns, rss->hash.hfunc, &rss->indir);
+	err = mlx5e_rqt_redirect_indir(&rss->rqt, rqns, vhca_ids, num_rqns, rss->hash.hfunc,
+				       &rss->indir);
 	if (err)
 		mlx5e_rss_warn(rss->mdev, "Failed to redirect RQT %#x to channels: err = %d\n",
 			       mlx5e_rqt_get_rqtn(&rss->rqt), err);
 	return err;
 }
 
-void mlx5e_rss_enable(struct mlx5e_rss *rss, u32 *rqns, unsigned int num_rqns)
+void mlx5e_rss_enable(struct mlx5e_rss *rss, u32 *rqns, u32 *vhca_ids, unsigned int num_rqns)
 {
 	rss->enabled = true;
-	mlx5e_rss_apply(rss, rqns, num_rqns);
+	mlx5e_rss_apply(rss, rqns, vhca_ids, num_rqns);
 }
 
 void mlx5e_rss_disable(struct mlx5e_rss *rss)
@@ -495,7 +496,7 @@ void mlx5e_rss_disable(struct mlx5e_rss *rss)
 	int err;
 
 	rss->enabled = false;
-	err = mlx5e_rqt_redirect_direct(&rss->rqt, rss->drop_rqn);
+	err = mlx5e_rqt_redirect_direct(&rss->rqt, rss->drop_rqn, NULL);
 	if (err)
 		mlx5e_rss_warn(rss->mdev, "Failed to redirect RQT %#x to drop RQ %#x: err = %d\n",
 			       mlx5e_rqt_get_rqtn(&rss->rqt), rss->drop_rqn, err);
@@ -568,7 +569,7 @@ int mlx5e_rss_get_rxfh(struct mlx5e_rss *rss, u32 *indir, u8 *key, u8 *hfunc)
 
 int mlx5e_rss_set_rxfh(struct mlx5e_rss *rss, const u32 *indir,
 		       const u8 *key, const u8 *hfunc,
-		       u32 *rqns, unsigned int num_rqns)
+		       u32 *rqns, u32 *vhca_ids, unsigned int num_rqns)
 {
 	bool changed_indir = false;
 	bool changed_hash = false;
@@ -608,7 +609,7 @@ int mlx5e_rss_set_rxfh(struct mlx5e_rss *rss, const u32 *indir,
 	}
 
 	if (changed_indir && rss->enabled) {
-		err = mlx5e_rss_apply(rss, rqns, num_rqns);
+		err = mlx5e_rss_apply(rss, rqns, vhca_ids, num_rqns);
 		if (err) {
 			mlx5e_rss_copy(rss, old_rss);
 			goto out;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rss.h b/drivers/net/ethernet/mellanox/mlx5/core/en/rss.h
index d1d0bc350e92..d0df98963c8d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rss.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rss.h
@@ -39,7 +39,7 @@ int mlx5e_rss_obtain_tirn(struct mlx5e_rss *rss,
 			  const struct mlx5e_packet_merge_param *init_pkt_merge_param,
 			  bool inner, u32 *tirn);
 
-void mlx5e_rss_enable(struct mlx5e_rss *rss, u32 *rqns, unsigned int num_rqns);
+void mlx5e_rss_enable(struct mlx5e_rss *rss, u32 *rqns, u32 *vhca_ids, unsigned int num_rqns);
 void mlx5e_rss_disable(struct mlx5e_rss *rss);
 
 int mlx5e_rss_packet_merge_set_param(struct mlx5e_rss *rss,
@@ -47,7 +47,7 @@ int mlx5e_rss_packet_merge_set_param(struct mlx5e_rss *rss,
 int mlx5e_rss_get_rxfh(struct mlx5e_rss *rss, u32 *indir, u8 *key, u8 *hfunc);
 int mlx5e_rss_set_rxfh(struct mlx5e_rss *rss, const u32 *indir,
 		       const u8 *key, const u8 *hfunc,
-		       u32 *rqns, unsigned int num_rqns);
+		       u32 *rqns, u32 *vhca_ids, unsigned int num_rqns);
 struct mlx5e_rss_params_hash mlx5e_rss_get_hash(struct mlx5e_rss *rss);
 u8 mlx5e_rss_get_hash_fields(struct mlx5e_rss *rss, enum mlx5_traffic_types tt);
 int mlx5e_rss_set_hash_fields(struct mlx5e_rss *rss, enum mlx5_traffic_types tt,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rx_res.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rx_res.c
index b23e224e3763..a86eade9a9e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rx_res.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rx_res.c
@@ -8,7 +8,7 @@
 #define MLX5E_MAX_NUM_RSS 16
 
 struct mlx5e_rx_res {
-	struct mlx5_core_dev *mdev;
+	struct mlx5_core_dev *mdev; /* primary */
 	enum mlx5e_rx_res_features features;
 	unsigned int max_nch;
 	u32 drop_rqn;
@@ -19,6 +19,7 @@ struct mlx5e_rx_res {
 	struct mlx5e_rss *rss[MLX5E_MAX_NUM_RSS];
 	bool rss_active;
 	u32 *rss_rqns;
+	u32 *rss_vhca_ids;
 	unsigned int rss_nch;
 
 	struct {
@@ -34,6 +35,13 @@ struct mlx5e_rx_res {
 
 /* API for rx_res_rss_* */
 
+static u32 *get_vhca_ids(struct mlx5e_rx_res *res, int offset)
+{
+	bool multi_vhca = res->features & MLX5E_RX_RES_FEATURE_MULTI_VHCA;
+
+	return multi_vhca ? res->rss_vhca_ids + offset : NULL;
+}
+
 void mlx5e_rx_res_rss_update_num_channels(struct mlx5e_rx_res *res, u32 nch)
 {
 	int i;
@@ -85,8 +93,11 @@ int mlx5e_rx_res_rss_init(struct mlx5e_rx_res *res, u32 *rss_idx, unsigned int i
 		return PTR_ERR(rss);
 
 	mlx5e_rss_set_indir_uniform(rss, init_nch);
-	if (res->rss_active)
-		mlx5e_rss_enable(rss, res->rss_rqns, res->rss_nch);
+	if (res->rss_active) {
+		u32 *vhca_ids = get_vhca_ids(res, 0);
+
+		mlx5e_rss_enable(rss, res->rss_rqns, vhca_ids, res->rss_nch);
+	}
 
 	res->rss[i] = rss;
 	*rss_idx = i;
@@ -153,10 +164,12 @@ static void mlx5e_rx_res_rss_enable(struct mlx5e_rx_res *res)
 
 	for (i = 0; i < MLX5E_MAX_NUM_RSS; i++) {
 		struct mlx5e_rss *rss = res->rss[i];
+		u32 *vhca_ids;
 
 		if (!rss)
 			continue;
-		mlx5e_rss_enable(rss, res->rss_rqns, res->rss_nch);
+		vhca_ids = get_vhca_ids(res, 0);
+		mlx5e_rss_enable(rss, res->rss_rqns, vhca_ids, res->rss_nch);
 	}
 }
 
@@ -200,6 +213,7 @@ int mlx5e_rx_res_rss_get_rxfh(struct mlx5e_rx_res *res, u32 rss_idx,
 int mlx5e_rx_res_rss_set_rxfh(struct mlx5e_rx_res *res, u32 rss_idx,
 			      const u32 *indir, const u8 *key, const u8 *hfunc)
 {
+	u32 *vhca_ids = get_vhca_ids(res, 0);
 	struct mlx5e_rss *rss;
 
 	if (rss_idx >= MLX5E_MAX_NUM_RSS)
@@ -209,7 +223,8 @@ int mlx5e_rx_res_rss_set_rxfh(struct mlx5e_rx_res *res, u32 rss_idx,
 	if (!rss)
 		return -ENOENT;
 
-	return mlx5e_rss_set_rxfh(rss, indir, key, hfunc, res->rss_rqns, res->rss_nch);
+	return mlx5e_rss_set_rxfh(rss, indir, key, hfunc, res->rss_rqns, vhca_ids,
+				  res->rss_nch);
 }
 
 int mlx5e_rx_res_rss_get_hash_fields(struct mlx5e_rx_res *res, u32 rss_idx,
@@ -280,11 +295,13 @@ struct mlx5e_rss *mlx5e_rx_res_rss_get(struct mlx5e_rx_res *res, u32 rss_idx)
 
 static void mlx5e_rx_res_free(struct mlx5e_rx_res *res)
 {
+	kvfree(res->rss_vhca_ids);
 	kvfree(res->rss_rqns);
 	kvfree(res);
 }
 
-static struct mlx5e_rx_res *mlx5e_rx_res_alloc(struct mlx5_core_dev *mdev, unsigned int max_nch)
+static struct mlx5e_rx_res *mlx5e_rx_res_alloc(struct mlx5_core_dev *mdev, unsigned int max_nch,
+					       bool multi_vhca)
 {
 	struct mlx5e_rx_res *rx_res;
 
@@ -298,6 +315,15 @@ static struct mlx5e_rx_res *mlx5e_rx_res_alloc(struct mlx5_core_dev *mdev, unsig
 		return NULL;
 	}
 
+	if (multi_vhca) {
+		rx_res->rss_vhca_ids = kvcalloc(max_nch, sizeof(*rx_res->rss_vhca_ids), GFP_KERNEL);
+		if (!rx_res->rss_vhca_ids) {
+			kvfree(rx_res->rss_rqns);
+			kvfree(rx_res);
+			return NULL;
+		}
+	}
+
 	return rx_res;
 }
 
@@ -424,10 +450,11 @@ mlx5e_rx_res_create(struct mlx5_core_dev *mdev, enum mlx5e_rx_res_features featu
 		    const struct mlx5e_packet_merge_param *init_pkt_merge_param,
 		    unsigned int init_nch)
 {
+	bool multi_vhca = features & MLX5E_RX_RES_FEATURE_MULTI_VHCA;
 	struct mlx5e_rx_res *res;
 	int err;
 
-	res = mlx5e_rx_res_alloc(mdev, max_nch);
+	res = mlx5e_rx_res_alloc(mdev, max_nch, multi_vhca);
 	if (!res)
 		return ERR_PTR(-ENOMEM);
 
@@ -504,10 +531,11 @@ static void mlx5e_rx_res_channel_activate_direct(struct mlx5e_rx_res *res,
 						 struct mlx5e_channels *chs,
 						 unsigned int ix)
 {
+	u32 *vhca_id = get_vhca_ids(res, ix);
 	u32 rqn = res->rss_rqns[ix];
 	int err;
 
-	err = mlx5e_rqt_redirect_direct(&res->channels[ix].direct_rqt, rqn);
+	err = mlx5e_rqt_redirect_direct(&res->channels[ix].direct_rqt, rqn, vhca_id);
 	if (err)
 		mlx5_core_warn(res->mdev, "Failed to redirect direct RQT %#x to RQ %#x (channel %u): err = %d\n",
 			       mlx5e_rqt_get_rqtn(&res->channels[ix].direct_rqt),
@@ -519,7 +547,7 @@ static void mlx5e_rx_res_channel_deactivate_direct(struct mlx5e_rx_res *res,
 {
 	int err;
 
-	err = mlx5e_rqt_redirect_direct(&res->channels[ix].direct_rqt, res->drop_rqn);
+	err = mlx5e_rqt_redirect_direct(&res->channels[ix].direct_rqt, res->drop_rqn, NULL);
 	if (err)
 		mlx5_core_warn(res->mdev, "Failed to redirect direct RQT %#x to drop RQ %#x (channel %u): err = %d\n",
 			       mlx5e_rqt_get_rqtn(&res->channels[ix].direct_rqt),
@@ -534,10 +562,12 @@ void mlx5e_rx_res_channels_activate(struct mlx5e_rx_res *res, struct mlx5e_chann
 	nch = mlx5e_channels_get_num(chs);
 
 	for (ix = 0; ix < chs->num; ix++) {
+		u32 *vhca_id = get_vhca_ids(res, ix);
+
 		if (mlx5e_channels_is_xsk(chs, ix))
-			mlx5e_channels_get_xsk_rqn(chs, ix, &res->rss_rqns[ix]);
+			mlx5e_channels_get_xsk_rqn(chs, ix, &res->rss_rqns[ix], vhca_id);
 		else
-			mlx5e_channels_get_regular_rqn(chs, ix, &res->rss_rqns[ix]);
+			mlx5e_channels_get_regular_rqn(chs, ix, &res->rss_rqns[ix], vhca_id);
 	}
 	res->rss_nch = chs->num;
 
@@ -554,7 +584,7 @@ void mlx5e_rx_res_channels_activate(struct mlx5e_rx_res *res, struct mlx5e_chann
 		if (!mlx5e_channels_get_ptp_rqn(chs, &rqn))
 			rqn = res->drop_rqn;
 
-		err = mlx5e_rqt_redirect_direct(&res->ptp.rqt, rqn);
+		err = mlx5e_rqt_redirect_direct(&res->ptp.rqt, rqn, NULL);
 		if (err)
 			mlx5_core_warn(res->mdev, "Failed to redirect direct RQT %#x to RQ %#x (PTP): err = %d\n",
 				       mlx5e_rqt_get_rqtn(&res->ptp.rqt),
@@ -573,7 +603,7 @@ void mlx5e_rx_res_channels_deactivate(struct mlx5e_rx_res *res)
 		mlx5e_rx_res_channel_deactivate_direct(res, ix);
 
 	if (res->features & MLX5E_RX_RES_FEATURE_PTP) {
-		err = mlx5e_rqt_redirect_direct(&res->ptp.rqt, res->drop_rqn);
+		err = mlx5e_rqt_redirect_direct(&res->ptp.rqt, res->drop_rqn, NULL);
 		if (err)
 			mlx5_core_warn(res->mdev, "Failed to redirect direct RQT %#x to drop RQ %#x (PTP): err = %d\n",
 				       mlx5e_rqt_get_rqtn(&res->ptp.rqt),
@@ -584,10 +614,12 @@ void mlx5e_rx_res_channels_deactivate(struct mlx5e_rx_res *res)
 void mlx5e_rx_res_xsk_update(struct mlx5e_rx_res *res, struct mlx5e_channels *chs,
 			     unsigned int ix, bool xsk)
 {
+	u32 *vhca_id = get_vhca_ids(res, ix);
+
 	if (xsk)
-		mlx5e_channels_get_xsk_rqn(chs, ix, &res->rss_rqns[ix]);
+		mlx5e_channels_get_xsk_rqn(chs, ix, &res->rss_rqns[ix], vhca_id);
 	else
-		mlx5e_channels_get_regular_rqn(chs, ix, &res->rss_rqns[ix]);
+		mlx5e_channels_get_regular_rqn(chs, ix, &res->rss_rqns[ix], vhca_id);
 
 	mlx5e_rx_res_rss_enable(res);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rx_res.h b/drivers/net/ethernet/mellanox/mlx5/core/en/rx_res.h
index 82aaba8a82b3..7b1a9f0f1874 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rx_res.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rx_res.h
@@ -18,6 +18,7 @@ struct mlx5e_rss_params_hash;
 enum mlx5e_rx_res_features {
 	MLX5E_RX_RES_FEATURE_INNER_FT = BIT(0),
 	MLX5E_RX_RES_FEATURE_PTP = BIT(1),
+	MLX5E_RX_RES_FEATURE_MULTI_VHCA = BIT(2),
 };
 
 /* Setup */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8d9b0cdb4e01..8cc636cf995a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5389,6 +5389,8 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
 	features = MLX5E_RX_RES_FEATURE_PTP;
 	if (mlx5_tunnel_inner_ft_supported(mdev))
 		features |= MLX5E_RX_RES_FEATURE_INNER_FT;
+	if (mlx5_get_sd(priv->mdev))
+		features |= MLX5E_RX_RES_FEATURE_MULTI_VHCA;
 
 	priv->rx_res = mlx5e_rx_res_create(priv->mdev, features, priv->max_nch, priv->drop_rq.rqn,
 					   &priv->channels.params.packet_merge,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 9fb2c057bd78..080d79d80dd6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -766,7 +766,7 @@ static int mlx5e_hairpin_create_indirect_rqt(struct mlx5e_hairpin *hp)
 		return err;
 
 	mlx5e_rss_params_indir_init_uniform(&indir, hp->num_channels);
-	err = mlx5e_rqt_init_indir(&hp->indir_rqt, mdev, hp->pair->rqn, hp->num_channels,
+	err = mlx5e_rqt_init_indir(&hp->indir_rqt, mdev, hp->pair->rqn, NULL, hp->num_channels,
 				   mlx5e_rx_res_get_current_hash(priv->rx_res).hfunc,
 				   &indir);
 
-- 
2.43.0



* [net-next V3 12/15] net/mlx5e: Support per-mdev queue counter
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (10 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 11/15] net/mlx5e: Support cross-vhca RSS Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 13/15] net/mlx5e: Block TLS device offload on combined SD netdev Saeed Mahameed
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Each queue counter object counts, in hardware, certain events for the RQs
attached to it, such as packet drops due to no receive WQE
(rx_out_of_buffer).

Each RQ can be attached to a queue counter only within the same vhca. To
still cover all RQs with these counters, we create multiple instances,
one per vhca.

The result that's shown to the user is now the sum of all instances.
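
For illustration, a simplified sketch of the aggregation model
(sd_sum_rx_out_of_buffer() and query_q_counter_oob() are hypothetical
names, not part of this patch; the actual command sequence is in the
en_stats.c hunk below):

/* Hypothetical sketch: each VHCA in the SD group owns one queue counter
 * instance; the value reported to the user is the sum over all of them.
 * query_q_counter_oob() stands in for the QUERY_Q_COUNTER command.
 */
static u32 sd_sum_rx_out_of_buffer(struct mlx5e_priv *priv)
{
	struct mlx5_core_dev *pos;
	u32 sum = 0;
	int i;

	mlx5_sd_for_each_dev(i, priv->mdev, pos)
		if (priv->q_counter[i])
			sum += query_q_counter_oob(pos, priv->q_counter[i]);

	return sum;
}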

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  7 +--
 .../mellanox/mlx5/core/en/monitor_stats.c     | 48 +++++++++++++------
 .../ethernet/mellanox/mlx5/core/en/params.c   |  7 +--
 .../ethernet/mellanox/mlx5/core/en/params.h   |  3 --
 .../net/ethernet/mellanox/mlx5/core/en/ptp.c  | 12 +++--
 .../net/ethernet/mellanox/mlx5/core/en/trap.c | 11 +++--
 .../mellanox/mlx5/core/en/xsk/setup.c         |  8 ++--
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 43 ++++++++++-------
 .../ethernet/mellanox/mlx5/core/en_stats.c    | 39 ++++++++++-----
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |  2 +-
 10 files changed, 111 insertions(+), 69 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index f6e78c465c7a..84db05fb9389 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -793,6 +793,7 @@ struct mlx5e_channel {
 	DECLARE_BITMAP(state, MLX5E_CHANNEL_NUM_STATES);
 	int                        ix;
 	int                        vec_ix;
+	int                        sd_ix;
 	int                        cpu;
 	/* Sync between icosq recovery and XSK enable/disable. */
 	struct mutex               icosq_recovery_lock;
@@ -916,7 +917,7 @@ struct mlx5e_priv {
 	bool                       tx_ptp_opened;
 	bool                       rx_ptp_opened;
 	struct hwtstamp_config     tstamp;
-	u16                        q_counter;
+	u16                        q_counter[MLX5_SD_MAX_GROUP_SZ];
 	u16                        drop_rq_q_counter;
 	struct notifier_block      events_nb;
 	struct notifier_block      blocking_events_nb;
@@ -1031,12 +1032,12 @@ struct mlx5e_xsk_param;
 
 struct mlx5e_rq_param;
 int mlx5e_open_rq(struct mlx5e_params *params, struct mlx5e_rq_param *param,
-		  struct mlx5e_xsk_param *xsk, int node,
+		  struct mlx5e_xsk_param *xsk, int node, u16 q_counter,
 		  struct mlx5e_rq *rq);
 #define MLX5E_RQ_WQES_TIMEOUT 20000 /* msecs */
 int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time);
 void mlx5e_close_rq(struct mlx5e_rq *rq);
-int mlx5e_create_rq(struct mlx5e_rq *rq, struct mlx5e_rq_param *param);
+int mlx5e_create_rq(struct mlx5e_rq *rq, struct mlx5e_rq_param *param, u16 q_counter);
 void mlx5e_destroy_rq(struct mlx5e_rq *rq);
 
 struct mlx5e_sq_param;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/monitor_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/monitor_stats.c
index 40c8df111754..e2d8d2754be0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/monitor_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/monitor_stats.c
@@ -20,10 +20,8 @@
 #define NUM_REQ_PPCNT_COUNTER_S1 MLX5_CMD_SET_MONITOR_NUM_PPCNT_COUNTER_SET1
 #define NUM_REQ_Q_COUNTERS_S1    MLX5_CMD_SET_MONITOR_NUM_Q_COUNTERS_SET1
 
-int mlx5e_monitor_counter_supported(struct mlx5e_priv *priv)
+static int mlx5e_monitor_counter_cap(struct mlx5_core_dev *mdev)
 {
-	struct mlx5_core_dev *mdev = priv->mdev;
-
 	if (!MLX5_CAP_GEN(mdev, max_num_of_monitor_counters))
 		return false;
 	if (MLX5_CAP_PCAM_REG(mdev, ppcnt) &&
@@ -36,24 +34,38 @@ int mlx5e_monitor_counter_supported(struct mlx5e_priv *priv)
 	return true;
 }
 
-static void mlx5e_monitor_counter_arm(struct mlx5e_priv *priv)
+int mlx5e_monitor_counter_supported(struct mlx5e_priv *priv)
+{
+	struct mlx5_core_dev *pos;
+	int i;
+
+	mlx5_sd_for_each_dev(i, priv->mdev, pos)
+		if (!mlx5e_monitor_counter_cap(pos))
+			return false;
+	return true;
+}
+
+static void mlx5e_monitor_counter_arm(struct mlx5_core_dev *mdev)
 {
 	u32 in[MLX5_ST_SZ_DW(arm_monitor_counter_in)] = {};
 
 	MLX5_SET(arm_monitor_counter_in, in, opcode,
 		 MLX5_CMD_OP_ARM_MONITOR_COUNTER);
-	mlx5_cmd_exec_in(priv->mdev, arm_monitor_counter, in);
+	mlx5_cmd_exec_in(mdev, arm_monitor_counter, in);
 }
 
 static void mlx5e_monitor_counters_work(struct work_struct *work)
 {
 	struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv,
 					       monitor_counters_work);
+	struct mlx5_core_dev *pos;
+	int i;
 
 	mutex_lock(&priv->state_lock);
 	mlx5e_stats_update_ndo_stats(priv);
 	mutex_unlock(&priv->state_lock);
-	mlx5e_monitor_counter_arm(priv);
+	mlx5_sd_for_each_dev(i, priv->mdev, pos)
+		mlx5e_monitor_counter_arm(pos);
 }
 
 static int mlx5e_monitor_event_handler(struct notifier_block *nb,
@@ -97,15 +109,13 @@ static int fill_monitor_counter_q_counter_set1(int cnt, int q_counter, u32 *in)
 }
 
 /* check if mlx5e_monitor_counter_supported before calling this function*/
-static void mlx5e_set_monitor_counter(struct mlx5e_priv *priv)
+static void mlx5e_set_monitor_counter(struct mlx5_core_dev *mdev, int q_counter)
 {
-	struct mlx5_core_dev *mdev = priv->mdev;
 	int max_num_of_counters = MLX5_CAP_GEN(mdev, max_num_of_monitor_counters);
 	int num_q_counters      = MLX5_CAP_GEN(mdev, num_q_monitor_counters);
 	int num_ppcnt_counters  = !MLX5_CAP_PCAM_REG(mdev, ppcnt) ? 0 :
 				  MLX5_CAP_GEN(mdev, num_ppcnt_monitor_counters);
 	u32 in[MLX5_ST_SZ_DW(set_monitor_counter_in)] = {};
-	int q_counter = priv->q_counter;
 	int cnt	= 0;
 
 	if (num_ppcnt_counters  >=  NUM_REQ_PPCNT_COUNTER_S1 &&
@@ -127,13 +137,17 @@ static void mlx5e_set_monitor_counter(struct mlx5e_priv *priv)
 /* check if mlx5e_monitor_counter_supported before calling this function*/
 void mlx5e_monitor_counter_init(struct mlx5e_priv *priv)
 {
+	struct mlx5_core_dev *pos;
+	int i;
+
 	INIT_WORK(&priv->monitor_counters_work, mlx5e_monitor_counters_work);
 	MLX5_NB_INIT(&priv->monitor_counters_nb, mlx5e_monitor_event_handler,
 		     MONITOR_COUNTER);
-	mlx5_eq_notifier_register(priv->mdev, &priv->monitor_counters_nb);
-
-	mlx5e_set_monitor_counter(priv);
-	mlx5e_monitor_counter_arm(priv);
+	mlx5_sd_for_each_dev(i, priv->mdev, pos) {
+		mlx5_eq_notifier_register(pos, &priv->monitor_counters_nb);
+		mlx5e_set_monitor_counter(pos, priv->q_counter[i]);
+		mlx5e_monitor_counter_arm(pos);
+	}
 	queue_work(priv->wq, &priv->update_stats_work);
 }
 
@@ -141,11 +155,15 @@ void mlx5e_monitor_counter_init(struct mlx5e_priv *priv)
 void mlx5e_monitor_counter_cleanup(struct mlx5e_priv *priv)
 {
 	u32 in[MLX5_ST_SZ_DW(set_monitor_counter_in)] = {};
+	struct mlx5_core_dev *pos;
+	int i;
 
 	MLX5_SET(set_monitor_counter_in, in, opcode,
 		 MLX5_CMD_OP_SET_MONITOR_COUNTER);
 
-	mlx5_cmd_exec_in(priv->mdev, set_monitor_counter, in);
-	mlx5_eq_notifier_unregister(priv->mdev, &priv->monitor_counters_nb);
+	mlx5_sd_for_each_dev(i, priv->mdev, pos) {
+		mlx5_cmd_exec_in(pos, set_monitor_counter, in);
+		mlx5_eq_notifier_unregister(pos, &priv->monitor_counters_nb);
+	}
 	cancel_work_sync(&priv->monitor_counters_work);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 8b99cc11f138..a3f31d9d527e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -959,7 +959,6 @@ static u8 rq_end_pad_mode(struct mlx5_core_dev *mdev, struct mlx5e_params *param
 int mlx5e_build_rq_param(struct mlx5_core_dev *mdev,
 			 struct mlx5e_params *params,
 			 struct mlx5e_xsk_param *xsk,
-			 u16 q_counter,
 			 struct mlx5e_rq_param *param)
 {
 	void *rqc = param->rqc;
@@ -1021,7 +1020,6 @@ int mlx5e_build_rq_param(struct mlx5_core_dev *mdev,
 	MLX5_SET(wq, wq, log_wq_stride,
 		 mlx5e_get_rqwq_log_stride(params->rq_wq_type, ndsegs));
 	MLX5_SET(wq, wq, pd,               mdev->mlx5e_res.hw_objs.pdn);
-	MLX5_SET(rqc, rqc, counter_set_id, q_counter);
 	MLX5_SET(rqc, rqc, vsd,            params->vlan_strip_disable);
 	MLX5_SET(rqc, rqc, scatter_fcs,    params->scatter_fcs_en);
 
@@ -1032,7 +1030,6 @@ int mlx5e_build_rq_param(struct mlx5_core_dev *mdev,
 }
 
 void mlx5e_build_drop_rq_param(struct mlx5_core_dev *mdev,
-			       u16 q_counter,
 			       struct mlx5e_rq_param *param)
 {
 	void *rqc = param->rqc;
@@ -1041,7 +1038,6 @@ void mlx5e_build_drop_rq_param(struct mlx5_core_dev *mdev,
 	MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_CYCLIC);
 	MLX5_SET(wq, wq, log_wq_stride,
 		 mlx5e_get_rqwq_log_stride(MLX5_WQ_TYPE_CYCLIC, 1));
-	MLX5_SET(rqc, rqc, counter_set_id, q_counter);
 
 	param->wq.buf_numa_node = dev_to_node(mlx5_core_dma_dev(mdev));
 }
@@ -1306,13 +1302,12 @@ void mlx5e_build_xdpsq_param(struct mlx5_core_dev *mdev,
 
 int mlx5e_build_channel_param(struct mlx5_core_dev *mdev,
 			      struct mlx5e_params *params,
-			      u16 q_counter,
 			      struct mlx5e_channel_param *cparam)
 {
 	u8 icosq_log_wq_sz, async_icosq_log_wq_sz;
 	int err;
 
-	err = mlx5e_build_rq_param(mdev, params, NULL, q_counter, &cparam->rq);
+	err = mlx5e_build_rq_param(mdev, params, NULL, &cparam->rq);
 	if (err)
 		return err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index 6800949dafbc..9a781f18b57f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -130,10 +130,8 @@ void mlx5e_build_create_cq_param(struct mlx5e_create_cq_param *ccp, struct mlx5e
 int mlx5e_build_rq_param(struct mlx5_core_dev *mdev,
 			 struct mlx5e_params *params,
 			 struct mlx5e_xsk_param *xsk,
-			 u16 q_counter,
 			 struct mlx5e_rq_param *param);
 void mlx5e_build_drop_rq_param(struct mlx5_core_dev *mdev,
-			       u16 q_counter,
 			       struct mlx5e_rq_param *param);
 void mlx5e_build_sq_param_common(struct mlx5_core_dev *mdev,
 				 struct mlx5e_sq_param *param);
@@ -149,7 +147,6 @@ void mlx5e_build_xdpsq_param(struct mlx5_core_dev *mdev,
 			     struct mlx5e_sq_param *param);
 int mlx5e_build_channel_param(struct mlx5_core_dev *mdev,
 			      struct mlx5e_params *params,
-			      u16 q_counter,
 			      struct mlx5e_channel_param *cparam);
 
 u16 mlx5e_calc_sq_stop_room(struct mlx5_core_dev *mdev, struct mlx5e_params *params);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
index fd4ef6431142..d0552751a974 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
@@ -646,7 +646,6 @@ static void mlx5e_ptp_build_sq_param(struct mlx5_core_dev *mdev,
 
 static void mlx5e_ptp_build_rq_param(struct mlx5_core_dev *mdev,
 				     struct net_device *netdev,
-				     u16 q_counter,
 				     struct mlx5e_ptp_params *ptp_params)
 {
 	struct mlx5e_rq_param *rq_params = &ptp_params->rq_param;
@@ -655,7 +654,7 @@ static void mlx5e_ptp_build_rq_param(struct mlx5_core_dev *mdev,
 	params->rq_wq_type = MLX5_WQ_TYPE_CYCLIC;
 	mlx5e_init_rq_type_params(mdev, params);
 	params->sw_mtu = netdev->max_mtu;
-	mlx5e_build_rq_param(mdev, params, NULL, q_counter, rq_params);
+	mlx5e_build_rq_param(mdev, params, NULL, rq_params);
 }
 
 static void mlx5e_ptp_build_params(struct mlx5e_ptp *c,
@@ -681,7 +680,7 @@ static void mlx5e_ptp_build_params(struct mlx5e_ptp *c,
 	/* RQ */
 	if (test_bit(MLX5E_PTP_STATE_RX, c->state)) {
 		params->vlan_strip_disable = orig->vlan_strip_disable;
-		mlx5e_ptp_build_rq_param(c->mdev, c->netdev, c->priv->q_counter, cparams);
+		mlx5e_ptp_build_rq_param(c->mdev, c->netdev, cparams);
 	}
 }
 
@@ -714,13 +713,16 @@ static int mlx5e_ptp_open_rq(struct mlx5e_ptp *c, struct mlx5e_params *params,
 			     struct mlx5e_rq_param *rq_param)
 {
 	int node = dev_to_node(c->mdev->device);
-	int err;
+	int err, sd_ix;
+	u16 q_counter;
 
 	err = mlx5e_init_ptp_rq(c, params, &c->rq);
 	if (err)
 		return err;
 
-	return mlx5e_open_rq(params, rq_param, NULL, node, &c->rq);
+	sd_ix = mlx5_sd_ch_ix_get_dev_ix(c->mdev, MLX5E_PTP_CHANNEL_IX);
+	q_counter = c->priv->q_counter[sd_ix];
+	return mlx5e_open_rq(params, rq_param, NULL, node, q_counter, &c->rq);
 }
 
 static int mlx5e_ptp_open_queues(struct mlx5e_ptp *c,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c b/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c
index ac458a8d10e0..53ca16cb9c41 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/trap.c
@@ -63,10 +63,12 @@ static int mlx5e_open_trap_rq(struct mlx5e_priv *priv, struct mlx5e_trap *t)
 	struct mlx5e_create_cq_param ccp = {};
 	struct dim_cq_moder trap_moder = {};
 	struct mlx5e_rq *rq = &t->rq;
+	u16 q_counter;
 	int node;
 	int err;
 
 	node = dev_to_node(mdev->device);
+	q_counter = priv->q_counter[0];
 
 	ccp.netdev   = priv->netdev;
 	ccp.wq       = priv->wq;
@@ -79,7 +81,7 @@ static int mlx5e_open_trap_rq(struct mlx5e_priv *priv, struct mlx5e_trap *t)
 		return err;
 
 	mlx5e_init_trap_rq(t, &t->params, rq);
-	err = mlx5e_open_rq(&t->params, rq_param, NULL, node, rq);
+	err = mlx5e_open_rq(&t->params, rq_param, NULL, node, q_counter, rq);
 	if (err)
 		goto err_destroy_cq;
 
@@ -116,15 +118,14 @@ static int mlx5e_create_trap_direct_rq_tir(struct mlx5_core_dev *mdev, struct ml
 }
 
 static void mlx5e_build_trap_params(struct mlx5_core_dev *mdev,
-				    int max_mtu, u16 q_counter,
-				    struct mlx5e_trap *t)
+				    int max_mtu, struct mlx5e_trap *t)
 {
 	struct mlx5e_params *params = &t->params;
 
 	params->rq_wq_type = MLX5_WQ_TYPE_CYCLIC;
 	mlx5e_init_rq_type_params(mdev, params);
 	params->sw_mtu = max_mtu;
-	mlx5e_build_rq_param(mdev, params, NULL, q_counter, &t->rq_param);
+	mlx5e_build_rq_param(mdev, params, NULL, &t->rq_param);
 }
 
 static struct mlx5e_trap *mlx5e_open_trap(struct mlx5e_priv *priv)
@@ -138,7 +139,7 @@ static struct mlx5e_trap *mlx5e_open_trap(struct mlx5e_priv *priv)
 	if (!t)
 		return ERR_PTR(-ENOMEM);
 
-	mlx5e_build_trap_params(priv->mdev, netdev->max_mtu, priv->q_counter, t);
+	mlx5e_build_trap_params(priv->mdev, netdev->max_mtu, t);
 
 	t->priv     = priv;
 	t->mdev     = priv->mdev;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
index 82e6abbc1734..06592b9f0424 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
@@ -49,10 +49,9 @@ bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
 static void mlx5e_build_xsk_cparam(struct mlx5_core_dev *mdev,
 				   struct mlx5e_params *params,
 				   struct mlx5e_xsk_param *xsk,
-				   u16 q_counter,
 				   struct mlx5e_channel_param *cparam)
 {
-	mlx5e_build_rq_param(mdev, params, xsk, q_counter, &cparam->rq);
+	mlx5e_build_rq_param(mdev, params, xsk, &cparam->rq);
 	mlx5e_build_xdpsq_param(mdev, params, xsk, &cparam->xdp_sq);
 }
 
@@ -93,6 +92,7 @@ static int mlx5e_open_xsk_rq(struct mlx5e_channel *c, struct mlx5e_params *param
 			     struct mlx5e_rq_param *rq_params, struct xsk_buff_pool *pool,
 			     struct mlx5e_xsk_param *xsk)
 {
+	u16 q_counter = c->priv->q_counter[c->sd_ix];
 	struct mlx5e_rq *xskrq = &c->xskrq;
 	int err;
 
@@ -100,7 +100,7 @@ static int mlx5e_open_xsk_rq(struct mlx5e_channel *c, struct mlx5e_params *param
 	if (err)
 		return err;
 
-	err = mlx5e_open_rq(params, rq_params, xsk, cpu_to_node(c->cpu), xskrq);
+	err = mlx5e_open_rq(params, rq_params, xsk, cpu_to_node(c->cpu), q_counter, xskrq);
 	if (err)
 		return err;
 
@@ -125,7 +125,7 @@ int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params,
 	if (!cparam)
 		return -ENOMEM;
 
-	mlx5e_build_xsk_cparam(priv->mdev, params, xsk, priv->q_counter, cparam);
+	mlx5e_build_xsk_cparam(priv->mdev, params, xsk, cparam);
 
 	err = mlx5e_open_cq(c->mdev, params->rx_cq_moderation, &cparam->rq.cqp, &ccp,
 			    &c->xskrq.cq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8cc636cf995a..91848eae4565 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1025,7 +1025,7 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 	mlx5_wq_destroy(&rq->wq_ctrl);
 }
 
-int mlx5e_create_rq(struct mlx5e_rq *rq, struct mlx5e_rq_param *param)
+int mlx5e_create_rq(struct mlx5e_rq *rq, struct mlx5e_rq_param *param, u16 q_counter)
 {
 	struct mlx5_core_dev *mdev = rq->mdev;
 	u8 ts_format;
@@ -1052,6 +1052,7 @@ int mlx5e_create_rq(struct mlx5e_rq *rq, struct mlx5e_rq_param *param)
 	MLX5_SET(rqc,  rqc, cqn,		rq->cq.mcq.cqn);
 	MLX5_SET(rqc,  rqc, state,		MLX5_RQC_STATE_RST);
 	MLX5_SET(rqc,  rqc, ts_format,		ts_format);
+	MLX5_SET(rqc,  rqc, counter_set_id,     q_counter);
 	MLX5_SET(wq,   wq,  log_wq_pg_sz,	rq->wq_ctrl.buf.page_shift -
 						MLX5_ADAPTER_PAGE_SHIFT);
 	MLX5_SET64(wq, wq,  dbr_addr,		rq->wq_ctrl.db.dma);
@@ -1275,7 +1276,7 @@ void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
 }
 
 int mlx5e_open_rq(struct mlx5e_params *params, struct mlx5e_rq_param *param,
-		  struct mlx5e_xsk_param *xsk, int node,
+		  struct mlx5e_xsk_param *xsk, int node, u16 q_counter,
 		  struct mlx5e_rq *rq)
 {
 	struct mlx5_core_dev *mdev = rq->mdev;
@@ -1288,7 +1289,7 @@ int mlx5e_open_rq(struct mlx5e_params *params, struct mlx5e_rq_param *param,
 	if (err)
 		return err;
 
-	err = mlx5e_create_rq(rq, param);
+	err = mlx5e_create_rq(rq, param, q_counter);
 	if (err)
 		goto err_free_rq;
 
@@ -2336,13 +2337,14 @@ static int mlx5e_set_tx_maxrate(struct net_device *dev, int index, u32 rate)
 static int mlx5e_open_rxq_rq(struct mlx5e_channel *c, struct mlx5e_params *params,
 			     struct mlx5e_rq_param *rq_params)
 {
+	u16 q_counter = c->priv->q_counter[c->sd_ix];
 	int err;
 
 	err = mlx5e_init_rxq_rq(c, params, rq_params->xdp_frag_size, &c->rq);
 	if (err)
 		return err;
 
-	return mlx5e_open_rq(params, rq_params, NULL, cpu_to_node(c->cpu), &c->rq);
+	return mlx5e_open_rq(params, rq_params, NULL, cpu_to_node(c->cpu), q_counter, &c->rq);
 }
 
 static int mlx5e_open_queues(struct mlx5e_channel *c,
@@ -2559,6 +2561,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	c->tstamp   = &priv->tstamp;
 	c->ix       = ix;
 	c->vec_ix   = vec_ix;
+	c->sd_ix    = mlx5_sd_ch_ix_get_dev_ix(mdev, ix);
 	c->cpu      = cpu;
 	c->pdev     = mlx5_core_dma_dev(mdev);
 	c->netdev   = priv->netdev;
@@ -2662,7 +2665,7 @@ int mlx5e_open_channels(struct mlx5e_priv *priv,
 	if (!chs->c || !cparam)
 		goto err_free;
 
-	err = mlx5e_build_channel_param(priv->mdev, &chs->params, priv->q_counter, cparam);
+	err = mlx5e_build_channel_param(priv->mdev, &chs->params, cparam);
 	if (err)
 		goto err_free;
 
@@ -3353,7 +3356,7 @@ int mlx5e_open_drop_rq(struct mlx5e_priv *priv,
 	struct mlx5e_cq *cq = &drop_rq->cq;
 	int err;
 
-	mlx5e_build_drop_rq_param(mdev, priv->drop_rq_q_counter, &rq_param);
+	mlx5e_build_drop_rq_param(mdev, &rq_param);
 
 	err = mlx5e_alloc_drop_cq(priv, cq, &cq_param);
 	if (err)
@@ -3367,7 +3370,7 @@ int mlx5e_open_drop_rq(struct mlx5e_priv *priv,
 	if (err)
 		goto err_destroy_cq;
 
-	err = mlx5e_create_rq(drop_rq, &rq_param);
+	err = mlx5e_create_rq(drop_rq, &rq_param, priv->drop_rq_q_counter);
 	if (err)
 		goto err_free_rq;
 
@@ -5282,13 +5285,17 @@ void mlx5e_create_q_counters(struct mlx5e_priv *priv)
 	u32 out[MLX5_ST_SZ_DW(alloc_q_counter_out)] = {};
 	u32 in[MLX5_ST_SZ_DW(alloc_q_counter_in)] = {};
 	struct mlx5_core_dev *mdev = priv->mdev;
-	int err;
+	struct mlx5_core_dev *pos;
+	int err, i;
 
 	MLX5_SET(alloc_q_counter_in, in, opcode, MLX5_CMD_OP_ALLOC_Q_COUNTER);
-	err = mlx5_cmd_exec_inout(mdev, alloc_q_counter, in, out);
-	if (!err)
-		priv->q_counter =
-			MLX5_GET(alloc_q_counter_out, out, counter_set_id);
+
+	mlx5_sd_for_each_dev(i, mdev, pos) {
+		err = mlx5_cmd_exec_inout(pos, alloc_q_counter, in, out);
+		if (!err)
+			priv->q_counter[i] =
+				MLX5_GET(alloc_q_counter_out, out, counter_set_id);
+	}
 
 	err = mlx5_cmd_exec_inout(mdev, alloc_q_counter, in, out);
 	if (!err)
@@ -5299,13 +5306,17 @@ void mlx5e_create_q_counters(struct mlx5e_priv *priv)
 void mlx5e_destroy_q_counters(struct mlx5e_priv *priv)
 {
 	u32 in[MLX5_ST_SZ_DW(dealloc_q_counter_in)] = {};
+	struct mlx5_core_dev *pos;
+	int i;
 
 	MLX5_SET(dealloc_q_counter_in, in, opcode,
 		 MLX5_CMD_OP_DEALLOC_Q_COUNTER);
-	if (priv->q_counter) {
-		MLX5_SET(dealloc_q_counter_in, in, counter_set_id,
-			 priv->q_counter);
-		mlx5_cmd_exec_in(priv->mdev, dealloc_q_counter, in);
+	mlx5_sd_for_each_dev(i, priv->mdev, pos) {
+		if (priv->q_counter[i]) {
+			MLX5_SET(dealloc_q_counter_in, in, counter_set_id,
+				 priv->q_counter[i]);
+			mlx5_cmd_exec_in(pos, dealloc_q_counter, in);
+		}
 	}
 
 	if (priv->drop_rq_q_counter) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 4b96ad657145..f3d0898bdbc6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -561,11 +561,23 @@ static const struct counter_desc drop_rq_stats_desc[] = {
 #define NUM_Q_COUNTERS			ARRAY_SIZE(q_stats_desc)
 #define NUM_DROP_RQ_COUNTERS		ARRAY_SIZE(drop_rq_stats_desc)
 
+static bool q_counter_any(struct mlx5e_priv *priv)
+{
+	struct mlx5_core_dev *pos;
+	int i;
+
+	mlx5_sd_for_each_dev(i, priv->mdev, pos)
+		if (priv->q_counter[i++])
+			return true;
+
+	return false;
+}
+
 static MLX5E_DECLARE_STATS_GRP_OP_NUM_STATS(qcnt)
 {
 	int num_stats = 0;
 
-	if (priv->q_counter)
+	if (q_counter_any(priv))
 		num_stats += NUM_Q_COUNTERS;
 
 	if (priv->drop_rq_q_counter)
@@ -578,7 +590,7 @@ static MLX5E_DECLARE_STATS_GRP_OP_FILL_STRS(qcnt)
 {
 	int i;
 
-	for (i = 0; i < NUM_Q_COUNTERS && priv->q_counter; i++)
+	for (i = 0; i < NUM_Q_COUNTERS && q_counter_any(priv); i++)
 		strcpy(data + (idx++) * ETH_GSTRING_LEN,
 		       q_stats_desc[i].format);
 
@@ -593,7 +605,7 @@ static MLX5E_DECLARE_STATS_GRP_OP_FILL_STATS(qcnt)
 {
 	int i;
 
-	for (i = 0; i < NUM_Q_COUNTERS && priv->q_counter; i++)
+	for (i = 0; i < NUM_Q_COUNTERS && q_counter_any(priv); i++)
 		data[idx++] = MLX5E_READ_CTR32_CPU(&priv->stats.qcnt,
 						   q_stats_desc, i);
 	for (i = 0; i < NUM_DROP_RQ_COUNTERS && priv->drop_rq_q_counter; i++)
@@ -607,18 +619,23 @@ static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(qcnt)
 	struct mlx5e_qcounter_stats *qcnt = &priv->stats.qcnt;
 	u32 out[MLX5_ST_SZ_DW(query_q_counter_out)] = {};
 	u32 in[MLX5_ST_SZ_DW(query_q_counter_in)] = {};
-	int ret;
+	struct mlx5_core_dev *pos;
+	u32 rx_out_of_buffer = 0;
+	int ret, i;
 
 	MLX5_SET(query_q_counter_in, in, opcode, MLX5_CMD_OP_QUERY_Q_COUNTER);
 
-	if (priv->q_counter) {
-		MLX5_SET(query_q_counter_in, in, counter_set_id,
-			 priv->q_counter);
-		ret = mlx5_cmd_exec_inout(priv->mdev, query_q_counter, in, out);
-		if (!ret)
-			qcnt->rx_out_of_buffer = MLX5_GET(query_q_counter_out,
-							  out, out_of_buffer);
+	mlx5_sd_for_each_dev(i, priv->mdev, pos) {
+		if (priv->q_counter[i]) {
+			MLX5_SET(query_q_counter_in, in, counter_set_id,
+				 priv->q_counter[i]);
+			ret = mlx5_cmd_exec_inout(pos, query_q_counter, in, out);
+			if (!ret)
+				rx_out_of_buffer += MLX5_GET(query_q_counter_out,
+							     out, out_of_buffer);
+		}
 	}
+	qcnt->rx_out_of_buffer = rx_out_of_buffer;
 
 	if (priv->drop_rq_q_counter) {
 		MLX5_SET(query_q_counter_in, in, counter_set_id,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 080d79d80dd6..31ed26cac9bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1169,7 +1169,7 @@ static int mlx5e_hairpin_flow_add(struct mlx5e_priv *priv,
 			MLX5_CAP_GEN(priv->mdev, log_min_hairpin_wq_data_sz),
 			MLX5_CAP_GEN(priv->mdev, log_max_hairpin_wq_data_sz));
 
-	params.q_counter = priv->q_counter;
+	params.q_counter = priv->q_counter[0];
 	err = devl_param_driverinit_value_get(
 		devlink, MLX5_DEVLINK_PARAM_ID_HAIRPIN_NUM_QUEUES, &val);
 	if (err) {
-- 
2.43.0



* [net-next V3 13/15] net/mlx5e: Block TLS device offload on combined SD netdev
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (11 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 12/15] net/mlx5e: Support per-mdev queue counter Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 14/15] net/mlx5: Enable SD feature Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev Saeed Mahameed
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

1) Each TX TLS device-offloaded context has its own TIS object. Extra work
is needed to get it working in an SD environment, where a stream can move
between different SQs (belonging to different mdevs).

2) Each RX TLS device-offloaded context needs a DEK object from the DEK
pool.

Extra work is needed to get it working in an SD environment, as the DEK
pool currently wrongly depends on the TX cap, and exists on the primary
device only.

Disallow this combination for now.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.c | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.c
index 984fa04bd331..e3e57c849436 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.c
@@ -96,7 +96,7 @@ bool mlx5e_is_ktls_rx(struct mlx5_core_dev *mdev)
 {
 	u8 max_sq_wqebbs = mlx5e_get_max_sq_wqebbs(mdev);
 
-	if (is_kdump_kernel() || !MLX5_CAP_GEN(mdev, tls_rx))
+	if (is_kdump_kernel() || !MLX5_CAP_GEN(mdev, tls_rx) || mlx5_get_sd(mdev))
 		return false;
 
 	/* Check the possibility to post the required ICOSQ WQEs. */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h
index f11075e67658..adc6d8ea0960 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h
@@ -11,6 +11,7 @@
 
 #ifdef CONFIG_MLX5_EN_TLS
 #include "lib/crypto.h"
+#include "lib/mlx5.h"
 
 struct mlx5_crypto_dek *mlx5_ktls_create_key(struct mlx5_crypto_dek_pool *dek_pool,
 					     struct tls_crypto_info *crypto_info);
@@ -61,7 +62,8 @@ void mlx5e_ktls_rx_resync_destroy_resp_list(struct mlx5e_ktls_resync_resp *resp_
 
 static inline bool mlx5e_is_ktls_tx(struct mlx5_core_dev *mdev)
 {
-	return !is_kdump_kernel() && MLX5_CAP_GEN(mdev, tls_tx);
+	return !is_kdump_kernel() && MLX5_CAP_GEN(mdev, tls_tx) &&
+		!mlx5_get_sd(mdev);
 }
 
 bool mlx5e_is_ktls_rx(struct mlx5_core_dev *mdev);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread
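
For reference, a minimal sketch (illustrative only, not the driver's actual
code; the wrapper name below is made up) of how such per-mdev checks typically
gate the advertised netdev TLS features -- with SD active, mlx5_get_sd(mdev)
is non-NULL, so neither check passes and the feature bits stay off:

/* Illustrative sketch; assumes the mlx5e kTLS and netdevice headers. */
static void example_set_tls_features(struct net_device *netdev,
                                     struct mlx5_core_dev *mdev)
{
        if (mlx5e_is_ktls_tx(mdev))
                netdev->hw_features |= NETIF_F_HW_TLS_TX;

        if (mlx5e_is_ktls_rx(mdev))
                netdev->hw_features |= NETIF_F_HW_TLS_RX;
}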

* [net-next V3 14/15] net/mlx5: Enable SD feature
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (12 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 13/15] net/mlx5e: Block TLS device offload on combined SD netdev Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-15  3:08 ` [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev Saeed Mahameed
  14 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Have an actual mlx5_sd instance in the core device, and fix the getter
accordingly. This allows the SD logic to take effect; the feature becomes
supported only from this point.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h | 3 ++-
 include/linux/mlx5/driver.h                        | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h
index 0810b92b48d0..37d5f445598c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h
@@ -59,10 +59,11 @@ struct mlx5_sd;
 
 static inline struct mlx5_sd *mlx5_get_sd(struct mlx5_core_dev *dev)
 {
-	return NULL;
+	return dev->sd;
 }
 
 static inline void mlx5_set_sd(struct mlx5_core_dev *dev, struct mlx5_sd *sd)
 {
+	dev->sd = sd;
 }
 #endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 41f03b352401..bf9324a31ae9 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -823,6 +823,7 @@ struct mlx5_core_dev {
 	struct blocking_notifier_head macsec_nh;
 #endif
 	u64 num_ipsec_offloads;
+	struct mlx5_sd          *sd;
 };
 
 struct mlx5_db {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
                   ` (13 preceding siblings ...)
  2024-02-15  3:08 ` [net-next V3 14/15] net/mlx5: Enable SD feature Saeed Mahameed
@ 2024-02-15  3:08 ` Saeed Mahameed
  2024-02-16  5:23   ` Jakub Kicinski
  2024-02-19 18:04   ` Jiri Pirko
  14 siblings, 2 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-15  3:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet
  Cc: Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

From: Tariq Toukan <tariqt@nvidia.com>

Add documentation for the multi-pf netdev feature.
Describe the mlx5 implementation and design decisions.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 Documentation/networking/index.rst           |   1 +
 Documentation/networking/multi-pf-netdev.rst | 157 +++++++++++++++++++
 2 files changed, 158 insertions(+)
 create mode 100644 Documentation/networking/multi-pf-netdev.rst

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 69f3d6dcd9fd..473d72c36d61 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -74,6 +74,7 @@ Contents:
    mpls-sysctl
    mptcp-sysctl
    multiqueue
+   multi-pf-netdev
    napi
    net_cachelines/index
    netconsole
diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
new file mode 100644
index 000000000000..6ef2ac448d1e
--- /dev/null
+++ b/Documentation/networking/multi-pf-netdev.rst
@@ -0,0 +1,157 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============
+Multi-PF Netdev
+===============
+
+Contents
+========
+
+- `Background`_
+- `Overview`_
+- `mlx5 implementation`_
+- `Channels distribution`_
+- `Topology`_
+- `Steering`_
+- `Mutually exclusive features`_
+
+Background
+==========
+
+The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to
+connect directly to the network, each through its own dedicated PCIe interface, either via a
+connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for
+a single card. This eliminates network traffic traversing the internal bus between the sockets,
+significantly reducing overhead and latency, in addition to reducing CPU utilization and
+increasing network throughput.
+
+Overview
+========
+
+This feature adds support for combining multiple devices (PFs) of the same port in a Multi-PF
+environment under one netdev instance. Passing traffic through different devices belonging to
+different NUMA sockets saves cross-NUMA traffic and allows apps running on the same netdev from
+different NUMA nodes to still feel device proximity and achieve improved performance.
+
+mlx5 implementation
+===================
+
+Multi-PF (Socket Direct) in mlx5 is achieved by grouping together PFs that belong to the same
+NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
+to represent all of them; symmetrically, we destroy the netdev whenever any of the PFs is removed.
+
+The netdev network channels are distributed between all devices; a proper configuration utilizes
+the channels of the close NUMA node when working with a certain app/CPU.
+
+We pick one PF to be the primary (leader), and it fills a special role. The other devices
+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
+mode, no south <-> north traffic flows directly through a secondary PF; it needs the assistance of
+the leader PF (east <-> west traffic) to function. All RX/TX traffic is steered through the primary
+to/from the secondaries.
+
+Currently, we limit the support to PFs only, and up to two PFs (sockets).
+
+Channels distribution
+=====================
+
+We distribute the channels between the different PFs to achieve local NUMA node performance
+on multiple NUMA nodes.
+
+Each combined channel works against one specific PF, creating all of its datapath queues against
+it. We distribute channels to PFs in a round-robin policy.
+
+::
+
+        Example for 2 PFs and 6 channels:
+        +--------+--------+
+        | ch idx | PF idx |
+        +--------+--------+
+        |    0   |    0   |
+        |    1   |    1   |
+        |    2   |    0   |
+        |    3   |    1   |
+        |    4   |    0   |
+        |    5   |    1   |
+        +--------+--------+
+
+
+We prefer this round-robin distribution policy over another intuitive distribution, in which we
+would first assign one half of the channels to PF0 and then the second half to PF1.
+
+The reason we prefer round-robin is that it is less influenced by changes in the number of channels:
+the mapping between a channel index and a PF is fixed, no matter how many channels the user
+configures. As channel stats persist across a channel's closure, changing the mapping every time
+would make the accumulated stats less representative of the channel's history.
+
+This is achieved by using the correct core device instance (mdev) in each channel, instead of all
+of them using the same instance under "priv->mdev".
+
+Topology
+========
+Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
+Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.
+For now, debugfs is being used to reflect the topology:
+
+.. code-block:: bash
+
+        $ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/sd/*
+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/group_id:0x00000101
+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/primary:0000:08:00.0 vhca 0x0
+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/secondary_0:0000:09:00.0 vhca 0x2
+
+Steering
+========
+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
+
+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
+traffic to the other PFs via cross-vhca steering capabilities. There is nothing special about the
+RSS table content, except that it needs a capable device to point to another PF's receive queues.
+
+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
+go out to the network through it.
+
+In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
+the PF on the same NUMA node as that CPU.
+
+XPS default config example::
+
+        NUMA node(s):          2
+        NUMA node0 CPU(s):     0-11
+        NUMA node1 CPU(s):     12-23
+
+PF0 on node0, PF1 on node1.
+
+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
+
+Mutually exclusive features
+===========================
+
+The nature of Multi-PF, where different channels work with different PFs, conflicts with
+stateful features where the state is maintained in one of the PFs.
+For example, in the TLS device-offload feature, special context objects are created per connection
+and maintained in a single PF. Transitioning between different RQs/SQs would break the feature.
+Hence, we disable this combination for now.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread
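
A minimal, self-contained sketch (illustrative only, not driver code) of the
round-robin channel-to-PF mapping described in the "Channels distribution"
section above; channel i always lands on PF (i % number of PFs), no matter
how many channels are configured:

#include <stdio.h>

int main(void)
{
        const int num_pfs = 2;      /* e.g. one PF per NUMA socket */
        const int num_channels = 6; /* user-configured channel count */

        /* Channel i creates its datapath queues against PF (i % num_pfs),
         * so the mapping stays fixed when the channel count changes.
         */
        for (int ch = 0; ch < num_channels; ch++)
                printf("channel %d -> PF %d\n", ch, ch % num_pfs);

        return 0;
}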

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-15  3:08 ` [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev Saeed Mahameed
@ 2024-02-16  5:23   ` Jakub Kicinski
  2024-02-19 15:26     ` Tariq Toukan
  2024-02-19 18:04   ` Jiri Pirko
  1 sibling, 1 reply; 37+ messages in thread
From: Jakub Kicinski @ 2024-02-16  5:23 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

On Wed, 14 Feb 2024 19:08:14 -0800 Saeed Mahameed wrote:
> +The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to

There are multiple devlink instances, right?
In that case we should call out that there may be more than one.

> +Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
> +Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.

I don't anticipate it to be particularly hard, let's not merge
half-baked code and force users to grow workarounds that are hard 
to remove.

Also could you add examples of how the queue and napis look when listed
via the netdev genl on these devices?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-16  5:23   ` Jakub Kicinski
@ 2024-02-19 15:26     ` Tariq Toukan
  2024-02-21  1:33       ` Jakub Kicinski
  0 siblings, 1 reply; 37+ messages in thread
From: Tariq Toukan @ 2024-02-19 15:26 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky



On 16/02/2024 7:23, Jakub Kicinski wrote:
> On Wed, 14 Feb 2024 19:08:14 -0800 Saeed Mahameed wrote:
>> +The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to

Hi Jakub,

> 
> There are multiple devlink instances, right?

Right.

> In that case we should call out that there may be more than one.
> 

We are combining the PFs in the netdev level.
I did not focus on the parts that we do not touch.
That's why I didn't mention the sysfs for example, until you asked.

For example, irqns for the two PFs are still reachable as they used to, 
under two distinct paths:
ll /sys/bus/pci/devices/0000\:08\:00.0/msi_irqs/
ll /sys/bus/pci/devices/0000\:09\:00.0/msi_irqs/

>> +Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
>> +Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.
> 
> I don't anticipate it to be particularly hard, let's not merge
> half-baked code and force users to grow workarounds that are hard
> to remove.
> 

Changing sysfs to expose queues from multiple PFs under one path might 
be misleading and break backward compatibility. IMO it should come as an 
extension to the existing entries.

Anyway, the interesting info exposed in sysfs is now available through 
the netdev genl.

Now, is this sysfs part integral to the feature? IMO, no. This in-driver 
feature is large enough to be completed in stages and not as a one shot.

> Also could you add examples of how the queue and napis look when listed
> via the netdev genl on these devices?
> 

Sure. Example for a 24-cores system:

$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml \
    --dump queue-get --json '{"ifindex": 5}'
[{'id': 0, 'ifindex': 5, 'napi-id': 539, 'type': 'rx'},
  {'id': 1, 'ifindex': 5, 'napi-id': 540, 'type': 'rx'},
  {'id': 2, 'ifindex': 5, 'napi-id': 541, 'type': 'rx'},
  {'id': 3, 'ifindex': 5, 'napi-id': 542, 'type': 'rx'},
  {'id': 4, 'ifindex': 5, 'napi-id': 543, 'type': 'rx'},
  {'id': 5, 'ifindex': 5, 'napi-id': 544, 'type': 'rx'},
  {'id': 6, 'ifindex': 5, 'napi-id': 545, 'type': 'rx'},
  {'id': 7, 'ifindex': 5, 'napi-id': 546, 'type': 'rx'},
  {'id': 8, 'ifindex': 5, 'napi-id': 547, 'type': 'rx'},
  {'id': 9, 'ifindex': 5, 'napi-id': 548, 'type': 'rx'},
  {'id': 10, 'ifindex': 5, 'napi-id': 549, 'type': 'rx'},
  {'id': 11, 'ifindex': 5, 'napi-id': 550, 'type': 'rx'},
  {'id': 12, 'ifindex': 5, 'napi-id': 551, 'type': 'rx'},
  {'id': 13, 'ifindex': 5, 'napi-id': 552, 'type': 'rx'},
  {'id': 14, 'ifindex': 5, 'napi-id': 553, 'type': 'rx'},
  {'id': 15, 'ifindex': 5, 'napi-id': 554, 'type': 'rx'},
  {'id': 16, 'ifindex': 5, 'napi-id': 555, 'type': 'rx'},
  {'id': 17, 'ifindex': 5, 'napi-id': 556, 'type': 'rx'},
  {'id': 18, 'ifindex': 5, 'napi-id': 557, 'type': 'rx'},
  {'id': 19, 'ifindex': 5, 'napi-id': 558, 'type': 'rx'},
  {'id': 20, 'ifindex': 5, 'napi-id': 559, 'type': 'rx'},
  {'id': 21, 'ifindex': 5, 'napi-id': 560, 'type': 'rx'},
  {'id': 22, 'ifindex': 5, 'napi-id': 561, 'type': 'rx'},
  {'id': 23, 'ifindex': 5, 'napi-id': 562, 'type': 'rx'},
  {'id': 0, 'ifindex': 5, 'napi-id': 539, 'type': 'tx'},
  {'id': 1, 'ifindex': 5, 'napi-id': 540, 'type': 'tx'},
  {'id': 2, 'ifindex': 5, 'napi-id': 541, 'type': 'tx'},
  {'id': 3, 'ifindex': 5, 'napi-id': 542, 'type': 'tx'},
  {'id': 4, 'ifindex': 5, 'napi-id': 543, 'type': 'tx'},
  {'id': 5, 'ifindex': 5, 'napi-id': 544, 'type': 'tx'},
  {'id': 6, 'ifindex': 5, 'napi-id': 545, 'type': 'tx'},
  {'id': 7, 'ifindex': 5, 'napi-id': 546, 'type': 'tx'},
  {'id': 8, 'ifindex': 5, 'napi-id': 547, 'type': 'tx'},
  {'id': 9, 'ifindex': 5, 'napi-id': 548, 'type': 'tx'},
  {'id': 10, 'ifindex': 5, 'napi-id': 549, 'type': 'tx'},
  {'id': 11, 'ifindex': 5, 'napi-id': 550, 'type': 'tx'},
  {'id': 12, 'ifindex': 5, 'napi-id': 551, 'type': 'tx'},
  {'id': 13, 'ifindex': 5, 'napi-id': 552, 'type': 'tx'},
  {'id': 14, 'ifindex': 5, 'napi-id': 553, 'type': 'tx'},
  {'id': 15, 'ifindex': 5, 'napi-id': 554, 'type': 'tx'},
  {'id': 16, 'ifindex': 5, 'napi-id': 555, 'type': 'tx'},
  {'id': 17, 'ifindex': 5, 'napi-id': 556, 'type': 'tx'},
  {'id': 18, 'ifindex': 5, 'napi-id': 557, 'type': 'tx'},
  {'id': 19, 'ifindex': 5, 'napi-id': 558, 'type': 'tx'},
  {'id': 20, 'ifindex': 5, 'napi-id': 559, 'type': 'tx'},
  {'id': 21, 'ifindex': 5, 'napi-id': 560, 'type': 'tx'},
  {'id': 22, 'ifindex': 5, 'napi-id': 561, 'type': 'tx'},
  {'id': 23, 'ifindex': 5, 'napi-id': 562, 'type': 'tx'}]

$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml \
    --dump napi-get --json='{"ifindex": 5}'
[{'id': 562, 'ifindex': 5, 'irq': 84},
  {'id': 561, 'ifindex': 5, 'irq': 83},
  {'id': 560, 'ifindex': 5, 'irq': 82},
  {'id': 559, 'ifindex': 5, 'irq': 81},
  {'id': 558, 'ifindex': 5, 'irq': 80},
  {'id': 557, 'ifindex': 5, 'irq': 79},
  {'id': 556, 'ifindex': 5, 'irq': 78},
  {'id': 555, 'ifindex': 5, 'irq': 77},
  {'id': 554, 'ifindex': 5, 'irq': 76},
  {'id': 553, 'ifindex': 5, 'irq': 75},
  {'id': 552, 'ifindex': 5, 'irq': 74},
  {'id': 551, 'ifindex': 5, 'irq': 73},
  {'id': 550, 'ifindex': 5, 'irq': 72},
  {'id': 549, 'ifindex': 5, 'irq': 71},
  {'id': 548, 'ifindex': 5, 'irq': 70},
  {'id': 547, 'ifindex': 5, 'irq': 69},
  {'id': 546, 'ifindex': 5, 'irq': 68},
  {'id': 545, 'ifindex': 5, 'irq': 67},
  {'id': 544, 'ifindex': 5, 'irq': 66},
  {'id': 543, 'ifindex': 5, 'irq': 65},
  {'id': 542, 'ifindex': 5, 'irq': 64},
  {'id': 541, 'ifindex': 5, 'irq': 63},
  {'id': 540, 'ifindex': 5, 'irq': 39},
  {'id': 539, 'ifindex': 5, 'irq': 36}]

^ permalink raw reply	[flat|nested] 37+ messages in thread
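
A small sketch (illustrative only; the IRQ numbers are sampled from the
napi-get output above) of mapping those IRQs to NUMA nodes via
/proc/irq/<n>/node, to confirm which socket serves each channel:

#include <stdio.h>

int main(void)
{
        const int irqs[] = { 36, 39, 63, 84 }; /* sample IRQs from the dump above */
        char path[64];
        int node;

        for (unsigned int i = 0; i < sizeof(irqs) / sizeof(irqs[0]); i++) {
                FILE *f;

                snprintf(path, sizeof(path), "/proc/irq/%d/node", irqs[i]);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fscanf(f, "%d", &node) == 1)
                        printf("irq %d -> NUMA node %d\n", irqs[i], node);
                fclose(f);
        }
        return 0;
}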

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-15  3:08 ` [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev Saeed Mahameed
  2024-02-16  5:23   ` Jakub Kicinski
@ 2024-02-19 18:04   ` Jiri Pirko
  1 sibling, 0 replies; 37+ messages in thread
From: Jiri Pirko @ 2024-02-19 18:04 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky

Thu, Feb 15, 2024 at 04:08:14AM CET, saeed@kernel.org wrote:
>From: Tariq Toukan <tariqt@nvidia.com>
>
>Add documentation for the multi-pf netdev feature.
>Describe the mlx5 implementation and design decisions.
>
>Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>---
> Documentation/networking/index.rst           |   1 +
> Documentation/networking/multi-pf-netdev.rst | 157 +++++++++++++++++++
> 2 files changed, 158 insertions(+)
> create mode 100644 Documentation/networking/multi-pf-netdev.rst
>
>diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
>index 69f3d6dcd9fd..473d72c36d61 100644
>--- a/Documentation/networking/index.rst
>+++ b/Documentation/networking/index.rst
>@@ -74,6 +74,7 @@ Contents:
>    mpls-sysctl
>    mptcp-sysctl
>    multiqueue
>+   multi-pf-netdev
>    napi
>    net_cachelines/index
>    netconsole
>diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
>new file mode 100644
>index 000000000000..6ef2ac448d1e
>--- /dev/null
>+++ b/Documentation/networking/multi-pf-netdev.rst
>@@ -0,0 +1,157 @@
>+.. SPDX-License-Identifier: GPL-2.0
>+.. include:: <isonum.txt>
>+
>+===============
>+Multi-PF Netdev
>+===============
>+
>+Contents
>+========
>+
>+- `Background`_
>+- `Overview`_
>+- `mlx5 implementation`_
>+- `Channels distribution`_
>+- `Topology`_
>+- `Steering`_
>+- `Mutually exclusive features`_
>+
>+Background
>+==========
>+
>+The advanced Multi-PF NIC technology enables several CPUs within a multi-socket server to
>+connect directly to the network, each through its own dedicated PCIe interface. Through either a
>+connection harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a
>+single card. This results in eliminating the network traffic traversing over the internal bus
>+between the sockets, significantly reducing overhead and latency, in addition to reducing CPU
>+utilization and increasing network throughput.
>+
>+Overview
>+========
>+
>+This feature adds support for combining multiple devices (PFs) of the same port in a Multi-PF
>+environment under one netdev instance. Passing traffic through different devices belonging to
>+different NUMA sockets saves cross-numa traffic and allows apps running on the same netdev from
>+different numas to still feel a sense of proximity to the device and achieve improved performance.
>+
>+mlx5 implementation
>+===================
>+
>+Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same
>+NIC and has the socket-direct property enabled, once all PFS are probed, we create a single netdev

How do you enable this property?


>+to represent all of them, symmetrically, we destroy the netdev whenever any of the PFs is removed.
>+
>+The netdev network channels are distributed between all devices, a proper configuration would utilize
>+the correct close numa node when working on a certain app/cpu.
>+
>+We pick one PF to be a primary (leader), and it fills a special role. The other devices
>+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
>+mode, no south <-> north traffic flowing directly through a secondary PF. It needs the assistance of
>+the leader PF (east <-> west traffic) to function. All RX/TX traffic is steered through the primary
>+to/from the secondaries.
>+
>+Currently, we limit the support to PFs only, and up to two PFs (sockets).

For the record, could you please describe why exactly you didn't use
drivers/base/component.c infrastructure for this? I know you told me,
but I don't recall. Better to have this written down, I believe.


>+
>+Channels distribution
>+=====================
>+
>+We distribute the channels between the different PFs to achieve local NUMA node performance
>+on multiple NUMA nodes.
>+
>+Each combined channel works against one specific PF, creating all its datapath queues against it. We distribute
>+channels to PFs in a round-robin policy.
>+
>+::
>+
>+        Example for 2 PFs and 6 channels:
>+        +--------+--------+
>+        | ch idx | PF idx |
>+        +--------+--------+
>+        |    0   |    0   |
>+        |    1   |    1   |
>+        |    2   |    0   |
>+        |    3   |    1   |
>+        |    4   |    0   |
>+        |    5   |    1   |
>+        +--------+--------+
>+
>+
>+We prefer this round-robin distribution policy over another suggested intuitive distribution, in
>+which we first distribute one half of the channels to PF0 and then the second half to PF1.
>+
>+The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
>+mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
>+As the channel stats are persistent across channel's closure, changing the mapping every single time
>+would turn the accumulative stats less representing of the channel's history.
>+
>+This is achieved by using the correct core device instance (mdev) in each channel, instead of them
>+all using the same instance under "priv->mdev".
>+
>+Topology
>+========
>+Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
>+Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.
>+For now, debugfs is being used to reflect the topology:
>+
>+.. code-block:: bash
>+
>+        $ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/sd/*
>+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/group_id:0x00000101
>+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/primary:0000:08:00.0 vhca 0x0
>+        /sys/kernel/debug/mlx5/0000:08:00.0/sd/secondary_0:0000:09:00.0 vhca 0x2

Ugh :/

SD is something that is likely going to stay with us for some time.
Can't we have some proper UAPI instead of this? IDK.


>+
>+Steering
>+========
>+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
>+
>+In RX, the steering tables belong to the primary PF only, and it is its role to distribute incoming
>+traffic to other PFs, via cross-vhca steering capabilities. Nothing special about the RSS table
>+content, except that it needs a capable device to point to the receive queues of a different PF.
>+
>+In TX, the primary PF creates a new TX flow table, which is aliased by the secondaries, so they can
>+go out to the network through it.
>+
>+In addition, we set default XPS configuration that, based on the cpu, selects an SQ belonging to the
>+PF on the same node as the cpu.
>+
>+XPS default config example:
>+
>+NUMA node(s):          2
>+NUMA node0 CPU(s):     0-11
>+NUMA node1 CPU(s):     12-23

How can the user know which queue is bound to which CPU?


>+
>+PF0 on node0, PF1 on node1.
>+
>+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
>+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
>+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
>+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
>+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
>+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
>+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
>+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
>+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
>+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
>+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
>+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
>+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
>+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
>+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
>+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
>+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
>+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
>+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
>+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
>+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
>+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
>+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
>+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
>+
>+Mutually exclusive features
>+===========================
>+
>+The nature of Multi-PF, where different channels work with different PFs, conflicts with
>+stateful features where the state is maintained in one of the PFs.
>+For example, in the TLS device-offload feature, special context objects are created per connection
>+and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
>+we disable this combination for now.
>-- 
>2.43.0
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-19 15:26     ` Tariq Toukan
@ 2024-02-21  1:33       ` Jakub Kicinski
  2024-02-21  2:10         ` Saeed Mahameed
  2024-02-22  7:51         ` Greg Kroah-Hartman
  0 siblings, 2 replies; 37+ messages in thread
From: Jakub Kicinski @ 2024-02-21  1:33 UTC (permalink / raw)
  To: Tariq Toukan, Greg Kroah-Hartman
  Cc: Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky

On Mon, 19 Feb 2024 17:26:36 +0200 Tariq Toukan wrote:
> > There are multiple devlink instances, right?  
> 
> Right.

Just to be clear I'm asking you questions about things which need to 
be covered by the doc :)

> > In that case we should call out that there may be more than one.
> >   
> 
> We are combining the PFs in the netdev level.
> I did not focus on the parts that we do not touch.

Sure but one of the goals here is to drive convergence.
So if another vendor is on the fence let's nudge them towards the same
decision.

> That's why I didn't mention the sysfs for example, until you asked.
> 
> For example, irqns for the two PFs are still reachable as they used to, 
> under two distinct paths:
> ll /sys/bus/pci/devices/0000\:08\:00.0/msi_irqs/
> ll /sys/bus/pci/devices/0000\:09\:00.0/msi_irqs/
> 
> >> +Currently the sysfs is kept untouched, letting the netdev sysfs point to its primary PF.
> >> +Enhancing sysfs to reflect the actual topology is to be discussed and contributed separately.  
> > 
> > I don't anticipate it to be particularly hard, let's not merge
> > half-baked code and force users to grow workarounds that are hard
> > to remove.
> 
> Changing sysfs to expose queues from multiple PFs under one path might 
> be misleading and break backward compatibility. IMO it should come as an 
> extension to the existing entries.

I don't know what "multiple PFs under one path" means, links in VFs are
one to one, right? :)

> Anyway, the interesting info exposed in sysfs is now available through 
> the netdev genl.

Right, that's true.

Greg, we have a feature here where a single device of class net has
multiple "bus parents". We used to have one attr under class net
(device) which is a link to the bus parent. Now we either need to add
more or not bother with the linking of the whole device. Is there any
precedent / preference for solving this from the device model
perspective?

> Now, is this sysfs part integral to the feature? IMO, no. This in-driver 
> feature is large enough to be completed in stages and not as a one shot.

It's not a question of size and/or implementing everything.
What I want to make sure is that you surveyed the known user space
implementations sufficiently to know what looks at those links,
and perhaps ethtool -i.
Perhaps the answer is indeed "nothing much will care" and given
we can link IRQs correctly we put that as a conclusion in the doc.

Saying "sysfs is coming soon" is not adding much information :(

> > Also could you add examples of how the queue and napis look when listed
> > via the netdev genl on these devices?
> >   
> 
> Sure. Example for a 24-cores system:

Could you reconfigure to 5 channels to make the output asymmetric and
shorter and include the example in the doc?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-21  1:33       ` Jakub Kicinski
@ 2024-02-21  2:10         ` Saeed Mahameed
  2024-02-22  7:51         ` Greg Kroah-Hartman
  1 sibling, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-02-21  2:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tariq Toukan, Greg Kroah-Hartman, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky

On 20 Feb 17:33, Jakub Kicinski wrote:
>On Mon, 19 Feb 2024 17:26:36 +0200 Tariq Toukan wrote:
>> > There are multiple devlink instances, right?
>>
>> Right.
>
>Just to be clear I'm asking you questions about things which need to
>be covered by the doc :)
>
>> > In that case we should call out that there may be more than one.
>> >
>>
>> We are combining the PFs in the netdev level.
>> I did not focus on the parts that we do not touch.
>
>> Anyway, the interesting info exposed in sysfs is now available through
>> the netdev genl.
>
>Right, that's true.
>

[...]

>Greg, we have a feature here where a single device of class net has
>multiple "bus parents". We used to have one attr under class net
>(device) which is a link to the bus parent. Now we either need to add
>more or not bother with the linking of the whole device. Is there any
>precedent / preference for solving this from the device model
>perspective?
>
>> Now, is this sysfs part integral to the feature? IMO, no. This in-driver
>> feature is large enough to be completed in stages and not as a one shot.
>
>It's not a question of size and/or implementing everything.
>What I want to make sure is that you surveyed the known user space
>implementations sufficiently to know what looks at those links,
>and perhaps ethtool -i.
>Perhaps the answer is indeed "nothing much will care" and given
>we can link IRQs correctly we put that as a conclusion in the doc.
>
>Saying "sysfs is coming soon" is not adding much information :(
>

Linking multiple parent devices at the netdev subsystem level doesn't add
anything; the netdev abstraction should stop at linking rx/tx channels to
physical IRQs and NUMA nodes. Complicating the sysfs would require a proper
infrastructure to model the multi-pf mode for all vendors to use uniformly,
but for what? Currently there's no configuration mechanism for this feature
yet, and we don't need one at the moment. Once configuration becomes necessary,
I would recommend adding one infrastructure for all vendors to register to
at the parent device level, which would handle the sysfs/devlink abstraction,
leave the netdev abstraction as is (IRQ/NUMA), and maybe take this a step
further and give the user control of attaching specific channels to specific
IRQs/NUMA nodes.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-21  1:33       ` Jakub Kicinski
  2024-02-21  2:10         ` Saeed Mahameed
@ 2024-02-22  7:51         ` Greg Kroah-Hartman
  2024-02-22 23:00           ` Jakub Kicinski
  1 sibling, 1 reply; 37+ messages in thread
From: Greg Kroah-Hartman @ 2024-02-22  7:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tariq Toukan, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky

On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
> Greg, we have a feature here where a single device of class net has
> multiple "bus parents". We used to have one attr under class net
> (device) which is a link to the bus parent. Now we either need to add
> more or not bother with the linking of the whole device. Is there any
> precedent / preference for solving this from the device model
> perspective?

How, logically, can a netdevice be controlled properly from 2 parent
devices on two different busses?  How is that even possible from a
physical point-of-view?  What exact bus types are involved here?

This "shouldn't" be possible as in the end, it's usually a PCI device
handling this all, right?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-22  7:51         ` Greg Kroah-Hartman
@ 2024-02-22 23:00           ` Jakub Kicinski
  2024-02-23  1:23             ` Samudrala, Sridhar
  0 siblings, 1 reply; 37+ messages in thread
From: Jakub Kicinski @ 2024-02-22 23:00 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Tariq Toukan, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky

On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
> > Greg, we have a feature here where a single device of class net has
> > multiple "bus parents". We used to have one attr under class net
> > (device) which is a link to the bus parent. Now we either need to add
> > more or not bother with the linking of the whole device. Is there any
> > precedent / preference for solving this from the device model
> > perspective?  
> 
> How, logically, can a netdevice be controlled properly from 2 parent
> devices on two different busses?  How is that even possible from a
> physical point-of-view?  What exact bus types are involved here?

Two PCIe buses, two endpoints, two networking ports. It's one piece
of silicon, tho, so the "slices" can talk to each other internally.
The NVRAM configuration tells both endpoints that the user wants
them "bonded", when the PCI drivers probe they "find each other"
using some cookie or DSN or whatnot. And once they did, they spawn
a single netdev.

> This "shouldn't" be possible as in the end, it's usually a PCI device
> handling this all, right?

It's really a special type of bonding of two netdevs. Like you'd bond
two ports to get twice the bandwidth. With the twist that the balancing
is done on NUMA proximity, rather than traffic hash.

Well, plus, the major twist that it's all done magically "for you"
in the vendor driver, and the two "lower" devices are not visible.
You only see the resulting bond.

I personally think that the magic hides as many problems as it
introduces and we'd be better off creating two separate netdevs.
And then a new type of "device bond" on top. Small win that
the "new device bond on top" can be shared code across vendors.

But there's only so many hours in the day to argue with vendors.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-22 23:00           ` Jakub Kicinski
@ 2024-02-23  1:23             ` Samudrala, Sridhar
  2024-02-23  2:05               ` Jay Vosburgh
  2024-02-23  9:36               ` Jiri Pirko
  0 siblings, 2 replies; 37+ messages in thread
From: Samudrala, Sridhar @ 2024-02-23  1:23 UTC (permalink / raw)
  To: Jakub Kicinski, Greg Kroah-Hartman
  Cc: Tariq Toukan, Saeed Mahameed, David S. Miller, Paolo Abeni,
	Eric Dumazet, Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky, jay.vosburgh



On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>> Greg, we have a feature here where a single device of class net has
>>> multiple "bus parents". We used to have one attr under class net
>>> (device) which is a link to the bus parent. Now we either need to add
>>> more or not bother with the linking of the whole device. Is there any
>>> precedent / preference for solving this from the device model
>>> perspective?
>>
>> How, logically, can a netdevice be controlled properly from 2 parent
>> devices on two different busses?  How is that even possible from a
>> physical point-of-view?  What exact bus types are involved here?
> 
> Two PCIe buses, two endpoints, two networking ports. It's one piece

Isn't it only 1 networking port with multiple PFs?

> of silicon, tho, so the "slices" can talk to each other internally.
> The NVRAM configuration tells both endpoints that the user wants
> them "bonded", when the PCI drivers probe they "find each other"
> using some cookie or DSN or whatnot. And once they did, they spawn
> a single netdev.
> 
>> This "shouldn't" be possible as in the end, it's usually a PCI device
>> handling this all, right?
> 
> It's really a special type of bonding of two netdevs. Like you'd bond
> two ports to get twice the bandwidth. With the twist that the balancing
> is done on NUMA proximity, rather than traffic hash.
> 
> Well, plus, the major twist that it's all done magically "for you"
> in the vendor driver, and the two "lower" devices are not visible.
> You only see the resulting bond.
> 
> I personally think that the magic hides as many problems as it
> introduces and we'd be better off creating two separate netdevs.
> And then a new type of "device bond" on top. Small win that
> the "new device bond on top" can be shared code across vendors.

Yes. We have been exploring a small extension to bonding driver to 
enable a single numa-aware multi-threaded application to efficiently 
utilize multiple NICs across numa nodes.

Here is an early version of a patch we have been trying and seems to be 
working well.

=========================================================================
bonding: select tx device based on rx device of a flow

If napi_id is cached in the sk associated with skb, use the
device associated with napi_id as the transmit device.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 7a7d584f378a..77e3bf6c4502 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -5146,6 +5146,30 @@ static struct slave *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
         unsigned int count;
         u32 hash;

+       if (skb->sk) {
+               int napi_id = skb->sk->sk_napi_id;
+               struct net_device *dev;
+               int idx;
+
+               rcu_read_lock();
+               dev = dev_get_by_napi_id(napi_id);
+               rcu_read_unlock();
+
+               if (!dev)
+                       goto hash;
+
+               count = slaves ? READ_ONCE(slaves->count) : 0;
+               if (unlikely(!count))
+                       return NULL;
+
+               for (idx = 0; idx < count; idx++) {
+                       slave = slaves->arr[idx];
+                       if (slave->dev->ifindex == dev->ifindex)
+                               return slave;
+               }
+       }
+
+hash:
         hash = bond_xmit_hash(bond, skb);
         count = slaves ? READ_ONCE(slaves->count) : 0;
         if (unlikely(!count))
=========================================================================

If we make this as a configurable bonding option, would this be an 
acceptable solution to accelerate numa-aware apps?

> 
> But there's only so many hours in the day to argue with vendors.
> 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-23  1:23             ` Samudrala, Sridhar
@ 2024-02-23  2:05               ` Jay Vosburgh
  2024-02-23  5:00                 ` Samudrala, Sridhar
  2024-02-23  9:36               ` Jiri Pirko
  1 sibling, 1 reply; 37+ messages in thread
From: Jay Vosburgh @ 2024-02-23  2:05 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jakub Kicinski, Greg Kroah-Hartman, Tariq Toukan, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky

Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>>> Greg, we have a feature here where a single device of class net has
>>>> multiple "bus parents". We used to have one attr under class net
>>>> (device) which is a link to the bus parent. Now we either need to add
>>>> more or not bother with the linking of the whole device. Is there any
>>>> precedent / preference for solving this from the device model
>>>> perspective?
>>>
>>> How, logically, can a netdevice be controlled properly from 2 parent
>>> devices on two different busses?  How is that even possible from a
>>> physical point-of-view?  What exact bus types are involved here?
>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>
>Isn't it only 1 networking port with multiple PFs?
>
>> of silicon, tho, so the "slices" can talk to each other internally.
>> The NVRAM configuration tells both endpoints that the user wants
>> them "bonded", when the PCI drivers probe they "find each other"
>> using some cookie or DSN or whatnot. And once they did, they spawn
>> a single netdev.
>> 
>>> This "shouldn't" be possible as in the end, it's usually a PCI device
>>> handling this all, right?
>> It's really a special type of bonding of two netdevs. Like you'd bond
>> two ports to get twice the bandwidth. With the twist that the balancing
>> is done on NUMA proximity, rather than traffic hash.
>> Well, plus, the major twist that it's all done magically "for you"
>> in the vendor driver, and the two "lower" devices are not visible.
>> You only see the resulting bond.
>> I personally think that the magic hides as many problems as it
>> introduces and we'd be better off creating two separate netdevs.
>> And then a new type of "device bond" on top. Small win that
>> the "new device bond on top" can be shared code across vendors.
>
>Yes. We have been exploring a small extension to bonding driver to enable
>a single numa-aware multi-threaded application to efficiently utilize
>multiple NICs across numa nodes.

	Is this referring to something like the multi-pf under
discussion, or just generically with two arbitrary network devices
installed one each per NUMA node?

>Here is an early version of a patch we have been trying and seems to be
>working well.
>
>=========================================================================
>bonding: select tx device based on rx device of a flow
>
>If napi_id is cached in the sk associated with skb, use the
>device associated with napi_id as the transmit device.
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>
>diff --git a/drivers/net/bonding/bond_main.c
>b/drivers/net/bonding/bond_main.c
>index 7a7d584f378a..77e3bf6c4502 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -5146,6 +5146,30 @@ static struct slave
>*bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>        unsigned int count;
>        u32 hash;
>
>+       if (skb->sk) {
>+               int napi_id = skb->sk->sk_napi_id;
>+               struct net_device *dev;
>+               int idx;
>+
>+               rcu_read_lock();
>+               dev = dev_get_by_napi_id(napi_id);
>+               rcu_read_unlock();
>+
>+               if (!dev)
>+                       goto hash;
>+
>+               count = slaves ? READ_ONCE(slaves->count) : 0;
>+               if (unlikely(!count))
>+                       return NULL;
>+
>+               for (idx = 0; idx < count; idx++) {
>+                       slave = slaves->arr[idx];
>+                       if (slave->dev->ifindex == dev->ifindex)
>+                               return slave;
>+               }
>+       }
>+
>+hash:
>        hash = bond_xmit_hash(bond, skb);
>        count = slaves ? READ_ONCE(slaves->count) : 0;
>        if (unlikely(!count))
>=========================================================================
>
>If we make this as a configurable bonding option, would this be an
>acceptable solution to accelerate numa-aware apps?

	Assuming for the moment this is for "regular" network devices
installed one per NUMA node, why do this in bonding instead of at a
higher layer (multiple subnets or ECMP, for example)?

	Is the intent here that the bond would aggregate its interfaces
via LACP with the peer being some kind of cross-chassis link aggregation
(MLAG, et al)?

	Given that sk_napi_id seems to be associated with
CONFIG_NET_RX_BUSY_POLL, am I correct in presuming the target
applications are DPDK-style busy poll packet processors?

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-23  2:05               ` Jay Vosburgh
@ 2024-02-23  5:00                 ` Samudrala, Sridhar
  2024-02-23  9:40                   ` Jiri Pirko
  0 siblings, 1 reply; 37+ messages in thread
From: Samudrala, Sridhar @ 2024-02-23  5:00 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Jakub Kicinski, Greg Kroah-Hartman, Tariq Toukan, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky



On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
> Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>> On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>>>> Greg, we have a feature here where a single device of class net has
>>>>> multiple "bus parents". We used to have one attr under class net
>>>>> (device) which is a link to the bus parent. Now we either need to add
>>>>> more or not bother with the linking of the whole device. Is there any
>>>>> precedent / preference for solving this from the device model
>>>>> perspective?
>>>>
>>>> How, logically, can a netdevice be controlled properly from 2 parent
>>>> devices on two different busses?  How is that even possible from a
>>>> physical point-of-view?  What exact bus types are involved here?
>>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>>
>> Isn't it only 1 networking port with multiple PFs?
>>
>>> of silicon, tho, so the "slices" can talk to each other internally.
>>> The NVRAM configuration tells both endpoints that the user wants
>>> them "bonded", when the PCI drivers probe they "find each other"
>>> using some cookie or DSN or whatnot. And once they did, they spawn
>>> a single netdev.
>>>
>>>> This "shouldn't" be possible as in the end, it's usually a PCI device
>>>> handling this all, right?
>>> It's really a special type of bonding of two netdevs. Like you'd bond
>>> two ports to get twice the bandwidth. With the twist that the balancing
>>> is done on NUMA proximity, rather than traffic hash.
>>> Well, plus, the major twist that it's all done magically "for you"
>>> in the vendor driver, and the two "lower" devices are not visible.
>>> You only see the resulting bond.
>>> I personally think that the magic hides as many problems as it
>>> introduces and we'd be better off creating two separate netdevs.
>>> And then a new type of "device bond" on top. Small win that
>>> the "new device bond on top" can be shared code across vendors.
>>
>> Yes. We have been exploring a small extension to bonding driver to enable
>> a single numa-aware multi-threaded application to efficiently utilize
>> multiple NICs across numa nodes.
> 
> 	Is this referring to something like the multi-pf under
> discussion, or just generically with two arbitrary network devices
> installed one each per NUMA node?

Normal network devices one per NUMA node

> 
>> Here is an early version of a patch we have been trying and seems to be
>> working well.
>>
>> =========================================================================
>> bonding: select tx device based on rx device of a flow
>>
>> If napi_id is cached in the sk associated with skb, use the
>> device associated with napi_id as the transmit device.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>
>> diff --git a/drivers/net/bonding/bond_main.c
>> b/drivers/net/bonding/bond_main.c
>> index 7a7d584f378a..77e3bf6c4502 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -5146,6 +5146,30 @@ static struct slave
>> *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>>         unsigned int count;
>>         u32 hash;
>>
>> +       if (skb->sk) {
>> +               int napi_id = skb->sk->sk_napi_id;
>> +               struct net_device *dev;
>> +               int idx;
>> +
>> +               rcu_read_lock();
>> +               dev = dev_get_by_napi_id(napi_id);
>> +               rcu_read_unlock();
>> +
>> +               if (!dev)
>> +                       goto hash;
>> +
>> +               count = slaves ? READ_ONCE(slaves->count) : 0;
>> +               if (unlikely(!count))
>> +                       return NULL;
>> +
>> +               for (idx = 0; idx < count; idx++) {
>> +                       slave = slaves->arr[idx];
>> +                       if (slave->dev->ifindex == dev->ifindex)
>> +                               return slave;
>> +               }
>> +       }
>> +
>> +hash:
>>         hash = bond_xmit_hash(bond, skb);
>>         count = slaves ? READ_ONCE(slaves->count) : 0;
>>         if (unlikely(!count))
>> =========================================================================
>>
>> If we make this as a configurable bonding option, would this be an
>> acceptable solution to accelerate numa-aware apps?
> 
> 	Assuming for the moment this is for "regular" network devices
> installed one per NUMA node, why do this in bonding instead of at a
> higher layer (multiple subnets or ECMP, for example)?
> 
> 	Is the intent here that the bond would aggregate its interfaces
> via LACP with the peer being some kind of cross-chassis link aggregation
> (MLAG, et al)?

Yes, a basic LACP bonding setup. There could be multiple peers connecting 
to the server via switch providing LACP based link aggregation. No 
cross-chassis MLAG.

> 
> 	Given that sk_napi_id seems to be associated with
> CONFIG_NET_RX_BUSY_POLL, am I correct in presuming the target
> applications are DPDK-style busy poll packet processors?

I am using sk_napi_id to get the incoming interface. Busy poll is not a 
requirement and this can be used with any socket-based app.

In a numa-aware app, the app threads are split into pools of threads 
aligned to each numa node and the associated NIC. In the rx path, a 
thread is picked from a pool associated with a numa node using 
SO_INCOMING_CPU or a similar method, with irq affinity set to the local 
cores. The napi id is cached in the sk in the receive path. In the tx 
path, the bonding driver picks the same NIC as the outgoing device using 
the cached sk->napi_id.

This enables numa affinitized data path for an app thread doing network 
I/O. If we also configure xps based on rx queues, tx and rx of a TCP 
flow can be aligned to the same queue pair of a NIC even when using bonding.

> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply	[flat|nested] 37+ messages in thread
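
A minimal userspace sketch (illustrative only, not from this thread; assumes
libnuma is available) of the receive-side placement described above: read
SO_INCOMING_CPU on an accepted socket and hand the connection to the worker
pool of that CPU's NUMA node:

#include <numa.h>           /* numa_node_of_cpu(); link with -lnuma */
#include <sys/socket.h>

/* Returns the NUMA node whose worker pool should own this connection,
 * or 0 as a default when the incoming CPU cannot be determined.
 */
int pick_pool_for_connection(int connfd)
{
        int cpu = -1;
        socklen_t len = sizeof(cpu);

        if (getsockopt(connfd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) || cpu < 0)
                return 0;

        int node = numa_node_of_cpu(cpu);   /* NUMA node that received the flow */

        return node < 0 ? 0 : node;
}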

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-23  1:23             ` Samudrala, Sridhar
  2024-02-23  2:05               ` Jay Vosburgh
@ 2024-02-23  9:36               ` Jiri Pirko
  2024-02-28  2:06                 ` Jakub Kicinski
  1 sibling, 1 reply; 37+ messages in thread
From: Jiri Pirko @ 2024-02-23  9:36 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jakub Kicinski, Greg Kroah-Hartman, Tariq Toukan, Saeed Mahameed,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky,
	jay.vosburgh

Fri, Feb 23, 2024 at 02:23:32AM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>> > > Greg, we have a feature here where a single device of class net has
>> > > multiple "bus parents". We used to have one attr under class net
>> > > (device) which is a link to the bus parent. Now we either need to add
>> > > more or not bother with the linking of the whole device. Is there any
>> > > precedent / preference for solving this from the device model
>> > > perspective?
>> > 
>> > How, logically, can a netdevice be controlled properly from 2 parent
>> > devices on two different busses?  How is that even possible from a
>> > physical point-of-view?  What exact bus types are involved here?
>> 
>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>
>Isn't it only 1 networking port with multiple PFs?

AFAIK, yes. I have one such device in hand: one physical port,
2 PCI slots, 2 PFs on the PCI bus.


>
>> of silicon, tho, so the "slices" can talk to each other internally.
>> The NVRAM configuration tells both endpoints that the user wants
>> them "bonded", when the PCI drivers probe they "find each other"
>> using some cookie or DSN or whatnot. And once they did, they spawn
>> a single netdev.
>> 
>> > This "shouldn't" be possible as in the end, it's usually a PCI device
>> > handling this all, right?
>> 
>> It's really a special type of bonding of two netdevs. Like you'd bond
>> two ports to get twice the bandwidth. With the twist that the balancing
>> is done on NUMA proximity, rather than traffic hash.
>> 
>> Well, plus, the major twist that it's all done magically "for you"
>> in the vendor driver, and the two "lower" devices are not visible.
>> You only see the resulting bond.
>> 
>> I personally think that the magic hides as many problems as it
>> introduces and we'd be better off creating two separate netdevs.
>> And then a new type of "device bond" on top. Small win that
>> the "new device bond on top" can be shared code across vendors.
>
>Yes. We have been exploring a small extension to bonding driver to enable a
>single numa-aware multi-threaded application to efficiently utilize multiple
>NICs across numa nodes.

Bonding was my immediate response when we discussed this internally for
the first time. But I eventually had to admit it is probably not that
suitable in this case, here's why:
1) there are not 2 physical ports, only one.
2) it is basically a matter of device layout/provisioning that this
   feature should be enabled, not user configuration.
3) other subsystems like RDMA would benefit from the same feature, so
   this is not netdev specific in general.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-23  5:00                 ` Samudrala, Sridhar
@ 2024-02-23  9:40                   ` Jiri Pirko
  2024-02-23 23:56                     ` Samudrala, Sridhar
  0 siblings, 1 reply; 37+ messages in thread
From: Jiri Pirko @ 2024-02-23  9:40 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jay Vosburgh, Jakub Kicinski, Greg Kroah-Hartman, Tariq Toukan,
	Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky

Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>> Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>> > On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> > > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> > > > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>> > > > > Greg, we have a feature here where a single device of class net has
>> > > > > multiple "bus parents". We used to have one attr under class net
>> > > > > (device) which is a link to the bus parent. Now we either need to add
>> > > > > more or not bother with the linking of the whole device. Is there any
>> > > > > precedent / preference for solving this from the device model
>> > > > > perspective?
>> > > > 
>> > > > How, logically, can a netdevice be controlled properly from 2 parent
>> > > > devices on two different busses?  How is that even possible from a
>> > > > physical point-of-view?  What exact bus types are involved here?
>> > > Two PCIe buses, two endpoints, two networking ports. It's one piece
>> > 
>> > Isn't it only 1 networking port with multiple PFs?
>> > 
>> > > of silicon, tho, so the "slices" can talk to each other internally.
>> > > The NVRAM configuration tells both endpoints that the user wants
>> > > them "bonded", when the PCI drivers probe they "find each other"
>> > > using some cookie or DSN or whatnot. And once they did, they spawn
>> > > a single netdev.
>> > > 
>> > > > This "shouldn't" be possible as in the end, it's usually a PCI device
>> > > > handling this all, right?
>> > > It's really a special type of bonding of two netdevs. Like you'd bond
>> > > two ports to get twice the bandwidth. With the twist that the balancing
>> > > is done on NUMA proximity, rather than traffic hash.
>> > > Well, plus, the major twist that it's all done magically "for you"
>> > > in the vendor driver, and the two "lower" devices are not visible.
>> > > You only see the resulting bond.
>> > > I personally think that the magic hides as many problems as it
>> > > introduces and we'd be better off creating two separate netdevs.
>> > > And then a new type of "device bond" on top. Small win that
>> > > the "new device bond on top" can be shared code across vendors.
>> > 
>> > Yes. We have been exploring a small extension to bonding driver to enable
>> > a single numa-aware multi-threaded application to efficiently utilize
>> > multiple NICs across numa nodes.
>> 
>> 	Is this referring to something like the multi-pf under
>> discussion, or just generically with two arbitrary network devices
>> installed one each per NUMA node?
>
>Normal network devices one per NUMA node
>
>> 
>> > Here is an early version of a patch we have been trying and seems to be
>> > working well.
>> > 
>> > =========================================================================
>> > bonding: select tx device based on rx device of a flow
>> > 
>> > If napi_id is cached in the sk associated with skb, use the
>> > device associated with napi_id as the transmit device.
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > 
>> > diff --git a/drivers/net/bonding/bond_main.c
>> > b/drivers/net/bonding/bond_main.c
>> > index 7a7d584f378a..77e3bf6c4502 100644
>> > --- a/drivers/net/bonding/bond_main.c
>> > +++ b/drivers/net/bonding/bond_main.c
>> > @@ -5146,6 +5146,30 @@ static struct slave
>> > *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>> >         unsigned int count;
>> >         u32 hash;
>> > 
>> > +       if (skb->sk) {
>> > +               int napi_id = skb->sk->sk_napi_id;
>> > +               struct net_device *dev;
>> > +               int idx;
>> > +
>> > +               rcu_read_lock();
>> > +               dev = dev_get_by_napi_id(napi_id);
>> > +               rcu_read_unlock();
>> > +
>> > +               if (!dev)
>> > +                       goto hash;
>> > +
>> > +               count = slaves ? READ_ONCE(slaves->count) : 0;
>> > +               if (unlikely(!count))
>> > +                       return NULL;
>> > +
>> > +               for (idx = 0; idx < count; idx++) {
>> > +                       slave = slaves->arr[idx];
>> > +                       if (slave->dev->ifindex == dev->ifindex)
>> > +                               return slave;
>> > +               }
>> > +       }
>> > +
>> > +hash:
>> >         hash = bond_xmit_hash(bond, skb);
>> >         count = slaves ? READ_ONCE(slaves->count) : 0;
>> >         if (unlikely(!count))
>> > =========================================================================
>> > 
>> > If we make this as a configurable bonding option, would this be an
>> > acceptable solution to accelerate numa-aware apps?
>> 
>> 	Assuming for the moment this is for "regular" network devices
>> installed one per NUMA node, why do this in bonding instead of at a
>> higher layer (multiple subnets or ECMP, for example)?
>> 
>> 	Is the intent here that the bond would aggregate its interfaces
>> via LACP with the peer being some kind of cross-chassis link aggregation
>> (MLAG, et al)?

No.

>
>Yes. basic LACP bonding setup. There could be multiple peers connecting to
>the server via switch providing LACP based link aggregation. No cross-chassis
>MLAG.

LACP does not make any sense when you have only a single physical port.
That applies to the ECMP mentioned above too, I believe.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-23  9:40                   ` Jiri Pirko
@ 2024-02-23 23:56                     ` Samudrala, Sridhar
  2024-02-24 12:48                       ` Jiri Pirko
  0 siblings, 1 reply; 37+ messages in thread
From: Samudrala, Sridhar @ 2024-02-23 23:56 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jay Vosburgh, Jakub Kicinski, Greg Kroah-Hartman, Tariq Toukan,
	Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky



On 2/23/2024 3:40 AM, Jiri Pirko wrote:
> Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote:
>>
>>
>> On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>>> Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>>>> On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>>>>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>>>>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>>>>>> Greg, we have a feature here where a single device of class net has
>>>>>>> multiple "bus parents". We used to have one attr under class net
>>>>>>> (device) which is a link to the bus parent. Now we either need to add
>>>>>>> more or not bother with the linking of the whole device. Is there any
>>>>>>> precedent / preference for solving this from the device model
>>>>>>> perspective?
>>>>>>
>>>>>> How, logically, can a netdevice be controlled properly from 2 parent
>>>>>> devices on two different busses?  How is that even possible from a
>>>>>> physical point-of-view?  What exact bus types are involved here?
>>>>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>>>>
>>>> Isn't it only 1 networking port with multiple PFs?
>>>>
>>>>> of silicon, tho, so the "slices" can talk to each other internally.
>>>>> The NVRAM configuration tells both endpoints that the user wants
>>>>> them "bonded", when the PCI drivers probe they "find each other"
>>>>> using some cookie or DSN or whatnot. And once they did, they spawn
>>>>> a single netdev.
>>>>>
>>>>>> This "shouldn't" be possible as in the end, it's usually a PCI device
>>>>>> handling this all, right?
>>>>> It's really a special type of bonding of two netdevs. Like you'd bond
>>>>> two ports to get twice the bandwidth. With the twist that the balancing
>>>>> is done on NUMA proximity, rather than traffic hash.
>>>>> Well, plus, the major twist that it's all done magically "for you"
>>>>> in the vendor driver, and the two "lower" devices are not visible.
>>>>> You only see the resulting bond.
>>>>> I personally think that the magic hides as many problems as it
>>>>> introduces and we'd be better off creating two separate netdevs.
>>>>> And then a new type of "device bond" on top. Small win that
>>>>> the "new device bond on top" can be shared code across vendors.
>>>>
>>>> Yes. We have been exploring a small extension to bonding driver to enable
>>>> a single numa-aware multi-threaded application to efficiently utilize
>>>> multiple NICs across numa nodes.
>>>
>>> 	Is this referring to something like the multi-pf under
>>> discussion, or just generically with two arbitrary network devices
>>> installed one each per NUMA node?
>>
>> Normal network devices one per NUMA node
>>
>>>
>>>> Here is an early version of a patch we have been trying and seems to be
>>>> working well.
>>>>
>>>> =========================================================================
>>>> bonding: select tx device based on rx device of a flow
>>>>
>>>> If napi_id is cached in the sk associated with skb, use the
>>>> device associated with napi_id as the transmit device.
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>
>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>> b/drivers/net/bonding/bond_main.c
>>>> index 7a7d584f378a..77e3bf6c4502 100644
>>>> --- a/drivers/net/bonding/bond_main.c
>>>> +++ b/drivers/net/bonding/bond_main.c
>>>> @@ -5146,6 +5146,30 @@ static struct slave
>>>> *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>>>>          unsigned int count;
>>>>          u32 hash;
>>>>
>>>> +       if (skb->sk) {
>>>> +               int napi_id = skb->sk->sk_napi_id;
>>>> +               struct net_device *dev;
>>>> +               int idx;
>>>> +
>>>> +               rcu_read_lock();
>>>> +               dev = dev_get_by_napi_id(napi_id);
>>>> +               rcu_read_unlock();
>>>> +
>>>> +               if (!dev)
>>>> +                       goto hash;
>>>> +
>>>> +               count = slaves ? READ_ONCE(slaves->count) : 0;
>>>> +               if (unlikely(!count))
>>>> +                       return NULL;
>>>> +
>>>> +               for (idx = 0; idx < count; idx++) {
>>>> +                       slave = slaves->arr[idx];
>>>> +                       if (slave->dev->ifindex == dev->ifindex)
>>>> +                               return slave;
>>>> +               }
>>>> +       }
>>>> +
>>>> +hash:
>>>>          hash = bond_xmit_hash(bond, skb);
>>>>          count = slaves ? READ_ONCE(slaves->count) : 0;
>>>>          if (unlikely(!count))
>>>> =========================================================================
>>>>
>>>> If we make this as a configurable bonding option, would this be an
>>>> acceptable solution to accelerate numa-aware apps?
>>>
>>> 	Assuming for the moment this is for "regular" network devices
>>> installed one per NUMA node, why do this in bonding instead of at a
>>> higher layer (multiple subnets or ECMP, for example)?
>>>
>>> 	Is the intent here that the bond would aggregate its interfaces
>>> via LACP with the peer being some kind of cross-chassis link aggregation
>>> (MLAG, et al)?
> 
> No.
> 
>>
>> Yes. basic LACP bonding setup. There could be multiple peers connecting to
>> the server via switch providing LACP based link aggregation. No cross-chassis
>> MLAG.
> 
> LACP does not make any sense, when you have only a single physical port.
> That applies to ECMP mentioned above too I believe.

I meant the setup with 2 regular NICs on 2 NUMA nodes, not the
multi-PF single-port setup.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-23 23:56                     ` Samudrala, Sridhar
@ 2024-02-24 12:48                       ` Jiri Pirko
  0 siblings, 0 replies; 37+ messages in thread
From: Jiri Pirko @ 2024-02-24 12:48 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jay Vosburgh, Jakub Kicinski, Greg Kroah-Hartman, Tariq Toukan,
	Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky

Sat, Feb 24, 2024 at 12:56:52AM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 2/23/2024 3:40 AM, Jiri Pirko wrote:
>> Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@intel.com wrote:
>> > 
>> > 
>> > On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>> > > Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote:
>> > > > On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> > > > > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> > > > > > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>> > > > > > > Greg, we have a feature here where a single device of class net has
>> > > > > > > multiple "bus parents". We used to have one attr under class net
>> > > > > > > (device) which is a link to the bus parent. Now we either need to add
>> > > > > > > more or not bother with the linking of the whole device. Is there any
>> > > > > > > precedent / preference for solving this from the device model
>> > > > > > > perspective?
>> > > > > > 
>> > > > > > How, logically, can a netdevice be controlled properly from 2 parent
>> > > > > > devices on two different busses?  How is that even possible from a
>> > > > > > physical point-of-view?  What exact bus types are involved here?
>> > > > > Two PCIe buses, two endpoints, two networking ports. It's one piece
>> > > > 
>> > > > Isn't it only 1 networking port with multiple PFs?
>> > > > 
>> > > > > of silicon, tho, so the "slices" can talk to each other internally.
>> > > > > The NVRAM configuration tells both endpoints that the user wants
>> > > > > them "bonded", when the PCI drivers probe they "find each other"
>> > > > > using some cookie or DSN or whatnot. And once they did, they spawn
>> > > > > a single netdev.
>> > > > > 
>> > > > > > This "shouldn't" be possible as in the end, it's usually a PCI device
>> > > > > > handling this all, right?
>> > > > > It's really a special type of bonding of two netdevs. Like you'd bond
>> > > > > two ports to get twice the bandwidth. With the twist that the balancing
>> > > > > is done on NUMA proximity, rather than traffic hash.
>> > > > > Well, plus, the major twist that it's all done magically "for you"
>> > > > > in the vendor driver, and the two "lower" devices are not visible.
>> > > > > You only see the resulting bond.
>> > > > > I personally think that the magic hides as many problems as it
>> > > > > introduces and we'd be better off creating two separate netdevs.
>> > > > > And then a new type of "device bond" on top. Small win that
>> > > > > the "new device bond on top" can be shared code across vendors.
>> > > > 
>> > > > Yes. We have been exploring a small extension to bonding driver to enable
>> > > > a single numa-aware multi-threaded application to efficiently utilize
>> > > > multiple NICs across numa nodes.
>> > > 
>> > > 	Is this referring to something like the multi-pf under
>> > > discussion, or just generically with two arbitrary network devices
>> > > installed one each per NUMA node?
>> > 
>> > Normal network devices one per NUMA node
>> > 
>> > > 
>> > > > Here is an early version of a patch we have been trying and seems to be
>> > > > working well.
>> > > > 
>> > > > =========================================================================
>> > > > bonding: select tx device based on rx device of a flow
>> > > > 
>> > > > If napi_id is cached in the sk associated with skb, use the
>> > > > device associated with napi_id as the transmit device.
>> > > > 
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > 
>> > > > diff --git a/drivers/net/bonding/bond_main.c
>> > > > b/drivers/net/bonding/bond_main.c
>> > > > index 7a7d584f378a..77e3bf6c4502 100644
>> > > > --- a/drivers/net/bonding/bond_main.c
>> > > > +++ b/drivers/net/bonding/bond_main.c
>> > > > @@ -5146,6 +5146,30 @@ static struct slave
>> > > > *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>> > > >          unsigned int count;
>> > > >          u32 hash;
>> > > > 
>> > > > +       if (skb->sk) {
>> > > > +               int napi_id = skb->sk->sk_napi_id;
>> > > > +               struct net_device *dev;
>> > > > +               int idx;
>> > > > +
>> > > > +               rcu_read_lock();
>> > > > +               dev = dev_get_by_napi_id(napi_id);
>> > > > +               rcu_read_unlock();
>> > > > +
>> > > > +               if (!dev)
>> > > > +                       goto hash;
>> > > > +
>> > > > +               count = slaves ? READ_ONCE(slaves->count) : 0;
>> > > > +               if (unlikely(!count))
>> > > > +                       return NULL;
>> > > > +
>> > > > +               for (idx = 0; idx < count; idx++) {
>> > > > +                       slave = slaves->arr[idx];
>> > > > +                       if (slave->dev->ifindex == dev->ifindex)
>> > > > +                               return slave;
>> > > > +               }
>> > > > +       }
>> > > > +
>> > > > +hash:
>> > > >          hash = bond_xmit_hash(bond, skb);
>> > > >          count = slaves ? READ_ONCE(slaves->count) : 0;
>> > > >          if (unlikely(!count))
>> > > > =========================================================================
>> > > > 
>> > > > If we make this as a configurable bonding option, would this be an
>> > > > acceptable solution to accelerate numa-aware apps?
>> > > 
>> > > 	Assuming for the moment this is for "regular" network devices
>> > > installed one per NUMA node, why do this in bonding instead of at a
>> > > higher layer (multiple subnets or ECMP, for example)?
>> > > 
>> > > 	Is the intent here that the bond would aggregate its interfaces
>> > > via LACP with the peer being some kind of cross-chassis link aggregation
>> > > (MLAG, et al)?
>> 
>> No.
>> 
>> > 
>> > Yes. basic LACP bonding setup. There could be multiple peers connecting to
>> > the server via switch providing LACP based link aggregation. No cross-chassis
>> > MLAG.
>> 
>> LACP does not make any sense, when you have only a single physical port.
>> That applies to ECMP mentioned above too I believe.
>
>I meant for the 2 regular NICs on 2 numa node setup, not for multi-PF 1 port
>setup.

Okay, not sure how it is related to this thread then :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-23  9:36               ` Jiri Pirko
@ 2024-02-28  2:06                 ` Jakub Kicinski
  2024-02-28  8:13                   ` Jiri Pirko
  0 siblings, 1 reply; 37+ messages in thread
From: Jakub Kicinski @ 2024-02-28  2:06 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Samudrala, Sridhar, Greg Kroah-Hartman, Tariq Toukan,
	Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky, jay.vosburgh

On Fri, 23 Feb 2024 10:36:25 +0100 Jiri Pirko wrote:
> >> It's really a special type of bonding of two netdevs. Like you'd bond
> >> two ports to get twice the bandwidth. With the twist that the balancing
> >> is done on NUMA proximity, rather than traffic hash.
> >> 
> >> Well, plus, the major twist that it's all done magically "for you"
> >> in the vendor driver, and the two "lower" devices are not visible.
> >> You only see the resulting bond.
> >> 
> >> I personally think that the magic hides as many problems as it
> >> introduces and we'd be better off creating two separate netdevs.
> >> And then a new type of "device bond" on top. Small win that
> >> the "new device bond on top" can be shared code across vendors.  
> >
> >Yes. We have been exploring a small extension to bonding driver to enable a
> >single numa-aware multi-threaded application to efficiently utilize multiple
> >NICs across numa nodes.  
> 
> Bonding was my immediate response when we discussed this internally for
> the first time. But I had to eventually admit it is probably not that
> suitable in this case, here's why:
> 1) there are no 2 physical ports, only one.

Right, sorry, number of PFs matches number of ports for each bus.
But it's not necessarily a deal breaker - it's similar to a multi-host
device. We also have multiple netdevs and PCIe links; they just go to
different hosts rather than different NUMA nodes on one host.

> 2) it is basically a matter of device layout/provisioning that this
>    feature should be enabled, not user configuration.

We can still auto-instantiate it, not a deal breaker.

I'm not sure you're right in that assumption, tho. At Meta, we support
container sizes ranging from a few CPUs to multiple NUMA nodes. Each
NUMA node may have its own NIC, and the orchestration needs to stitch
and un-stitch NICs depending on whether the cores were allocated to
small containers or a huge one.

So it would be _easier_ to deal with multiple netdevs. The
orchestration layer already understands the netdev <> NUMA mapping; it
does not understand multi-NUMA netdevs, or how to match up queues to
nodes.
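
For illustration, that netdev <> NUMA mapping is already exposed via
sysfs today; a minimal sketch (the interface name is just an example):

/* Sketch: read the NUMA node of a netdev's parent device. Returns -1
 * when the node is unknown or the attribute is missing.
 */
#include <stdio.h>

static int netdev_numa_node(const char *ifname)
{
        char path[128];
        int node = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/class/net/%s/device/numa_node", ifname);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%d", &node) != 1)
                node = -1;
        fclose(f);
        return node;
}

int main(void)
{
        printf("eth0 -> NUMA node %d\n", netdev_numa_node("eth0"));
        return 0;
}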

> 3) other subsystems like RDMA would benefit the same feature, so this
>    int not netdev specific in general.

Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.

Anyway, back to the initial question - from Greg's reply I'm guessing
there's no precedent for doing such things in the device model either.
So we're on our own.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-28  2:06                 ` Jakub Kicinski
@ 2024-02-28  8:13                   ` Jiri Pirko
  2024-02-28 17:06                     ` Jakub Kicinski
  0 siblings, 1 reply; 37+ messages in thread
From: Jiri Pirko @ 2024-02-28  8:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Samudrala, Sridhar, Greg Kroah-Hartman, Tariq Toukan,
	Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky, jay.vosburgh

Wed, Feb 28, 2024 at 03:06:19AM CET, kuba@kernel.org wrote:
>On Fri, 23 Feb 2024 10:36:25 +0100 Jiri Pirko wrote:
>> >> It's really a special type of bonding of two netdevs. Like you'd bond
>> >> two ports to get twice the bandwidth. With the twist that the balancing
>> >> is done on NUMA proximity, rather than traffic hash.
>> >> 
>> >> Well, plus, the major twist that it's all done magically "for you"
>> >> in the vendor driver, and the two "lower" devices are not visible.
>> >> You only see the resulting bond.
>> >> 
>> >> I personally think that the magic hides as many problems as it
>> >> introduces and we'd be better off creating two separate netdevs.
>> >> And then a new type of "device bond" on top. Small win that
>> >> the "new device bond on top" can be shared code across vendors.  
>> >
>> >Yes. We have been exploring a small extension to bonding driver to enable a
>> >single numa-aware multi-threaded application to efficiently utilize multiple
>> >NICs across numa nodes.  
>> 
>> Bonding was my immediate response when we discussed this internally for
>> the first time. But I had to eventually admit it is probably not that
>> suitable in this case, here's why:
>> 1) there are no 2 physical ports, only one.
>
>Right, sorry, number of PFs matches number of ports for each bus.
>But it's not necessarily a deal breaker - it's similar to a multi-host
>device. We also have multiple netdevs and PCIe links, they just go to
>different host rather than different NUMA nodes on one host.

That is a different scenario. You have multiple hosts and a switch
between them and the physical port. Yeah, it might be an invisible
switch, but there still is one. On a DPU/smartnic, it is visible and
configurable.


>
>> 2) it is basically a matter of device layout/provisioning that this
>>    feature should be enabled, not user configuration.
>
>We can still auto-instantiate it, not a deal breaker.

"Auto-instantiate" in meating of userspace orchestration deamon,
not kernel, that's what you mean?


>
>I'm not sure you're right in that assumption, tho. At Meta, we support
>container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA
>node may have it's own NIC, and the orchestration needs to stitch and
>un-stitch NICs depending on whether the cores were allocated to small
>containers or a huge one.

Yeah, but still, there is one physical port per NIC-NUMA-node pair.
Correct? Does the orchestration set up a bond on top of them or some
other master device, or let the container use them independently?


>
>So it would be _easier_ to deal with multiple netdevs. Orchestration
>layer already understands netdev <> NUMA mapping, it does not understand
>multi-NUMA netdevs, and how to match up queues to nodes.
>
>> 3) other subsystems like RDMA would benefit the same feature, so this
>>    int not netdev specific in general.
>
>Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.

Not really. It's just that all use cases need to be considered, not
only netdev.


>
>Anyway, back to the initial question - from Greg's reply I'm guessing
>there's no precedent for doing such things in the device model either.
>So we're on our own.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-28  8:13                   ` Jiri Pirko
@ 2024-02-28 17:06                     ` Jakub Kicinski
  2024-02-28 17:43                       ` Jakub Kicinski
  2024-02-29  8:21                       ` Jiri Pirko
  0 siblings, 2 replies; 37+ messages in thread
From: Jakub Kicinski @ 2024-02-28 17:06 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Samudrala, Sridhar, Greg Kroah-Hartman, Tariq Toukan,
	Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky, jay.vosburgh

On Wed, 28 Feb 2024 09:13:57 +0100 Jiri Pirko wrote:
> >> 2) it is basically a matter of device layout/provisioning that this
> >>    feature should be enabled, not user configuration.  
> >
> >We can still auto-instantiate it, not a deal breaker.  
> 
> "Auto-instantiate" in meating of userspace orchestration deamon,
> not kernel, that's what you mean?

Either kernel, or pass some hints to a user space agent, like networkd
and have it handle the creation. We have precedent for "kernel side
bonding" with the VF<>virtio bonding thing.

> >I'm not sure you're right in that assumption, tho. At Meta, we support
> >container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA
> >node may have it's own NIC, and the orchestration needs to stitch and
> >un-stitch NICs depending on whether the cores were allocated to small
> >containers or a huge one.  
> 
> Yeah, but still, there is one physical port for NIC-numanode pair.

Well, today there is.

> Correct? Does the orchestration setup a bond on top of them or some other
> master device or let the container use them independently?

Just multi-nexthop routing and binding sockets to the netdev (with
some BPF magic, I think).
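
As a minimal illustration of the socket-binding half (assuming a plain
SO_BINDTODEVICE approach; the BPF/nexthop pieces are not shown and the
interface name is only an example):

/* Sketch: pin a socket's traffic to one of the per-NUMA-node netdevs. */
#include <string.h>
#include <sys/socket.h>

static int bind_sock_to_netdev(int fd, const char *ifname)
{
        return setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                          ifname, strlen(ifname));
}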

> >So it would be _easier_ to deal with multiple netdevs. Orchestration
> >layer already understands netdev <> NUMA mapping, it does not understand
> >multi-NUMA netdevs, and how to match up queues to nodes.
> >  
> >> 3) other subsystems like RDMA would benefit the same feature, so this
> >>    int not netdev specific in general.  
> >
> >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.  
> 
> Not really. It's just needed to consider all usecases, not only netdev.

All use cases or lowest common denominator, depends on priorities.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-28 17:06                     ` Jakub Kicinski
@ 2024-02-28 17:43                       ` Jakub Kicinski
  2024-03-02  7:31                         ` Saeed Mahameed
  2024-02-29  8:21                       ` Jiri Pirko
  1 sibling, 1 reply; 37+ messages in thread
From: Jakub Kicinski @ 2024-02-28 17:43 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Samudrala, Sridhar, Greg Kroah-Hartman, Tariq Toukan,
	Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky, jay.vosburgh

On Wed, 28 Feb 2024 09:06:04 -0800 Jakub Kicinski wrote:
> > >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.    
> > 
> > Not really. It's just needed to consider all usecases, not only netdev.  
> 
> All use cases or lowest common denominator, depends on priorities.

To be clear, I'm not trying to shut down this proposal; I think both
approaches have disadvantages. This one is better for RDMA and iperf,
the explicit netdevs are better for more advanced TCP apps. All I want
is clear docs so users are not confused, and vendors don't diverge
pointlessly.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-28 17:06                     ` Jakub Kicinski
  2024-02-28 17:43                       ` Jakub Kicinski
@ 2024-02-29  8:21                       ` Jiri Pirko
  2024-02-29 14:34                         ` Jakub Kicinski
  1 sibling, 1 reply; 37+ messages in thread
From: Jiri Pirko @ 2024-02-29  8:21 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Samudrala, Sridhar, Greg Kroah-Hartman, Tariq Toukan,
	Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky, jay.vosburgh

Wed, Feb 28, 2024 at 06:06:04PM CET, kuba@kernel.org wrote:
>On Wed, 28 Feb 2024 09:13:57 +0100 Jiri Pirko wrote:
>> >> 2) it is basically a matter of device layout/provisioning that this
>> >>    feature should be enabled, not user configuration.  
>> >
>> >We can still auto-instantiate it, not a deal breaker.  
>> 
>> "Auto-instantiate" in meating of userspace orchestration deamon,
>> not kernel, that's what you mean?
>
>Either kernel, or pass some hints to a user space agent, like networkd
>and have it handle the creation. We have precedent for "kernel side
>bonding" with the VF<>virtio bonding thing.
>
>> >I'm not sure you're right in that assumption, tho. At Meta, we support
>> >container sizes ranging from few CPUs to multiple NUMA nodes. Each NUMA
>> >node may have it's own NIC, and the orchestration needs to stitch and
>> >un-stitch NICs depending on whether the cores were allocated to small
>> >containers or a huge one.  
>> 
>> Yeah, but still, there is one physical port for NIC-numanode pair.
>
>Well, today there is.
>
>> Correct? Does the orchestration setup a bond on top of them or some other
>> master device or let the container use them independently?
>
>Just multi-nexthop routing and binding sockets to the netdev (with
>some BPF magic, I think).

Yeah, so basically 2 independent ports, 2 netdevices working
independently. Not sure I see the parallel to the subject we are
discussing here :/


>
>> >So it would be _easier_ to deal with multiple netdevs. Orchestration
>> >layer already understands netdev <> NUMA mapping, it does not understand
>> >multi-NUMA netdevs, and how to match up queues to nodes.
>> >  
>> >> 3) other subsystems like RDMA would benefit the same feature, so this
>> >>    int not netdev specific in general.  
>> >
>> >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.  
>> 
>> Not really. It's just needed to consider all usecases, not only netdev.
>
>All use cases or lowest common denominator, depends on priorities.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-29  8:21                       ` Jiri Pirko
@ 2024-02-29 14:34                         ` Jakub Kicinski
  0 siblings, 0 replies; 37+ messages in thread
From: Jakub Kicinski @ 2024-02-29 14:34 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Samudrala, Sridhar, Greg Kroah-Hartman, Tariq Toukan,
	Saeed Mahameed, David S. Miller, Paolo Abeni, Eric Dumazet,
	Saeed Mahameed, netdev, Tariq Toukan, Gal Pressman,
	Leon Romanovsky, jay.vosburgh

On Thu, 29 Feb 2024 09:21:26 +0100 Jiri Pirko wrote:
> >> Correct? Does the orchestration setup a bond on top of them or some other
> >> master device or let the container use them independently?  
> >
> >Just multi-nexthop routing and binding sockets to the netdev (with
> >some BPF magic, I think).  
> 
> Yeah, so basically 2 independent ports, 2 netdevices working
> independently. Not sure I see the parallel to the subject we discuss
> here :/

From the user's perspective it's almost exactly the same. The user
wants NUMA nodes to have a way to reach the network without crossing
the interconnect. Whether you do that with 2 200G NICs or 1 400G NIC
connected to two nodes is an implementation detail.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev
  2024-02-28 17:43                       ` Jakub Kicinski
@ 2024-03-02  7:31                         ` Saeed Mahameed
  0 siblings, 0 replies; 37+ messages in thread
From: Saeed Mahameed @ 2024-03-02  7:31 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, Samudrala, Sridhar, Greg Kroah-Hartman, Tariq Toukan,
	David S. Miller, Paolo Abeni, Eric Dumazet, Saeed Mahameed,
	netdev, Tariq Toukan, Gal Pressman, Leon Romanovsky,
	jay.vosburgh

On 28 Feb 09:43, Jakub Kicinski wrote:
>On Wed, 28 Feb 2024 09:06:04 -0800 Jakub Kicinski wrote:
>> > >Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.
>> >
>> > Not really. It's just needed to consider all usecases, not only netdev.
>>
>> All use cases or lowest common denominator, depends on priorities.
>
>To be clear, I'm not trying to shut down this proposal, I think both
>have disadvantages. This one is better for RDMA and iperf, the explicit
>netdevs are better for more advanced TCP apps. All I want is clear docs
>so users are not confused, and vendors don't diverge pointlessly.

Just posted v4 with updated documentation that should cover the basic
feature, which we believe is the minimum that all vendors should
implement. The mlx5 implementation won't change much if we later decide
to move to some sort of a "generic netdev" interface. We don't agree it
should be a new kind of bond, as bond was meant for actual link
aggregation of multi-port devices. But again, the mlx5 implementation
will remain the same regardless of any future extension of the feature;
the defaults are well documented and carefully selected for the best
user expectations.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2024-03-02  7:31 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-15  3:07 [pull request][net-next V3 00/15] mlx5 socket direct (Multi-PF) Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 01/15] net/mlx5: Add MPIR bit in mcam_access_reg Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 02/15] net/mlx5: SD, Introduce SD lib Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 03/15] net/mlx5: SD, Implement basic query and instantiation Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 04/15] net/mlx5: SD, Implement devcom communication and primary election Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 05/15] net/mlx5: SD, Implement steering for primary and secondaries Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 06/15] net/mlx5: SD, Add informative prints in kernel log Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 07/15] net/mlx5: SD, Add debugfs Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 08/15] net/mlx5e: Create single netdev per SD group Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 09/15] net/mlx5e: Create EN core HW resources for all secondary devices Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 10/15] net/mlx5e: Let channels be SD-aware Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 11/15] net/mlx5e: Support cross-vhca RSS Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 12/15] net/mlx5e: Support per-mdev queue counter Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 13/15] net/mlx5e: Block TLS device offload on combined SD netdev Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 14/15] net/mlx5: Enable SD feature Saeed Mahameed
2024-02-15  3:08 ` [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev Saeed Mahameed
2024-02-16  5:23   ` Jakub Kicinski
2024-02-19 15:26     ` Tariq Toukan
2024-02-21  1:33       ` Jakub Kicinski
2024-02-21  2:10         ` Saeed Mahameed
2024-02-22  7:51         ` Greg Kroah-Hartman
2024-02-22 23:00           ` Jakub Kicinski
2024-02-23  1:23             ` Samudrala, Sridhar
2024-02-23  2:05               ` Jay Vosburgh
2024-02-23  5:00                 ` Samudrala, Sridhar
2024-02-23  9:40                   ` Jiri Pirko
2024-02-23 23:56                     ` Samudrala, Sridhar
2024-02-24 12:48                       ` Jiri Pirko
2024-02-23  9:36               ` Jiri Pirko
2024-02-28  2:06                 ` Jakub Kicinski
2024-02-28  8:13                   ` Jiri Pirko
2024-02-28 17:06                     ` Jakub Kicinski
2024-02-28 17:43                       ` Jakub Kicinski
2024-03-02  7:31                         ` Saeed Mahameed
2024-02-29  8:21                       ` Jiri Pirko
2024-02-29 14:34                         ` Jakub Kicinski
2024-02-19 18:04   ` Jiri Pirko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.